Search Results for author: Dimitris Papailiopoulos

Found 51 papers, 25 papers with code

CHAI: Clustered Head Attention for Efficient LLM Inference

no code implementations • 12 Mar 2024 • Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu

We observe that there is a high amount of redundancy across heads on which tokens they pay attention to.

How Well Can Transformers Emulate In-context Newton's Method?

no code implementations • 5 Mar 2024 • Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression.

In-Context Learning, regression

Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

2 code implementations • 6 Feb 2024 • Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, Dimitris Papailiopoulos

State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention.

In-Context Learning, Language Modelling

Looped Transformers are Better at Learning Learning Algorithms

1 code implementation • 21 Nov 2023 • Liu Yang, Kangwook Lee, Robert Nowak, Dimitris Papailiopoulos

Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al.

Mini-Batch Optimization of Contrastive Loss

1 code implementation • 12 Jul 2023 • Jaewoong Cho, Kartik Sreenivasan, Keon Lee, Kyunghoo Mun, Soheun Yi, Jeong-Gwan Lee, Anna Lee, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

Contrastive learning has gained significant attention as a method for self-supervised learning.

Contrastive Learning, Self-Supervised Learning

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

no code implementations • 12 Jul 2023 • Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding.

Teaching Arithmetic to Small Transformers

1 code implementation • 7 Jul 2023 • Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed.

Low-Rank Matrix Completion

Prompted LLMs as Chatbot Modules for Long Open-domain Conversation

1 code implementation • 8 May 2023 • Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, Kangwook Lee

In this paper, we propose MPC (Modular Prompted Chatbot), a new approach for creating high-quality conversational agents without the need for fine-tuning.

Chatbot

Cuttlefish: Low-Rank Model Training without All the Tuning

1 code implementation • 4 May 2023 • Hongyi Wang, Saurabh Agarwal, Pongsakorn U-chupala, Yoshiki Tanaka, Eric P. Xing, Dimitris Papailiopoulos

Cuttlefish leverages the observation that after a few epochs of full-rank training, the stable rank (i.e., an approximation of the true rank) of each layer stabilizes at a constant value.

The Expressive Power of Tuning Only the Normalization Layers

no code implementations • 15 Feb 2023 • Angeliki Giannou, Shashank Rajput, Dimitris Papailiopoulos

Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks.

Looped Transformers as Programmable Computers

1 code implementation • 30 Jan 2023 • Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop.

In-Context Learning

Transformers as Algorithms: Generalization and Stability in In-context Learning

2 code implementations • 17 Jan 2023 • Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak

We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d.

Generalization Bounds, In-Context Learning

PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

no code implementations • 6 Oct 2022 • Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak

Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness.

LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

1 code implementation • 14 Jun 2022 • Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

LIFT does not make any changes to the model architecture or loss function, and it solely relies on the natural language interface, enabling "no-code machine learning with LMs."

BIG-bench Machine Learning, General Classification

Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment

1 code implementation • 23 May 2022 • Tuan Dinh, Jy-yong Sohn, Shashank Rajput, Timothy Ossowski, Yifei Ming, Junjie Hu, Dimitris Papailiopoulos, Kangwook Lee

Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods.

Translation, Word Alignment

Rare Gems: Finding Lottery Tickets at Initialization

1 code implementation • 24 Feb 2022 • Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos

Frankle & Carbin conjecture that we can avoid this by training "lottery tickets", i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy.

GenLabel: Mixup Relabeling using Generative Models

no code implementations • 7 Jan 2022 • Jy-yong Sohn, Liang Shang, Hongxu Chen, Jaekyun Moon, Dimitris Papailiopoulos, Kangwook Lee

Mixup is a data augmentation method that generates new data points by mixing a pair of input data.

Adversarial Robustness, Data Augmentation

Finding Everything within Random Binary Networks

no code implementations • 18 Oct 2021 • Kartik Sreenivasan, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos

A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks.

An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

no code implementations • NeurIPS 2021 • Shashank Rajput, Kartik Sreenivasan, Dimitris Papailiopoulos, Amin Karbasi

Recently, Vershynin (2020) settled a long-standing question by Baum (1988), proving that deep threshold networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points.

Memorization

Pufferfish: Communication-efficient Models At No Extra Cost

1 code implementation • 5 Mar 2021 • Hongyi Wang, Saurabh Agarwal, Dimitris Papailiopoulos

In this work, we present Pufferfish, a communication and computation efficient distributed training framework that incorporates the gradient compression into the model training process via training low-rank, pre-factorized deep networks.

Quantization

On the Utility of Gradient Compression in Distributed Training Systems

1 code implementation • 28 Feb 2021 • Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos

A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training.

Model Compression

Permutation-Based SGD: Is Random Optimal?

1 code implementation • ICLR 2022 • Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos

However, for general strongly convex functions, random permutations are optimal.

Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient

1 code implementation • NeurIPS 2020 • Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.

Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

3 code implementations • 29 Oct 2020 • Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, Dimitris Papailiopoulos

The techniques usually require choosing a static compression ratio, often requiring users to balance the trade-off between model accuracy and per-iteration speedup.

Quantization

Attack of the Tails: Yes, You Really Can Backdoor Federated Learning

2 code implementations • NeurIPS 2020 • Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, Dimitris Papailiopoulos

Due to its decentralized nature, Federated Learning (FL) lends itself to adversarial attacks in the form of backdoors during training.

Fairness, Federated Learning

Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

1 code implementation • 14 Jun 2020 • Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep.

Closing the convergence gap of SGD without replacement

no code implementations • ICML 2020 • Shashank Rajput, Anant Gupta, Dimitris Papailiopoulos

A recent line of breakthrough works on SGD without replacement (SGDo) established an $\mathcal{O}\left(\frac{n}{T^2}\right)$ convergence rate when the function minimized is strongly convex and is a sum of $n$ smooth functions, and an $\mathcal{O}\left(\frac{1}{T^2}+\frac{n^3}{T^3}\right)$ rate for sums of quadratics.

Federated Learning with Matched Averaging

1 code implementation • ICLR 2020 • Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, Yasaman Khazaeni

Federated learning allows edge devices to collaboratively learn a shared model while keeping the training data on device, decoupling the ability to do model training from the need to store the data in the cloud.

Federated Learning

DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

1 code implementation • NeurIPS 2019 • Shashank Rajput, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation.

Bad Global Minima Exist and SGD Can Reach Them

1 code implementation • NeurIPS 2020 • Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas

Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD).

Data Augmentation, Image Classification

Convergence and Margin of Adversarial Training on Separable Data

no code implementations • 22 May 2019 • Zachary Charles, Shashank Rajput, Stephen Wright, Dimitris Papailiopoulos

Our results are derived by showing that adversarial training with gradient updates minimizes a robust version of the empirical risk at a $\mathcal{O}(\ln(t)^2/t)$ rate, despite non-smoothness.

Does Data Augmentation Lead to Positive Margin?

no code implementations • 8 May 2019 • Shashank Rajput, Zhili Feng, Zachary Charles, Po-Ling Loh, Dimitris Papailiopoulos

Data augmentation (DA) is commonly used during model training, as it significantly improves test error and model robustness.

Data Augmentation

MLSys: The New Frontier of Machine Learning Systems

no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar

Machine learning (ML) techniques are enjoying rapidly increasing adoption.

BIG-bench Machine Learning

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

1 code implementation • 28 Jan 2019 • Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding.

A Geometric Perspective on the Transferability of Adversarial Directions

no code implementations • 8 Nov 2018 • Zachary Charles, Harrison Rosenberg, Dimitris Papailiopoulos

We show that these "transferable adversarial directions" are guaranteed to exist for linear separators of a given set, and will exist with high probability for linear classifiers trained on independent sets drawn from the same distribution.

The Effect of Network Width on the Performance of Large-batch Training

no code implementations • NeurIPS 2018 • Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, Paraschos Koutris

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training.

ATOMO: Communication-efficient Learning via Atomic Sparsification

1 code implementation • NeurIPS 2018 • Hongyi Wang, Scott Sievert, Zachary Charles, Shengchao Liu, Stephen Wright, Dimitris Papailiopoulos

We present ATOMO, a general framework for atomic sparsification of stochastic gradients.

Gradient Coding via the Stochastic Block Model

no code implementations • 25 May 2018 • Zachary Charles, Dimitris Papailiopoulos

Gradient descent and its many variants, including mini-batch stochastic gradient descent, form the algorithmic foundation of modern large-scale machine learning.

Stochastic Block Model

DRACO: Byzantine-resilient Distributed Training via Redundant Gradients

1 code implementation • ICML 2018 • Lingjiao Chen, Hongyi Wang, Zachary Charles, Dimitris Papailiopoulos

Distributed model training is vulnerable to Byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS).

Approximate Gradient Coding via Sparse Random Graphs

no code implementations • 17 Nov 2017 • Zachary Charles, Dimitris Papailiopoulos, Jordan Ellenberg

Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time.

Stability and Generalization of Learning Algorithms that Converge to Global Optima

no code implementations • ICML 2018 • Zachary Charles, Dimitris Papailiopoulos

Finally, we show that although our results imply comparable stability for SGD and GD in the PL setting, there exist simple neural networks with multiple local minima where SGD is stable but GD is not.

Generalization Bounds

Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

no code implementations • 18 Jun 2017 • Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size.

Quantization

CYCLADES: Conflict-free Asynchronous Machine Learning

1 code implementation • NeurIPS 2016 • Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, Chris Re, Benjamin Recht

We present CYCLADES, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting.

BIG-bench Machine Learning, Stochastic Optimization

Bipartite Correlation Clustering -- Maximizing Agreements

no code implementations • 9 Mar 2016 • Megasthenis Asteris, Anastasios Kyrillidis, Dimitris Papailiopoulos, Alexandros G. Dimakis

We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters.

Clustering

Speeding Up Distributed Machine Learning Using Codes

no code implementations • 8 Dec 2015 • Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran

We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling.

BIG-bench Machine Learning

Orthogonal NMF through Subspace Exploration

no code implementations • NeurIPS 2015 • Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis

Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks $k$ nonnegative components that jointly capture most of the variance.

Clustering

Sparse PCA via Bipartite Matchings

no code implementations • NeurIPS 2015 • Megasthenis Asteris, Dimitris Papailiopoulos, Anastasios Kyrillidis, Alexandros G. Dimakis

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance.

Perturbed Iterate Analysis for Asynchronous Stochastic Optimization

no code implementations • 24 Jul 2015 • Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan

We demonstrate experimentally on a 16-core machine that the sparse and parallel version of SVRG is in some cases more than four orders of magnitude faster than the standard SVRG algorithm.

Stochastic Optimization

On the Worst-Case Approximability of Sparse PCA

no code implementations • 21 Jul 2015 • Siu On Chan, Dimitris Papailiopoulos, Aviad Rubinstein

It is well known that Sparse PCA (Sparse Principal Component Analysis) is NP-hard to solve exactly on worst-case instances.

Parallel Correlation Clustering on Big Graphs

no code implementations • NeurIPS 2015 • Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, Michael I. Jordan

We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and achieve nearly linear speedups, provably.

Clustering

Provable Deterministic Leverage Score Sampling

no code implementations • 6 Apr 2014 • Dimitris Papailiopoulos, Anastasios Kyrillidis, Christos Boutsidis

We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the corresponding largest leverage scores results in a good low-rank matrix surrogate".
