1 code implementation • EMNLP 2021 • Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael Mahoney
Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14.
no code implementations • 21 Mar 2024 • Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer
The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs.
1 code implementation • 5 Mar 2024 • Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang Wang
To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle the relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead.
2 code implementations • 25 Jan 2024 • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference.
2 code implementations • 14 Dec 2023 • Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao
With our design, FP6 can become a promising alternative to the current 4-bit quantization methods used in LLMs.
no code implementations • 26 Oct 2023 • Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He
Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference.
2 code implementations • 25 Sep 2023 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms.
no code implementations • 2 Sep 2023 • Fengxiang Bie, Yibo Yang, Zhongzhu Zhou, Adam Ghanem, Minjia Zhang, Zhewei Yao, Xiaoxia Wu, Connor Holmes, Pareesa Golnari, David A. Clifton, Yuxiong He, DaCheng Tao, Shuaiwen Leon Song
Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions.
1 code implementation • 2 Aug 2023 • Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance.
1 code implementation • 19 Jul 2023 • Xiaoxia Wu, Zhewei Yao, Yuxiong He
In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge.
no code implementations • 16 May 2023 • Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He
This study examines the impact of optimizing the Stable Diffusion (SD) guided inference pipeline.
2 code implementations • 15 Mar 2023 • Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
no code implementations • 13 Mar 2023 • Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He
The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs).
1 code implementation • 27 Jan 2023 • Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He
Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost.
1 code implementation • 7 Dec 2022 • Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He
Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities.
1 code implementation • 17 Nov 2022 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He
Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP.
1 code implementation • 29 Jul 2022 • Yuxin Ma, Ping Gong, Jun Yi, Zhewei Yao, Cheng Li, Yuxiong He, Feng Yan
We identify the main accuracy impact factors in graph feature quantization and theoretically prove that BiFeat training converges to a network where the loss is within $\epsilon$ of the optimal loss of uncompressed network.
1 code implementation • 4 Jun 2022 • Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.
3 code implementations • 4 Jun 2022 • Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.
3 code implementations • 14 Jan 2022 • Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model.
1 code implementation • 8 Sep 2021 • Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael W. Mahoney
Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14.
4 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
Ranked #4 on Vision and Language Navigation on RxR (using extra training data)
1 code implementation • 30 May 2021 • Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He
Moreover, in order to reduce hyperparameter tuning, a novel adaptive regularization coefficient is deployed to control the regularization penalty adaptively.
4 code implementations • 29 Apr 2021 • Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael W. Mahoney, Joseph E. Gonzalez
On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss.
1 code implementation • 31 Mar 2021 • Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer
End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks.
no code implementations • 25 Mar 2021 • Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks.
1 code implementation • 22 Jan 2021 • Shixing Yu, Zhewei Yao, Amir Gholami, Zhen Dong, Sehoon Kim, Michael W Mahoney, Kurt Keutzer
To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric for structured pruning.
4 code implementations • 5 Jan 2021 • Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks.
1 code implementation • 20 Nov 2020 • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
1 code implementation • NeurIPS 2020 • Jianfei Chen, Yu Gai, Zhewei Yao, Michael W. Mahoney, Joseph E. Gonzalez
We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance.
Ranked #9 on Semantic Textual Similarity on STS Benchmark
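The unbiasedness claim in the entry above can be illustrated with a toy quantizer. The sketch below uses stochastic rounding, one common way to build a quantizer whose expected output equals its input; whether it matches the quantizer used in the paper is an assumption, and the names `stochastic_round` and `mean_of_rounds` are hypothetical helpers introduced only for this illustration.

```python
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x to a neighbouring integer so that the expected result equals x."""
    lower = int(x // 1)   # floor of x (also correct for negative values)
    frac = x - lower      # fractional part, in [0, 1)
    # Round up with probability equal to the fractional part, otherwise round down.
    return lower + (1 if rng.random() < frac else 0)


def mean_of_rounds(x: float, n: int, seed: int = 0) -> float:
    """Average n independent stochastic roundings of x (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(stochastic_round(x, rng) for _ in range(n)) / n


if __name__ == "__main__":
    # Each sample is 2 or 3, yet the average approaches 2.3 as n grows.
    print(mean_of_rounds(2.3, 100_000))
```

Every individual rounding carries error, but the errors average out, which is the sense in which such an estimator is unbiased; the price is added variance, the quantity the abstract says the paper analyzes.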
1 code implementation • EMNLP 2020 • Qinxin Wang, Hao Tan, Sheng Shen, Michael W. Mahoney, Zhewei Yao
Phrase localization is a task that studies the mapping from textual phrases to regions of an image.
1 code implementation • 26 Aug 2020 • Zhengming Zhang, Yaoqing Yang, Zhewei Yao, Yujun Yan, Joseph E. Gonzalez, Michael W. Mahoney
Replacing BN with the recently-proposed Group Normalization (GN) can reduce gradient diversity and improve test accuracy.
3 code implementations • 1 Jun 2020 • Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN.
1 code implementation • ICML 2020 • Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass.
Ranked #12 on Machine Translation on WMT2014 English-German
3 code implementations • CVPR 2020 • Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5\% of one epoch training time of ResNet50 on ImageNet).
Ranked #1 on Data Free Quantization on CIFAR10 (CIFAR-10 W8A8 Top-1 Accuracy metric)
2 code implementations • 16 Dec 2019 • Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney
To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks.
1 code implementation • NeurIPS 2019 • Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, George Biros
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).
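The correspondence mentioned in the entry above can be written out in a few lines. This is the standard textbook form, with symbols ($z$, $f$, $\theta$, $h$) chosen here for illustration rather than taken from the paper. Start from an ODE on the hidden state $z(t)$ and discretize it with the explicit Euler method and step size $h$:

$$
\frac{dz}{dt} = f\big(z(t), \theta(t)\big)
\quad\Longrightarrow\quad
z_{k+1} = z_k + h\, f(z_k, \theta_k).
$$

Setting $h = 1$ gives $z_{k+1} = z_k + f(z_k, \theta_k)$, which is the update computed by a residual block: the skip connection carries $z_k$ forward and the block's layers compute $f(z_k, \theta_k)$.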
2 code implementations • NeurIPS 2020 • Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
However, the search space for a mixed-precision quantization is exponential in the number of layers.
no code implementations • 12 Sep 2019 • Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further.
Ranked #13 on Semantic Textual Similarity on STS Benchmark
no code implementations • 10 Jun 2019 • Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, Michael Mahoney
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).
no code implementations • 31 May 2019 • Kai Rothauge, Zhewei Yao, Zixi Hu, Michael W. Mahoney
We regard pre-trained residual networks (ResNets) as nonlinear systems and use linearization, a common method used in the qualitative analysis of nonlinear systems, to understand the behavior of the networks under small perturbations of the input images.
1 code implementation • ICCV 2019 • Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer
Another challenge is a similar factorial complexity for determining block-wise fine-tuning order when quantizing the model to a target precision.
1 code implementation • 7 Apr 2019 • N. Benjamin Erichson, Zhewei Yao, Michael W. Mahoney
To complement these approaches, we propose a very simple and inexpensive strategy which can be used to "retrofit" a previously-trained network to improve its resilience to adversarial attacks.
no code implementations • 14 Mar 2019 • Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney
In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch.
1 code implementation • 20 Feb 2019 • N. Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L. Brunton, Michael W. Mahoney, J. Nathan Kutz
In many applications, it is important to reconstruct a fluid flow field, or some other high-dimensional state, from limited measurements and limited data.
2 code implementations • CVPR 2019 • Zhewei Yao, Amir Gholami, Peng Xu, Kurt Keutzer, Michael Mahoney
To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently.
no code implementations • 4 Dec 2018 • Norman Mu, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney
We demonstrate the ability of our method to improve language modeling performance by up to 7.91 perplexity and reduce training iterations by up to $61\%$, in addition to its flexibility in enabling snapshot ensembling and use with adversarial training.
Ranked #51 on Natural Language Inference on SNLI
no code implementations • 30 Nov 2018 • Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, Joseph Gonzalez
Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique.
1 code implementation • ICLR 2019 • Zhewei Yao, Amir Gholami, Daiyaan Arfeen, Richard Liaw, Joseph Gonzalez, Kurt Keutzer, Michael Mahoney
Our method exceeds the performance of existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $5\times$, respectively).
6 code implementations • NeurIPS 2018 • Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, Michael W. Mahoney
Extensive experiments on multiple networks show that saddle-points are not the cause of the generalization gap of large batch size training, and the results consistently show that large-batch training converges to points with noticeably higher Hessian spectrum.