no code implementations • 3 Apr 2024 • Yichuan Deng, Zhao Song, Chiwun Yang
The computational intensity of Large Language Models (LLMs) is a critical bottleneck, primarily due to the $O(n^2)$ complexity of the attention mechanism in transformer architectures.
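The quadratic term comes from materializing the $n \times n$ score matrix $QK^\top$. A minimal sketch of standard single-head attention illustrating this cost (array shapes and names are illustrative, not the paper's notation):

```python
import numpy as np

def attention(Q, K, V):
    """Standard single-head attention; forming the n x n score
    matrix Q K^T is the source of the O(n^2) cost."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # O(n^2 d) time, O(n^2) memory
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # another O(n^2 d) product

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # materializes a 1024 x 1024 weight matrix
```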
no code implementations • 2 Feb 2024 • Yichuan Deng, Zhao Song, Chiwun Yang
Building on SGD, prior work has proposed many algorithms that improve convergence speed and generalization in stochastic optimization, such as SGD with momentum (SGDm), AdaGrad, and Adam.
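For reference, the standard update rules of two of these variants, as a minimal sketch (hyperparameter values are common defaults, not tuned choices from the paper):

```python
import numpy as np

def sgd_momentum(w, g, state, lr=1e-2, beta=0.9):
    """SGDm: v <- beta * v + g, then w <- w - lr * v."""
    v = beta * state.get("v", np.zeros_like(w)) + g
    state["v"] = v
    return w - lr * v

def adam(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first- and second-moment estimates."""
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", np.zeros_like(w)) + (1 - b1) * g
    v = b2 * state.get("v", np.zeros_like(w)) + (1 - b2) * g**2
    state.update(m=m, v=v, t=t)
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# usage: minimize ||w||^2 with Adam (gradient is 2w)
w, state = np.array([1.0, -2.0]), {}
for _ in range(200):
    w = adam(w, 2 * w, state)
```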
no code implementations • 19 Oct 2023 • Yichuan Deng, Zhao Song, Shenghao Xie, Chiwun Yang
In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks.
no code implementations • 18 Oct 2023 • Yichuan Deng, Zhao Song, Tianyi Zhou
Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks.
no code implementations • 21 Aug 2023 • Yichuan Deng, Michalis Mamakos, Zhao Song
Thus, maximizing the total reward requires learning not only models of the reward and the resource consumption, but also the cluster memberships.
no code implementations • 16 Aug 2023 • Yichuan Deng, Zhao Song, Shenghao Xie
The softmax unit and the ReLU unit are the key structures in attention computation.
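A minimal sketch of the two units as row-normalizers over attention scores (the normalization convention for the ReLU variant is an assumption for illustration):

```python
import numpy as np

def softmax_unit(scores):
    """exp-then-normalize, as in standard attention."""
    z = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def relu_unit(scores):
    """ReLU variant: clip negatives, then row-normalize
    (guarding against all-zero rows)."""
    z = np.maximum(scores, 0.0)
    denom = z.sum(axis=-1, keepdims=True)
    return z / np.where(denom > 0, denom, 1.0)
```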
no code implementations • 17 Jul 2023 • Yichuan Deng, Zhihang Li, Sridhar Mahadevan, Zhao Song
We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale LLMs.
no code implementations • 1 Jun 2023 • Yichuan Deng, Zhao Song, Junze Yin
Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data.
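As a concrete illustration, a rank-$r$ CP decomposition represents a third-order tensor as a sum of $r$ outer products, storing it far more compactly (a minimal example, not the paper's algorithm):

```python
import numpy as np

r = 2  # CP rank: T = sum_{i=1}^r a_i (outer) b_i (outer) c_i
a, b, c = np.random.randn(r, 4), np.random.randn(r, 5), np.random.randn(r, 6)
T = sum(np.einsum('i,j,k->ijk', a[i], b[i], c[i]) for i in range(r))
print(T.shape)  # (4, 5, 6): 120 entries stored as r*(4+5+6) = 30 numbers
```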
no code implementations • 20 Apr 2023 • Yichuan Deng, Zhihang Li, Zhao Song
One of the key computations in LLMs is the softmax unit.
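One common way to formalize the softmax unit is as a softmax regression problem; a hedged sketch of such an objective (the exact formulation studied in the paper may differ):

```python
import numpy as np

def softmax_regression_loss(A, x, b):
    """Hypothetical objective L(x) = || exp(Ax)/<exp(Ax), 1> - b ||_2^2.
    Shifting by the max leaves the normalized vector unchanged and
    avoids overflow."""
    u = np.exp(A @ x - np.max(A @ x))
    return np.sum((u / u.sum() - b) ** 2)
```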
no code implementations • 13 Apr 2023 • Yichuan Deng, Yeqi Gao, Zhao Song
For the tensor classical (CP) rank, the Tucker rank, and the tensor train rank, this problem has been well studied in [Song, Woodruff, Zhong SODA 2019].
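(Recall the standard definitions: the CP rank of a third-order tensor $A$ is the smallest $r$ with $A = \sum_{i=1}^{r} u_i \otimes v_i \otimes w_i$; the Tucker rank is the tuple of ranks of the three mode unfoldings $A_{(1)}, A_{(2)}, A_{(3)}$; and the tensor train rank is the tuple of ranks of the sequential unfoldings that define the TT cores.)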
no code implementations • 10 Apr 2023 • Yichuan Deng, Sridhar Mahadevan, Zhao Song
It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, succeeds with probability $1-\delta$, and chooses $m = O(n \log(n/\delta))$.
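The $\mathrm{nnz}(X)$ term is characteristic of sparse sketching transforms such as CountSketch; whether this paper uses that particular sketch is an assumption, but a minimal sketch of the primitive:

```python
import numpy as np

def countsketch_apply(X, m, seed=0):
    """Apply a CountSketch matrix S (one random +-1 per column of S)
    to X: hash each row of X to one of m buckets and add it with a
    random sign. With sparse rows this runs in O(nnz(X)) time;
    it is written densely here for simplicity."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    h = rng.integers(0, m, size=N)       # bucket index per row
    s = rng.choice([-1.0, 1.0], size=N)  # random sign per row
    SX = np.zeros((m, d))
    for i in range(N):
        SX[h[i]] += s[i] * X[i]
    return SX
```

After reducing to $m = O(n \log(n/\delta))$ rows, solving the small sketched problem with fast matrix multiplication accounts for the $n^{\omega}$ term.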
no code implementations • 8 Mar 2023 • Yichuan Deng, Zhao Song, Zifan Wang, Han Zhang
The kernel method, which is commonly used in learning algorithms such as Support Vector Machines (SVMs), has also been applied in PCA algorithms.
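A minimal kernel PCA sketch (RBF kernel; the `gamma` parameter and the centering step follow the textbook formulation, not necessarily this paper's algorithm):

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Textbook kernel PCA: form the RBF Gram matrix, double-center
    it, and project the data onto the top-k eigenvectors."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                    # center in feature space
    vals, vecs = np.linalg.eigh(Kc)   # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]  # top-k components
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```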
no code implementations • 9 Aug 2022 • Yichuan Deng, Hang Hu, Zhao Song, Omri Weinstein, Danyang Zhuo
The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI).
no code implementations • 19 Feb 2022 • Rui Duan, Hui Deng, Mao Tian, Yichuan Deng, Jiarui Lin
In this manner, this research contributes a large-scale image dataset for developing deep learning-based object detection methods in the construction industry and establishes a performance benchmark for evaluating such algorithms in this area.