no code implementations • 13 May 2024 • Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs).
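To make the technique concrete, here is a minimal sketch of greedy speculative decoding. The stand-in models `draft_next` and `target_next` are hypothetical deterministic next-token functions invented for illustration, not a real LLM API or the paper's method; the draft model is wired to disagree occasionally so the rejection path is exercised.

```python
def draft_next(prefix):
    # Cheap draft model (toy): usually agrees with the target,
    # but drifts on every 4th position to exercise rejection.
    return prefix[-1] + 1 if len(prefix) % 4 else prefix[-1] + 2

def target_next(prefix):
    # Expensive target model (toy): the ground truth to match.
    return prefix[-1] + 1

def speculative_step(prefix, k=4):
    """Draft k tokens, keep the longest prefix the target accepts,
    then append one token supplied by the target itself."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    accepted = list(prefix)
    for tok in drafted:
        if target_next(accepted) == tok:
            accepted.append(tok)                    # draft token verified
        else:
            accepted.append(target_next(accepted))  # replace and stop
            break
    else:
        accepted.append(target_next(accepted))      # all accepted: bonus token
    return accepted

print(speculative_step([0], k=4))  # → [0, 1, 2, 3, 4]
```

Each step costs k cheap draft passes plus one target pass, but can emit up to k + 1 tokens, which is where the speedup comes from.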
1 code implementation • 29 Apr 2024 • Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang
Notably, the inference latency of the self-draft model may no longer be negligible compared to that of the large model, which necessitates strategies that increase the token acceptance rate while minimizing the number of drafting steps of the small model.
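The trade-off above can be made quantitative with a back-of-the-envelope latency model (an illustrative assumption, not a formula from the paper): with per-token acceptance rate `alpha`, `k` drafted tokens per step, and draft cost `c_d` relative to one target forward pass, the expected tokens emitted per step form a geometric series.

```python
def expected_speedup(alpha, k, c_d):
    """Expected speedup of greedy speculative decoding under a
    simple independence assumption on token acceptance."""
    # Expected tokens emitted per verification step (geometric series).
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # k draft passes plus one target verification pass.
    cost = k * c_d + 1.0
    return tokens / cost

# When draft latency is not negligible, larger k stops paying off:
print(round(expected_speedup(0.8, 4, 0.05), 2))  # → 2.8
```

Raising `c_d` (a slower draft model) shrinks the optimal `k`, which is exactly why the acceptance rate must rise if the drafting steps are to stay few.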
1 code implementation • 5 Feb 2024 • Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang
Several design formulas are empirically shown to be especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance, and multi-round training.
1 code implementation • 6 Jun 2022 • Yunsheng Ni, Depu Meng, Changqian Yu, Chengbin Quan, Dongchun Ren, Youjian Zhao
Specifically, we first capture different representations under different augmentations, then regularize the cosine distance between these representations to enhance consistency.
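A consistency term of this kind can be sketched as follows; the function name and the exact formulation (1 minus cosine similarity) are assumptions for illustration, and the paper's actual loss may differ in detail.

```python
import math

def cosine_consistency_loss(z1, z2):
    """1 - cosine similarity between two representations of the
    same input under different augmentations; 0 means perfectly
    consistent directions."""
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    return 1.0 - dot / (n1 * n2)

# Identical representations incur no penalty; orthogonal ones are
# maximally penalized short of pointing in opposite directions.
print(cosine_consistency_loss([1.0, 2.0], [1.0, 2.0]))            # → 0.0
print(round(cosine_consistency_loss([1.0, 0.0], [0.0, 1.0]), 2))  # → 1.0
```

Minimizing this term pulls the two augmented views' representations toward the same direction, which is the consistency being regularized.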