1 code implementation • 25 Jan 2024 • Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
This paper presents MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that realizes activation-aware expert offloading.
no code implementations • 25 Jan 2024 • Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs).
no code implementations • 1 Apr 2019 • Leyang Xue, Peng Zhang, An Zeng
Notably, an optimal parameter n* of ARL exists for long-term recommendation, indicating a trade-off between preserving item diversity and matching users' preferences in order to maximize long-term recommendation accuracy.
no code implementations • 29 Mar 2019 • Peng Zhang, Leyang Xue, An Zeng
The results show that higher recommendation accuracy with diffusion-based algorithms can still be achieved by optimizing how resources are allocated on a dense network.