no code implementations • 4 Feb 2024 • YongDeok Kim, Jaehyung Ahn, Myeongwoo Kim, Changin Choi, Heejae Kim, Narankhuu Tuvshinjargal, Seungwon Lee, Yanzi Zhang, Yuan Pei, Xiongzhan Linghu, Jingkun Ma, Lin Chen, Yuehua Dai, Sungjoo Yoo
Speeding up the large-scale distributed training is challenging in that it requires improving various components of training including load balancing, communication, optimizers, etc.