no code implementations • 30 May 2024 • Wenxuan Liu, Sai Qian Zhang
Diffusion Transformers (DiTs) have recently gained substantial attention in both industry and academia for their superior visual generation capabilities, outperforming traditional diffusion models that use a U-Net backbone.
no code implementations • 8 Apr 2024 • Chao Gao, Sai Qian Zhang
Due to the scale of LLMs, PEFT operations are usually executed in public environments (e.g., cloud servers).
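PEFT methods such as LoRA illustrate why offloading fine-tuning is attractive: only small adapter matrices are trained while the large pretrained weights stay frozen. A minimal LoRA-style sketch (the class and its defaults here are illustrative assumptions, not this paper's method):

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A.
    Illustrative LoRA-style adapter; not the method proposed in this paper."""
    def __init__(self, weight, rank=8, alpha=16.0):
        self.W = weight                                  # frozen, shape (out_dim, in_dim)
        out_dim, in_dim = weight.shape
        self.A = 0.01 * np.random.randn(rank, in_dim)    # trainable, small random init
        self.B = np.zeros((out_dim, rank))               # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Only A and B would receive gradients during fine-tuning.
        return x @ (self.W + self.scale * (self.B @ self.A)).T
```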
no code implementations • 21 Mar 2024 • Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang
In addition to the algorithmic perspective, we review various real-world system designs to investigate the implementation costs associated with different PEFT algorithms.
no code implementations • 28 Nov 2023 • YiXuan Luo, Mengye Ren, Sai Qian Zhang
This approach significantly reduces computational costs in comparison with training each DNN backbone individually.
no code implementations • 4 May 2023 • Sai Qian Zhang, Thierry Tambe, Nestor Cuevas, Gu-Yeon Wei, David Brooks
To minimize the occurrence of expensive eDRAM refresh operations, it is beneficial to shorten the lifetime of stored data during the training process.
no code implementations • 19 Jul 2022 • Xin Dong, Sai Qian Zhang, Ang Li, H. T. Kung
Federated Learning aims to train a global model across multiple decentralized devices (i.e., clients) without exchanging their private local data.
no code implementations • 9 Jan 2022 • Sai Qian Zhang, Jieyu Lin, Qi Zhang
Federated learning (FL) is a training technique that enables client devices to jointly learn a shared model by aggregating locally-computed models without exposing their raw data.
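The standard aggregation rule in this setting is FedAvg-style weighted averaging; a minimal sketch is shown below as a representative baseline, not necessarily this paper's exact rule:

```python
import numpy as np

def fedavg(client_models, client_sizes):
    """Weighted average of locally trained models; only model parameters,
    never raw data, leave the clients. Each model is a list of numpy arrays."""
    total = sum(client_sizes)
    n_layers = len(client_models[0])
    return [
        sum((size / total) * model[layer]
            for model, size in zip(client_models, client_sizes))
        for layer in range(n_layers)
    ]
```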
no code implementations • 28 Oct 2021 • Sai Qian Zhang, Bradley McDanel, H. T. Kung
Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values.
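A minimal numpy sketch of the BFP idea, assuming the shared exponent is taken from the largest magnitude in each block (the block size and mantissa width are illustrative defaults; the paper's scheme may differ):

```python
import numpy as np

def bfp_quantize(values, mantissa_bits=8, block_size=16):
    """Quantize a 1-D array to Block Floating Point: every block of
    `block_size` values shares one exponent, set by its largest magnitude."""
    out = np.empty_like(values, dtype=np.float64)
    qmax = 2 ** (mantissa_bits - 1) - 1
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        _, exp = np.frexp(np.max(np.abs(block)))          # max magnitude < 2**exp
        scale = 2.0 ** (int(exp) - (mantissa_bits - 1))   # LSB weight for this block
        q = np.clip(np.round(block / scale), -qmax - 1, qmax)  # signed mantissas
        out[start:start + block_size] = q * scale
    return out
```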
no code implementations • 29 Sep 2021 • Sai Qian Zhang, Bradley McDanel
By leveraging the N:M sparsity constraint, we can identify the unimportant weights across each group of M weights at earlier stages of iterative pruning, which significantly lowers the cost of iterative training compared to conventional unstructured pruning.
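The N:M mask construction itself is simple; a generic sketch of the constraint (with an illustrative 2:4 default) follows:

```python
import numpy as np

def nm_prune_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights and zero the rest (assumes weights.size is divisible by m)."""
    groups = weights.reshape(-1, m)                     # one row per group of m
    keep = np.argsort(np.abs(groups), axis=1)[:, -n:]   # indices of the n largest
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(weights.shape)

# Usage: sparse_weights = weights * nm_prune_mask(weights, n=2, m=4)
```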
1 code implementation • NeurIPS 2020 • Sai Qian Zhang, Jieyu Lin, Qi Zhang
Recent studies have shown that introducing communication between agents can significantly improve overall performance in cooperative multi-agent reinforcement learning (MARL); a generic sketch of one such communication round follows this entry.
Tasks: Multi-agent Reinforcement Learning, Reinforcement Learning (RL)
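As a rough illustration of inter-agent communication, each agent can broadcast a message and condition on an aggregate of the others' messages (a generic scheme, not the paper's specific protocol):

```python
import numpy as np

def communication_round(hidden_states):
    """Each agent broadcasts its hidden state and receives the mean of the
    other agents' messages; the result augments each agent's policy input."""
    msgs = np.stack(hidden_states)                        # shape (n_agents, d)
    n = msgs.shape[0]
    mean_others = (msgs.sum(axis=0) - msgs) / (n - 1)     # exclude own message
    return np.concatenate([msgs, mean_others], axis=1)    # shape (n_agents, 2d)
```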
no code implementations • 13 Jul 2020 • H. T. Kung, Bradley McDanel, Sai Qian Zhang
To perform conversion from binary to SDR, we develop an efficient encoding method called HESE (Hybrid Encoding for Signed Expressions) that can be performed in one pass looking at only two bits at a time.
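The classic two-bit signed-digit recoding below conveys the flavor of such a one-pass conversion (a Booth-style illustration; HESE's exact encoding rules may differ):

```python
def signed_digit_recode(x, nbits=8):
    """One-pass binary-to-signed-digit conversion of an unsigned integer:
    digit i is b[i-1] - b[i], so each step inspects only two adjacent bits.
    Booth-style illustration; the paper's HESE rules may differ."""
    bits = [(x >> i) & 1 for i in range(nbits)] + [0]   # pad b[nbits] = 0
    digits, prev = [], 0                                # b[-1] = 0
    for b in bits:
        digits.append(prev - b)                         # digit in {-1, 0, +1}
        prev = b
    return digits   # x == sum(d << i for i, d in enumerate(digits))

# Example: signed_digit_recode(7, nbits=3) -> [-1, 0, 0, 1], i.e. 7 = 8 - 1,
# reducing three nonzero terms to two.
```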
1 code implementation • 8 Mar 2020 • Jieyu Lin, Kristina Dzeparoska, Sai Qian Zhang, Alberto Leon-Garcia, Nicolas Papernot
Our results on the StarCraft II multi-agent benchmark demonstrate that c-MARL teams are highly vulnerable to perturbations applied to one of their agents' observations (a sketch of such a perturbation follows this entry).
Tasks: Multi-agent Reinforcement Learning, Reinforcement Learning, +1 more
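A one-line example of such a perturbation is an FGSM-style step on a single agent's observation (a generic white-box attack shown for illustration; the paper's attack is more elaborate):

```python
import numpy as np

def fgsm_perturb(observation, loss_grad, eps=0.05):
    """Perturb one agent's observation along the sign of the gradient of
    the attacker's loss; eps bounds the per-element perturbation."""
    return observation + eps * np.sign(loss_grad)
```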
no code implementations • 4 Dec 2019 • Yuhang Li, Xin Dong, Sai Qian Zhang, Haoli Bai, Yuanpeng Chen, Wei Wang
We first identify three overlooked issues in extremely low-bit networks: the squashing range of quantized values, vanishing gradients during backpropagation, and the unexploited hardware acceleration of ternary networks.
2 code implementations • NeurIPS 2019 • Sai Qian Zhang, Qi Zhang, Jieyu Lin
Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications.
Tasks: Multi-agent Reinforcement Learning, Reinforcement Learning, +3 more
no code implementations • 1 May 2019 • Bradley McDanel, Sai Qian Zhang, H. T. Kung, Xin Dong
A highlight of our full-stack approach, and a key contributor to the high energy efficiency achieved, is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware.
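The idea behind a Selector-Accumulator can be sketched in software: once a weight is in signed-digit form (e.g., via a HESE-style recoding as above), each nonzero digit simply selects a shifted copy of the input to add or subtract (illustrative only; the actual SAC is a hardware circuit):

```python
def sac_multiply(x, weight_digits):
    """Multiply integer x by a weight given as signed digits (LSB first):
    select x shifted by each nonzero digit's position and accumulate."""
    acc = 0
    for i, d in enumerate(weight_digits):   # d in {-1, 0, +1}
        if d:
            acc += d * (x << i)             # shift-select, then add or subtract
    return acc

# Example: with digits [-1, 0, 0, 1] (weight 7), sac_multiply(5, [-1, 0, 0, 1]) == 35.
```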
no code implementations • 7 Nov 2018 • H. T. Kung, Bradley McDanel, Sai Qian Zhang
We study the effectiveness of this joint optimization for both high utilization and classification accuracy with ASIC and FPGA designs based on efficient bit-serial implementations of multiplier-accumulators.