1 code implementation • 13 Mar 2024 • Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
no code implementations • 11 Mar 2024 • Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging.
no code implementations • 4 Oct 2023 • Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
We first determine the sparsity ratios of different layers or blocks by leveraging the global importance score, which is efficiently computed based on the zeroth-order approximation of the global model gradients.
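As a rough illustration of this idea (a minimal sketch, not the paper's implementation), the snippet below estimates per-parameter importance scores with a forward-only, SPSA-style finite-difference approximation of the gradients in PyTorch; the `loss_fn` callable, the `batch` argument, and the per-layer aggregation at the end are assumptions made for the example.

```python
import torch

def zeroth_order_importance(model, loss_fn, batch, eps=1e-3, n_samples=4):
    """Estimate per-parameter importance as |weight * gradient| without backprop,
    using an SPSA-style finite-difference (zeroth-order) gradient estimate."""
    params = {n: p for n, p in model.named_parameters() if p.requires_grad}
    scores = {n: torch.zeros_like(p) for n, p in params.items()}
    with torch.no_grad():
        for _ in range(n_samples):
            # Sample a random direction z and evaluate the loss at theta +/- eps * z.
            z = {n: torch.randn_like(p) for n, p in params.items()}
            for n, p in params.items():
                p.add_(eps * z[n])
            loss_plus = loss_fn(model, batch)
            for n, p in params.items():
                p.sub_(2 * eps * z[n])
            loss_minus = loss_fn(model, batch)
            for n, p in params.items():
                p.add_(eps * z[n])  # restore the original weights
            coeff = (loss_plus - loss_minus) / (2 * eps)  # directional-derivative estimate
            for n, p in params.items():
                scores[n] += (p * coeff * z[n]).abs() / n_samples
    # Aggregate per layer; layers with higher totals would be assigned lower sparsity.
    return {n: s.sum().item() for n, s in scores.items()}
```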
1 code implementation • 2 Oct 2023 • Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, Tianlong Chen
Sparsely activated Mixture-of-Experts (SMoE) has shown promise for scaling up the learning capacity of neural networks; however, it suffers from (a) high memory usage, due to duplicating network layers into multiple copies as experts, and (b) redundancy in experts, as common learning-based routing policies are prone to representational collapse.
1 code implementation • ICCV 2023 • Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
Specifically, our model captures cross-modal similarity information at different granularity levels.
Ranked #11 on Video Retrieval on MSR-VTT
1 code implementation • 28 Apr 2023 • Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, Lijuan Wang
In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalities.
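As a minimal sketch of the general idea of weight-space merging (not the paper's exact method), models that share an architecture can be combined by interpolating their parameters; the coefficient argument and the usage lines below are illustrative assumptions.

```python
import torch

def merge_state_dicts(state_dicts, coeffs=None):
    # Interpolate the parameters of several models that share one architecture.
    # `coeffs` are hypothetical per-model mixing weights; default is a uniform average.
    coeffs = coeffs or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
    return merged

# Hypothetical usage with two modality-specific checkpoints of the same backbone:
# merged = merge_state_dicts([vision_model.state_dict(), language_model.state_dict()])
# shared_model.load_state_dict(merged)
```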
1 code implementation • CVPR 2023 • Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
Ranked #4 on Audio-visual Question Answering on MUSIC-AVQA
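As a rough sketch of the adapter idea described above (module names, dimensions, and the timm-style `.blocks` attribute are assumptions, and this omits the cross-modal attention used by LAVISH), a small trainable bottleneck can be attached to each layer of an otherwise frozen ViT:

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small residual bottleneck; only these parameters are trained."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def add_adapters(vit, dim=768):
    # Freeze the backbone and attach one small adapter per transformer block.
    for p in vit.parameters():
        p.requires_grad = False
    return nn.ModuleList(BottleneckAdapter(dim) for _ in vit.blocks)
```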
2 code implementations • 13 Jun 2022 • Yi-Lin Sung, Jaemin Cho, Mohit Bansal
LST saves 69% of the memory cost relative to fine-tuning the whole network, while other methods only save 26% at similar parameter usage (hence, 2.7x more memory savings).
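A simplified sketch of the ladder-side idea (assuming a transformer backbone with activations of shape batch x sequence x hidden, not the released LST code): a small side network reads detached intermediate activations of the frozen backbone, so backpropagation and activation storage never traverse the large model.

```python
import torch
import torch.nn as nn

class LadderSideNetwork(nn.Module):
    """Lightweight side path over a frozen backbone's intermediate activations."""
    def __init__(self, hidden_dim, side_dim, num_layers, num_classes):
        super().__init__()
        self.downs = nn.ModuleList(nn.Linear(hidden_dim, side_dim) for _ in range(num_layers))
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(side_dim, side_dim), nn.GELU()) for _ in range(num_layers))
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, hidden_states):
        # hidden_states: list of activations, one per frozen backbone layer.
        side = 0
        for h, down, block in zip(hidden_states, self.downs, self.blocks):
            # Detaching keeps gradients (and activation memory) off the backbone.
            side = block(side + down(h.detach()))
        return self.head(side.mean(dim=1))
```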
1 code implementation • CVPR 2022 • Yi-Lin Sung, Jaemin Cho, Mohit Bansal
Our results demonstrate that training the adapter with the weight-sharing technique (4.18% of total parameters for image-text tasks and 3.39% for video-text tasks) can match the performance of fine-tuning the entire model.
1 code implementation • NeurIPS 2021 • Yi-Lin Sung, Varun Nair, Colin Raffel
In this paper, we show that it is possible to induce a fixed sparse mask on the model's parameters that selects a subset to update over many iterations.
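A condensed sketch of this general recipe (function and variable names are illustrative, and the scoring uses an empirical Fisher approximation via squared gradients): score every parameter, keep only the top fraction, and reuse that fixed mask to zero out the gradients of all other parameters at every optimizer step.

```python
import torch

def compute_fixed_sparse_mask(model, loss_fn, data_batches, keep_ratio=0.005):
    """Score parameters by squared gradients, then keep the top-k entries as a fixed mask."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for batch in data_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
    all_scores = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_ratio * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    return {n: (s >= threshold) for n, s in scores.items()}

# During training, the mask stays fixed; before each optimizer step:
# for n, p in model.named_parameters():
#     if p.grad is not None:
#         p.grad.mul_(mask[n])
```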
no code implementations • ICLR 2019 • Yi-Lin Sung, Sung-Hsien Hsieh, Soo-Chang Pei, Chun-Shien Lu
DSGAN considers the scenario in which training samples from the target distribution, $p_{t}$, are difficult to collect.