2 code implementations • 27 Mar 2024 • Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
Ranked #9 on Visual Question Answering on MM-Vet
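The abstract above names "high-resolution visual tokens" as one ingredient. A minimal sketch of one way to realize that idea is below: low-resolution visual tokens act as queries that gather detail from a larger set of high-resolution tokens via cross-attention. All names and the residual-refinement form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: coarse tokens attend over fine tokens.
# Function names and the residual form are assumptions, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mine_high_res(low_tokens, high_tokens):
    """Each low-res token gathers detail from all high-res tokens."""
    d = low_tokens.shape[-1]
    attn = softmax(low_tokens @ high_tokens.T / np.sqrt(d))  # (n_low, n_high)
    return low_tokens + attn @ high_tokens                    # residual refinement

low = np.random.default_rng(1).normal(size=(64, 32))     # 64 coarse tokens, dim 32
high = np.random.default_rng(2).normal(size=(1024, 32))  # 1024 fine tokens
out = mine_high_res(low, high)
print(out.shape)  # → (64, 32)
```

The output keeps the compact low-resolution token count, so the language model's sequence length does not grow with image resolution.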
1 code implementation • 14 Mar 2024 • Chengyao Wang, Li Jiang, Xiaoyang Wu, Zhuotao Tian, Bohao Peng, Hengshuang Zhao, Jiaya Jia
To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning.
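The entry above describes combining segment grouping with semantic-aware contrastive learning. As a rough, hedged sketch of that combination (not the authors' GroupContrast loss), an InfoNCE-style objective can treat points that share a segment id as positives and all other points as negatives:

```python
# Hedged sketch: segment-grouped InfoNCE-style contrastive loss.
# The grouping-as-positives idea follows the abstract; the exact loss
# form here is an assumption, not the paper's implementation.
import numpy as np

def segment_contrastive_loss(feats, seg_ids, tau=0.07):
    """Points sharing a segment id are positives; others are negatives."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # cosine features
    sim = f @ f.T / tau                                       # similarity logits
    n = len(seg_ids)
    self_mask = np.eye(n, dtype=bool)
    pos = (seg_ids[:, None] == seg_ids[None, :]) & ~self_mask
    # log-softmax over all other points (exclude self from the denominator)
    logits = sim - sim.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    exp[self_mask] = 0.0
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    # negative mean log-probability of positives per anchor
    counts = pos.sum(axis=1)
    per_anchor = -(log_prob * pos).sum(axis=1)[counts > 0] / counts[counts > 0]
    return per_anchor.mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
seg_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
loss = float(segment_contrastive_loss(feats, seg_ids))
print(loss)
```

With random features the loss sits near the uniform baseline; pulling same-segment features together drives it down.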
2 code implementations • 28 Nov 2023 • Yanwei Li, Chengyao Wang, Jiaya Jia
Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens.
Ranked #6 on Zero-Shot Video Question Answer on ActivityNet-QA
Image Captioning • Video-based Generative Performance Benchmarking • +2
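The computational burden described above comes from the visual token count growing linearly with video length. A minimal sketch of the token-budget idea, reducing each frame to a small fixed number of tokens by average pooling, is below; this is an illustration of the general compression strategy, not the paper's exact per-frame token scheme.

```python
# Hedged sketch: cap per-frame visual tokens by average pooling.
# Illustrates the token-budget idea only; not the paper's method.
import numpy as np

def compress_frame_tokens(frame_tokens, tokens_per_frame=2):
    """Pool each frame's patch tokens into a fixed small number of tokens."""
    t, n, d = frame_tokens.shape                 # (frames, patches, dim)
    groups = np.array_split(np.arange(n), tokens_per_frame)
    pooled = np.stack([frame_tokens[:, g, :].mean(axis=1) for g in groups], axis=1)
    return pooled.reshape(t * tokens_per_frame, d)  # flat token sequence

video = np.zeros((120, 256, 64))  # 120 frames, 256 patch tokens of dim 64 each
out = compress_frame_tokens(video)
print(out.shape)  # → (240, 64)
```

Here 120 × 256 = 30,720 tokens shrink to 240, so sequence length scales with frame count times a small constant rather than with full patch grids.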
no code implementations • 27 Jun 2023 • Bohao Peng, Zhuotao Tian, Xiaoyang Wu, Chengyao Wang, Shu Liu, Jingyong Su, Jiaya Jia
We hope our work can benefit broader industrial applications where novel classes with limited annotations need to be reliably identified.