no code implementations • 2 May 2024 • Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks.
2 code implementations • 12 Apr 2024 • Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, Lin Ma
Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks.
no code implementations • 21 Mar 2024 • Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen
In the second stage, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection, maintaining appearance and motion consistency with the source video.
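The DDIM inversion mentioned above exploits the fact that DDIM updates are deterministic: running the same update rule with increasing timesteps maps a clean sample to a latent noise code, and running it back recovers the sample. A minimal one-dimensional sketch, with a hypothetical cosine noise schedule and a stand-in noise predictor in place of a trained diffusion network:

```python
import math

T = 50  # hypothetical number of diffusion steps

def alpha_bar(t):
    # cosine-style cumulative noise schedule (illustrative choice)
    return math.cos(0.5 * math.pi * t / (T + 1)) ** 2

def eps_pred(x, t):
    # stand-in for a trained noise-prediction network; a constant
    # (x-independent) predictor makes the round trip exactly invertible
    return 0.3

def ddim_step(x, t_from, t_to):
    # deterministic DDIM update between two timesteps; the same rule
    # run with increasing t performs inversion, with decreasing t sampling
    a_f, a_t = alpha_bar(t_from), alpha_bar(t_to)
    eps = eps_pred(x, t_from)
    x0 = (x - math.sqrt(1 - a_f) * eps) / math.sqrt(a_f)
    return math.sqrt(a_t) * x0 + math.sqrt(1 - a_t) * eps

x = 0.8  # source "frame" (a scalar here, a latent tensor in practice)
z = x
for t in range(T):          # inversion: clean sample -> latent noise
    z = ddim_step(z, t, t + 1)
r = z
for t in range(T, 0, -1):   # sampling: latent noise -> reconstruction
    r = ddim_step(r, t, t - 1)
```

With a real, input-dependent noise predictor the reconstruction is only approximate, which is why methods like AnyV2V pair inversion with feature injection to keep the edited video anchored to the source.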
1 code implementation • 6 Feb 2024 • Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, Wenhu Chen
To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation.
no code implementations • 22 Dec 2023 • Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
We evaluate VIESCORE on seven prominent conditional image generation tasks and find: (1) VIESCORE (GPT4-v) achieves a high Spearman correlation of 0.3 with human evaluations, while the human-to-human correlation is 0.45.
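The Spearman correlation reported above is the Pearson correlation computed on ranks rather than raw scores, which makes it robust to monotonic rescaling of a judge's scores. A small self-contained implementation (tie handling via average ranks), with illustrative inputs rather than the paper's data:

```python
def spearman(a, b):
    # Spearman rank correlation: Pearson correlation of the ranks.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # group tied values and assign them their average rank
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

For example, `spearman([1, 2, 3], [10, 20, 30])` is 1.0 (perfectly monotonic agreement) and `spearman([1, 2, 3], [3, 2, 1])` is -1.0, so a score of 0.3 against a 0.45 human-to-human ceiling indicates moderate but imperfect agreement.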
no code implementations • 28 Nov 2023 • Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image.
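One common way to support such heterogeneous queries, and a reasonable mental model for this line of work, is to embed every query and candidate, regardless of modality, into one shared vector space and reduce retrieval to nearest-neighbour search. A minimal sketch with hand-picked toy vectors standing in for learned multimodal embeddings:

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, corpus):
    # corpus: list of (doc_id, vector) pairs; modality-agnostic,
    # since text, images, and mixed documents share one space
    return max(corpus, key=lambda item: cosine(query_vec, item[1]))[0]

# toy corpus mixing modalities (vectors are illustrative, not learned)
corpus = [
    ("news_article", [0.9, 0.1, 0.0]),
    ("photo",        [0.1, 0.8, 0.1]),
    ("caption",      [0.0, 0.2, 0.9]),
]
```

Here a text query whose embedding lies near the photo's region of the space, e.g. `retrieve([0.2, 0.9, 0.0], corpus)`, returns `"photo"`: the retriever never needs to know which modality produced either vector.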
2 code implementations • 27 Nov 2023 • Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
no code implementations • 22 Jun 2023 • Tianle Li, Max Ku, Cong Wei, Wenhu Chen
In this work, we aspire to fill the void and propose two novel subject-driven sub-tasks, i.e., Subject Replacement and Subject Addition.
1 code implementation • CVPR 2023 • Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti
Equipped with the learned unstructured attention pattern, sparse attention ViT (Sparsifiner) produces a superior Pareto-optimal trade-off between FLOPs and top-1 accuracy on ImageNet compared to token-sparsity approaches.
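The core idea above is that each query attends only to a sparse, per-query set of key indices, so compute scales with the number of kept query-key pairs rather than the full quadratic attention matrix. A minimal sketch with a hand-picked connectivity mask (in Sparsifiner the mask is predicted from the input, not fixed):

```python
import math

def sparse_attention(Q, K, V, mask):
    # Q, K, V: lists of d-dimensional token vectors
    # mask[i]: key indices that query i is allowed to attend to
    out = []
    for i, q in enumerate(Q):
        idx = mask[i]
        # scaled dot-product scores, computed only over the kept pairs
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(len(q))
                  for j in idx]
        # numerically stable softmax over the sparse score set
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[t] / z * V[idx[t]][d] for t in range(len(idx)))
                    for d in range(len(V[0]))])
    return out
```

With `mask[i] = [i]` each token attends only to itself and the output equals `V`; a denser mask interpolates toward full attention, which is exactly the FLOPs-versus-accuracy dial the abstract describes.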