no code implementations • 30 May 2024 • Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
Following this paradigm, we propose VLoRA with the perceptual weights generator.
no code implementations • 29 May 2024 • Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding.
no code implementations • 18 Mar 2024 • Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, Wenjun Zeng
Humans constantly interact with their surrounding environments.
1 code implementation • 16 Jan 2024 • Yukun Su, Yiwen Cao, Jingliang Deng, Fengyun Rao, Qingyao Wu
A large amount of User Generated Content (UGC) is uploaded to the Internet daily and displayed to people world-widely through the client side (e. g., mobile and PC).
no code implementations • 26 Dec 2023 • Liang Xu, Xintao Lv, Yichao Yan, Xin Jin, Shuwen Wu, Congsheng Xu, Yifan Liu, Yizhou Zhou, Fengyun Rao, Xingdong Sheng, Yunhui Liu, Wenjun Zeng, Xiaokang Yang
We also equip Inter-X with versatile annotations of more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects.
no code implementations • 29 May 2023 • Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
This potential can be harnessed to create synthetic image-text pairs for training captioning models.
1 code implementation • 25 May 2023 • Zhenhua Liu, Feipeng Ma, Tianyi Wang, Fengyun Rao
We propose a Similarity Alignment Model(SAM) for video copy segment matching.
1 code implementation • 21 May 2023 • Tianyi Wang, Feipeng Ma, Zhenhua Liu, Fengyun Rao
With the development of multimedia technology, Video Copy Detection has been a crucial problem for social media platforms.
no code implementations • CVPR 2022 • Zhaoyang Zeng, Yongsheng Luo, Zhenhua Liu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen
In this paper, we propose the Tencent-MVSE dataset, which is the first benchmark dataset for the multi-modal video similarity evaluation task.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
1 code implementation • 9 Dec 2021 • Lu Qi, Jason Kuen, Zhe Lin, Jiuxiang Gu, Fengyun Rao, Dian Li, Weidong Guo, Zhen Wen, Ming-Hsuan Yang, Jiaya Jia
To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either task-unrelated or task-specific training signals from unlabeled data.
no code implementations • 13 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li
It is noted that our model is only trained on the MSR-VTT dataset.
no code implementations • 11 Oct 2021 • Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, Dian Li
We make the following improvements on the proposed CLIP4Caption++: We employ an advanced encoder-decoder model architecture X-Transformer as our main framework and make the following improvements: 1) we utilize three strong pre-trained CLIP models to extract the text-related appearance visual features.