Search Results for author: Jiashuo Yu

Found 13 papers, 10 papers with code

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Ranked #1 on Zero-Shot Video Question Answer on MVBench

Action Classification Action Recognition +12

1,024

Paper
Code

VBench: Comprehensive Benchmark Suite for Video Generative Models

1 code implementation • 29 Nov 2023 • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.

Image Generation Video Generation

321

Paper
Code

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

no code implementations • 31 Oct 2023 • Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu

The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos.

Paper
Add Code

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

2 code implementations • 26 Sep 2023 • Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model.

Ranked #4 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)

Text-to-Video Generation Video Generation +1

755

Paper
Code

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, we utilize a multi-scale approach to generate video-related descriptions.

Action Recognition Contrastive Learning +7

1,024

Paper
Code

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

3,144

Paper
Code

Long-Term Rhythmic Video Soundtracker

1 code implementation • 2 May 2023 • Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao

To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms.

Paper
Code

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification Contrastive Learning +8

1,024

Paper
Code

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao

In this report, we present our champion solutions to five tracks at Ego4D challenge.

Ranked #1 on State Change Object Detection on Ego4D

Future Hand Prediction Moment Queries +7

Paper
Code

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

1 code implementation • 12 Jul 2022 • Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang

In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate its negative impact on weakly-supervised audio-visual learning.

Ranked #6 on Anomaly Detection In Surveillance Videos on XD-Violence

Anomaly Detection In Surveillance Videos audio-visual learning +1

Paper
Code

Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization

no code implementations • 7 Jul 2022 • Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan

Although audio-visual representation has been proved to be applicable in many downstream tasks, the representation of dancing videos, which is more specific and always accompanied by music with complex auditory contents, remains challenging and uninvestigated.

Contrastive Learning Representation Learning +2

Paper
Add Code

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

1 code implementation • 24 Nov 2021 • Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang

Recognizing and localizing events in videos is a fundamental task for video understanding.

audio-visual event localization Video Understanding

Paper
Code

MPN: Multimodal Parallel Network for Audio-Visual Event Localization

no code implementations • 7 Apr 2021 • Jiashuo Yu, Ying Cheng, Rui Feng

The localization subnetwork consists of Multimodal Bottleneck Attention Module (MBAM), which is designed to extract fine-grained segment-level contents.

audio-visual event localization General Classification

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.