no code implementations • 17 May 2024 • Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, LiMin Wang
Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, and expects it to generalize well to novel action classes.
no code implementations • 22 Apr 2024 • Chen Xu, Tianhui Song, Weixin Feng, Xubin Li, Tiezheng Ge, Bo Zheng, LiMin Wang
Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks.
no code implementations • 15 Apr 2024 • Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, LiMin Wang
First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain.
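As a rough illustration of such query-based adaptive feature sampling, here is a minimal PyTorch sketch in which each query predicts a set of normalized (t, y, x) locations over the spatio-temporal feature volume and gathers features there by trilinear sampling; the module name, shapes, and offset head are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureSampler(nn.Module):
    """Each query regresses K sampling locations over a (T, H, W) feature
    volume and gathers features there. A sketch, not the paper's design."""

    def __init__(self, dim, num_points=16):
        super().__init__()
        self.num_points = num_points
        # Each query predicts K normalized (t, y, x) coordinates in [-1, 1].
        self.offset_head = nn.Linear(dim, num_points * 3)

    def forward(self, queries, feat):
        # queries: (B, Q, C); feat: (B, C, T, H, W)
        B, Q, C = queries.shape
        grid = torch.tanh(self.offset_head(queries)).view(B, Q, self.num_points, 3)
        # grid_sample over a 5D input expects grid (B, D_out, H_out, W_out, 3)
        # with coordinates ordered (x, y, t); treat (Q, K) as (D_out, H_out).
        grid = grid.flip(-1).unsqueeze(3)            # (B, Q, K, 1, 3)
        sampled = F.grid_sample(feat, grid, align_corners=True)
        # sampled: (B, C, Q, K, 1) -> (B, Q, K, C)
        return sampled.squeeze(-1).permute(0, 2, 3, 1)

# Usage: 256-d queries mining features from a 3D backbone's output volume.
sampler = AdaptiveFeatureSampler(dim=256, num_points=16)
out = sampler(torch.randn(2, 100, 256), torch.randn(2, 256, 8, 14, 14))
print(out.shape)  # torch.Size([2, 100, 16, 256])
```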
no code implementations • 10 Apr 2024 • Chunxu Liu, Guozhen Zhang, Rui Zhao, LiMin Wang
Large motion poses a critical challenge in Video Frame Interpolation (VFI) task.
no code implementations • 6 Apr 2024 • Tao Wu, Runyu He, Gangshan Wu, LiMin Wang
We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
no code implementations • 31 Mar 2024 • Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, LiMin Wang
To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions at both the instance level and the boundary level.
1 code implementation • 25 Mar 2024 • Ruopeng Gao, Yijun Zhang, LiMin Wang
In Multiple Object Tracking (MOT), tracking-by-detection methods have long stood the test of time; by definition, they split the process into two parts: object detection and association.
Ranked #1 on Multi-Object Tracking on DanceTrack (using extra training data)
1 code implementation • 24 Mar 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Zero-Shot Video Question Answer on MVBench
no code implementations • 19 Mar 2024 • Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, LiMin Wang
With video features, text, a character bank, and contextual information as inputs, the generated ADs can refer to the characters by name and provide reasonable, contextual descriptions that help the audience understand the movie's storyline.
1 code implementation • 14 Mar 2024 • Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Ranked #1 on Temporal Action Localization on FineAction
3 code implementations • 11 Mar 2024 • Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain.
no code implementations • 7 Mar 2024 • Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, LiMin Wang
Point-based image editing has attracted remarkable attention since the emergence of DragGAN.
no code implementations • 1 Mar 2024 • Zhenpeng Huang, Chao Li, Hao Chen, Yongjian Deng, Yifeng Geng, LiMin Wang
Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images in order to reuse pre-trained image models, or directly employ paired image data for knowledge distillation to enhance the learning of event streams.
no code implementations • 26 Jan 2024 • Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, LiMin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang
Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents.
3 code implementations • 28 Dec 2023 • Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, LiMin Wang
Occupancy prediction plays a pivotal role in autonomous driving.
no code implementations • 8 Dec 2023 • Hongjie Zhang, Yi Liu, Lu Dong, Yifei HUANG, Zhen-Hua Ling, Yali Wang, LiMin Wang, Yu Qiao
While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions have not yet met the criteria for genuine long-form video understanding.
1 code implementation • 5 Dec 2023 • Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei zhang, LiMin Wang
Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks remain less explored for several reasons.
no code implementations • 4 Dec 2023 • Min Yang, Huan Gao, Ping Guo, LiMin Wang
To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels.
1 code implementation • 30 Nov 2023 • Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, LiMin Wang, Dahua Lin, Bo Dai
Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications.
1 code implementation • 29 Nov 2023 • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.
2 code implementations • 28 Nov 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
no code implementations • 6 Nov 2023 • Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, LiMin Wang
AMD achieves 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE.
Ranked #20 on Action Recognition on Something-Something V2
1 code implementation • 30 Oct 2023 • Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo
Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.
no code implementations • 26 Oct 2023 • Fengyuan Shi, LiMin Wang
Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost.
1 code implementation • 2 Oct 2023 • Xinhao Li, LiMin Wang
In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks (i.e., introducing zero extra cost to the adapted models during inference).
Ranked #5 on Action Recognition on UCF101 (using extra training data)
no code implementations • 25 Aug 2023 • Jiaming Zhang, Yutao Cui, Gangshan Wu, LiMin Wang
To overcome these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling the three elements of feature, correspondence, and a compressed memory.
1 code implementation • ICCV 2023 • Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, LiMin Wang
Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporal consistent cubes from videos.
no code implementations • 19 Aug 2023 • Chen Xu, Yuhan Zhu, Guozhen Zhang, Haocheng Shen, Yixuan Liao, Xiaoxin Chen, Gangshan Wu, LiMin Wang
Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks.
1 code implementation • ICCV 2023 • Shuai Wang, Yao Teng, LiMin Wang
More specifically, for object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement.
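A minimal sketch of what a two-step unrolled equilibrium update could look like for query refinement, assuming a standard transformer decoder layer as the fixed-point map f (the paper's actual equilibrium formulation may differ):

```python
import torch
import torch.nn as nn

class TwoStepUnrolledDecoder(nn.Module):
    """Approximates the fixed point z* = f(z*, x) by applying the same
    layer f twice with shared weights. Layer choice is an assumption."""

    def __init__(self, dim=256):
        super().__init__()
        self.f = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, queries, memory):
        # queries: (B, Q, C) object queries; memory: (B, N, C) image tokens.
        z = self.f(queries, memory)   # first unrolled step
        z = self.f(z, memory)         # second step, sharing weights with the first
        return z

dec = TwoStepUnrolledDecoder()
refined = dec(torch.randn(2, 100, 256), torch.randn(2, 600, 256))
```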
1 code implementation • ICCV 2023 • Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, LiMin Wang
Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost.
Ranked #5 on 3D Object Detection on nuScenes Camera Only
no code implementations • 17 Aug 2023 • LiMin Wang, Masatoshi Hanai, Toyotaro Suzumura, Shun Takashige, Kenjiro Taura
In this study, we propose an effective pre-training method that addresses the imbalance in input data.
no code implementations • 16 Aug 2023 • Shun Takashige, Masatoshi Hanai, Toyotaro Suzumura, LiMin Wang, Kenjiro Taura
In materials science, the prediction of unobserved values, commonly referred to as extrapolation, is particularly critical for property prediction as it enables researchers to gain insight into materials beyond the limits of available data.
1 code implementation • ICCV 2023 • Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu
Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
Ranked #1 on Action Detection on THUMOS'14
1 code implementation • ICCV 2023 • Ruopeng Gao, LiMin Wang
Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively.
Ranked #6 on Multi-Object Tracking on SportsMOT
1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, we utilize a multi-scale approach to generate video-related descriptions.
no code implementations • 9 Jun 2023 • Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, LiMin Wang
In this paper, we propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks.
no code implementations • 30 May 2023 • Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, LiMin Wang, Jianlong Fu
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks.
1 code implementation • NeurIPS 2023 • Yutao Cui, Tianhui Song, Gangshan Wu, LiMin Wang
Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas.
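A hedged sketch of this prediction-token idea: learnable tokens are concatenated with template and search tokens, mixed by a transformer, and read out for box regression. All sizes and the readout head are assumptions:

```python
import torch
import torch.nn as nn

class TokenPredictionHead(nn.Module):
    """Learnable prediction tokens are concatenated with template and
    search-region tokens and read back out as box coordinates. A sketch."""

    def __init__(self, dim=256, num_pred_tokens=4, depth=4):
        super().__init__()
        self.pred_tokens = nn.Parameter(torch.zeros(1, num_pred_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_coord = nn.Linear(dim, 1)  # one scalar per token -> (cx, cy, w, h)

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, Nt, C); search_tokens: (B, Ns, C)
        B = template_tokens.size(0)
        tokens = torch.cat(
            [self.pred_tokens.expand(B, -1, -1), template_tokens, search_tokens],
            dim=1)
        tokens = self.encoder(tokens)
        # Read the prediction tokens back out as normalized box coordinates.
        n = self.pred_tokens.size(1)
        return self.to_coord(tokens[:, :n]).squeeze(-1).sigmoid()  # (B, 4)

head = TokenPredictionHead()
print(head(torch.randn(2, 49, 256), torch.randn(2, 196, 256)).shape)  # (2, 4)
```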
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
1 code implementation • 10 May 2023 • Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao
In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, coined VideoChat.
Ranked #1 on Question Answering on NExT-QA (Open-ended VideoQA)
2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao
Different from existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions, which significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios containing more than two objects.
1 code implementation • 29 Apr 2023 • Chen Li, Zeyi Liu, LiMin Wang, Minyue Li, Xiao He
Fault diagnosis is a crucial area of research in industry.
2 code implementations • ICCV 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Our EVAD consists of two specialized designs for video action detection.
no code implementations • 17 Apr 2023 • Chen Xu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, LiMin Wang
To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models outperform previous prompt-based methods on downstream tasks.
1 code implementation • ICCV 2023 • Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gangshan Wu, LiMin Wang
We expect SportsMOT to encourage MOT trackers to improve on both motion-based association and appearance-based association.
Ranked #3 on Multi-Object Tracking on SportsMOT (using extra training data)
1 code implementation • ICCV 2023 • Yao Teng, Haisong Liu, Sheng Guo, LiMin Wang
Most of these detectors are trained with one-to-many label assignment strategies.
1 code implementation • 7 Apr 2023 • Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou
In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate humans' sparse visual recognition in an end-to-end manner.
1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)
no code implementations • 28 Mar 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Existing studies model each actor and scene relation to improve action recognition.
no code implementations • CVPR 2023 • Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, LiMin Wang
STMixer is based on two core designs.
1 code implementation • ICCV 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao
Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.
Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)
1 code implementation • CVPR 2023 • Tao Lu, Xiang Ding, Haisong Liu, Gangshan Wu, LiMin Wang
Extending the success of 2D large kernels to 3D perception is challenging due to: (1) the cubically increasing overhead of processing 3D data; and (2) optimization difficulties arising from data scarcity and sparsity.
1 code implementation • CVPR 2023 • Hanlin Wang, Yilu Wu, Sheng Guo, LiMin Wang
In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), thus transforming the planning problem into a sampling process from this distribution.
1 code implementation • 21 Mar 2023 • Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, LiMin Wang
To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM).
Ranked #1 on Scene Flow Estimation on KITTI 2015 Scene Flow Test
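To give a flavor of bidirectional camera-LiDAR fusion, here is a simplified sketch that gathers image features at (pre-computed, normalized) point projections and splats point features back onto the image grid; the gating and nearest-cell splatting are simplifying assumptions, not the paper's Bi-CLFM operator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiCLFM(nn.Module):
    """Toy bidirectional fusion of dense image features and sparse point
    features. A sketch under simplifying assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.img_gate = nn.Linear(2 * dim, dim)
        self.pts_gate = nn.Linear(2 * dim, dim)

    def forward(self, img_feat, pts_feat, pts_uv):
        # img_feat: (B, C, H, W); pts_feat: (B, N, C); pts_uv: (B, N, 2) in [-1, 1]
        B, C, H, W = img_feat.shape
        # Image -> points: bilinear sampling at each point's image projection.
        gathered = F.grid_sample(img_feat, pts_uv.unsqueeze(1), align_corners=True)
        gathered = gathered.squeeze(2).transpose(1, 2)               # (B, N, C)
        g = torch.sigmoid(self.pts_gate(torch.cat([pts_feat, gathered], -1)))
        pts_out = pts_feat + g * gathered
        # Points -> image: nearest-cell scatter-add of point features.
        u = ((pts_uv[..., 0] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)
        v = ((pts_uv[..., 1] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
        idx = (v * W + u).unsqueeze(1).expand(-1, C, -1)             # (B, C, N)
        canvas = img_feat.new_zeros(B, C, H * W)
        canvas.scatter_add_(2, idx, pts_feat.transpose(1, 2).contiguous())
        canvas = canvas.view(B, C, H, W)
        gate = torch.sigmoid(self.img_gate(
            torch.cat([img_feat, canvas], dim=1).permute(0, 2, 3, 1)))
        img_out = img_feat + gate.permute(0, 3, 1, 2) * canvas
        return img_out, pts_out

fuse = BiCLFM(dim=64)
img2, pts2 = fuse(torch.randn(2, 64, 32, 96), torch.randn(2, 500, 64),
                  torch.rand(2, 500, 2) * 2 - 1)
```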
1 code implementation • CVPR 2023 • Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, LiMin Wang
In this paper, we propose a novel module to explicitly extract motion and appearance information via a unifying operation.
Ranked #1 on Video Frame Interpolation on MSU Video Frame Interpolation (PSNR metric)
1 code implementation • 13 Feb 2023 • Jiange Yang, Sheng Guo, Gangshan Wu, LiMin Wang
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
1 code implementation • 6 Feb 2023 • Yutao Cui, Cheng Jiang, Gangshan Wu, LiMin Wang
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Ranked #1 on Visual Object Tracking on TrackingNet
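A minimal sketch of the mixed-attention idea: concatenating template and search tokens and attending over the joint sequence performs feature extraction (self-attention within each set) and target-information integration (cross-attention between the sets) in one operation; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class MixedAttentionModule(nn.Module):
    """Joint attention over concatenated template and search tokens.
    A sketch of the idea, not the paper's exact module."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, Nt, C); search_tokens: (B, Ns, C)
        Nt = template_tokens.size(1)
        x = torch.cat([template_tokens, search_tokens], dim=1)
        mixed, _ = self.attn(x, x, x)          # simultaneous self/cross attention
        x = self.norm(x + mixed)
        return x[:, :Nt], x[:, Nt:]            # updated template and search tokens

mam = MixedAttentionModule()
t, s = mam(torch.randn(2, 49, 256), torch.randn(2, 324, 256))
```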
no code implementations • ICCV 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao
The prolific performances of Vision Transformers (ViTs) in image tasks have prompted research into adapting the image ViTs for video tasks.
2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)
1 code implementation • 3 Dec 2022 • Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, LiMin Wang
Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings.
3 code implementations • 17 Nov 2022 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao
UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format.
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 16 Nov 2022 • Yin-Dong Zheng, Guo Chen, Jiahao Wang, Tong Lu, LiMin Wang
Our method achieves an accuracy of 0.796 on OSCC and an absolute temporal localization error of 0.516 on PNR.
1 code implementation • 20 Oct 2022 • Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, LiMin Wang
Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS).
Ranked #4 on Temporal Action Localization on MultiTHUMOS
no code implementations • 28 Sep 2022 • Fengyuan Shi, Ruopeng Gao, Weilin Huang, LiMin Wang
The sampling module selects these informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross attention between image features and text features.
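A hedged sketch of this sample-then-decode pipeline: offsets are predicted relative to a reference point to pick informative patches, which are then grounded against text features via cross attention. Module sizes and the attention direction are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleAndDecode(nn.Module):
    """Predict sampling offsets w.r.t. a reference point, gather image
    patches there, and cross-attend them to text features. A sketch."""

    def __init__(self, dim=256, num_points=32):
        super().__init__()
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, query, ref_point, img_feat, text_feat):
        # query: (B, C); ref_point: (B, 2) in [-1, 1];
        # img_feat: (B, C, H, W); text_feat: (B, L, C)
        B, C = query.shape
        offsets = torch.tanh(self.offset_head(query)).view(B, -1, 2)
        grid = (ref_point.unsqueeze(1) + offsets).clamp(-1, 1)    # (B, K, 2)
        patches = F.grid_sample(img_feat, grid.unsqueeze(1), align_corners=True)
        patches = patches.squeeze(2).transpose(1, 2)              # (B, K, C)
        # Cross attention: sampled image patches attend to the text features.
        grounded, _ = self.decoder(patches, text_feat, text_feat)
        return grounded.mean(dim=1)                               # (B, C)

model = SampleAndDecode()
emb = model(torch.randn(2, 256), torch.zeros(2, 2),
            torch.randn(2, 256, 32, 32), torch.randn(2, 12, 256))
```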
no code implementations • 30 Jun 2022 • Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, LiMin Wang
A local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries; it generates powerful video representations and reliable boundary confidence.
1 code implementation • CVPR 2022 • Sheng Guo, Zihua Xiong, Yujie Zhong, LiMin Wang, Xiaobo Guo, Bing Han, Weilin Huang
In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning.
2 code implementations • 5 May 2022 • Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang
Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.
Ranked #1 on Temporal Action Localization on THUMOS14
1 code implementation • 2 May 2022 • Tao Lu, Chunxu Liu, Youxin Chen, Gangshan Wu, LiMin Wang
In existing work, each point in the cloud may inevitably be selected as a neighbor of multiple aggregation centers, since all centers gather neighbor features from the whole point cloud independently.
Ranked #45 on 3D Point Cloud Classification on ScanObjectNN
2 code implementations • 25 Apr 2022 • Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, LiMin Wang
This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries.
1 code implementation • 31 Mar 2022 • Liang Zhao, Yao Teng, LiMin Wang
Real-world data exhibiting skewed distributions pose a serious challenge to existing object detectors.
2 code implementations • CVPR 2022 • Ziteng Gao, LiMin Wang, Bing Han, Sheng Guo
The recent query-based object detectors break this convention by decoding image features with a set of learnable queries.
1 code implementation • CVPR 2022 • Liang Zhao, LiMin Wang
To address this issue, in this paper, we propose Task-specific Inconsistency Alignment (TIA), by developing a new alignment mechanism in separate task spaces, improving the performance of the detector on both subtasks.
4 code implementations • 23 Mar 2022 • Zhan Tong, Yibing Song, Jue Wang, LiMin Wang
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
Ranked #5 on Self-Supervised Action Recognition on HMDB51
1 code implementation • CVPR 2022 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Ranked #7 on Visual Object Tracking on UAV123
1 code implementation • 3 Mar 2022 • Yating Tian, Hongwen Zhang, Yebin Liu, LiMin Wang
Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention.
no code implementations • 1 Mar 2022 • Jing Tan, Yuhong Wang, Gangshan Wu, LiMin Wang
Instead, in this paper, we present Temporal Perceiver, a general Transformer-based architecture offering a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level and event-level to scene-level GBDs.
no code implementations • 21 Feb 2022 • Qingsong Zhao, Yi Wang, Zhipeng Zhou, Duoqian Miao, LiMin Wang, Yu Qiao, Cairong Zhao
Flattening is essential in computer vision by converting multi-dimensional feature maps or images into one-dimensional vectors.
1 code implementation • CVPR 2022 • Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, LiMin Wang
Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance. In contrast, we present a new paradigm that learns instance-specific video condensation policies to select informative frames for representing the entire video in a single step.
3 code implementations • CVPR 2022 • Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, LiMin Wang
Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.
1 code implementation • 7 Dec 2021 • Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu
Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.
Ranked #19 on Temporal Action Localization on ActivityNet-1.3
1 code implementation • 24 Oct 2021 • Zhenxi Zhu, LiMin Wang, Sheng Guo, Gangshan Wu
In this paper, we aim to present an in-depth study on few-shot video classification by making three contributions.
no code implementations • 23 Sep 2021 • Fengyuan Shi, Weilin Huang, LiMin Wang
In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input.
no code implementations • ICCV 2021 • Ziteng Gao, LiMin Wang, Gangshan Wu
In this paper, we break the convention of using the same training samples for the two heads of dense detectors and explore a novel supervisory paradigm, termed Mutual Supervision (MuSu), which respectively and mutually assigns training samples to the classification and regression heads to ensure this consistency.
2 code implementations • 10 Sep 2021 • Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu
Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space.
Ranked #3 on Temporal Sentence Grounding on Charades-STA
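A minimal sketch of grounding as metric learning in a joint embedding space, with a symmetric (mutual) matching loss; the encoders and loss details are illustrative assumptions, not MMN's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingMatcher(nn.Module):
    """Project video moments and language queries into a joint space and
    score all pairs by cosine similarity. A sketch."""

    def __init__(self, video_dim, text_dim, joint_dim=256):
        super().__init__()
        self.v_proj = nn.Linear(video_dim, joint_dim)
        self.t_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, moment_feats, query_feats):
        # moment_feats: (M, Dv) candidate moments; query_feats: (Q, Dt) queries
        v = F.normalize(self.v_proj(moment_feats), dim=-1)
        t = F.normalize(self.t_proj(query_feats), dim=-1)
        return t @ v.t()      # (Q, M) similarity scores for all pairs

def mutual_matching_loss(sim, pos_idx, tau=0.07):
    # sim: (Q, M); pos_idx[q] is the index of query q's matched moment.
    loss_q2m = F.cross_entropy(sim / tau, pos_idx)      # query -> moment
    loss_m2q = F.cross_entropy(sim.t()[pos_idx] / tau,  # moment -> query
                               torch.arange(len(pos_idx), device=sim.device))
    return loss_q2m + loss_m2q

matcher = JointEmbeddingMatcher(video_dim=512, text_dim=300)
sim = matcher(torch.randn(40, 512), torch.randn(8, 300))
loss = mutual_matching_loss(sim, torch.randint(0, 40, (8,)))
```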
1 code implementation • ICCV 2021 • TianHao Li, LiMin Wang, Gangshan Wu
In this paper, we show that soft label can serve as a powerful solution to incorporate label correlation into a multi-stage training scheme for long-tailed recognition.
Ranked #43 on Long-tail Learning on CIFAR-100-LT (ρ=100)
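A small sketch of the soft-label idea: blending the hard label with a stage-1 model's temperature-smoothed prediction injects inter-class correlation into the stage-2 training target. The blend weight and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(stage2_logits, stage1_logits, target, alpha=0.5, tau=2.0):
    """Cross-entropy against soft targets that mix the ground-truth label
    with the stage-1 model's smoothed prediction. A sketch."""
    hard = F.one_hot(target, stage2_logits.size(-1)).float()
    soft = F.softmax(stage1_logits / tau, dim=-1)       # encodes label correlation
    mixed = alpha * hard + (1 - alpha) * soft
    return -(mixed * F.log_softmax(stage2_logits, dim=-1)).sum(-1).mean()

loss = soft_label_loss(torch.randn(8, 100), torch.randn(8, 100),
                       torch.randint(0, 100, (8,)))
```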
1 code implementation • ICCV 2021 • Yao Teng, LiMin Wang, Zhifeng Li, Gangshan Wu
Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition.
4 code implementations • CVPR 2022 • Yao Teng, LiMin Wang
The key to our method is a set of learnable triplet queries and a structured triplet detector which could be jointly optimized from the training set in an end-to-end manner.
1 code implementation • CVPR 2021 • Tao Lu, LiMin Wang, Gangshan Wu
Previous point cloud semantic segmentation networks use the same process to aggregate features from neighbors of the same category and different categories.
Ranked #1 on Semantic Segmentation on SYNTHIA
no code implementations • 10 Jun 2021 • Xindi Hu, LiMin Wang, Xin Yang, Xu Zhou, Wufeng Xue, Yan Cao, Shengfeng Liu, Yuhao Huang, Shuangping Guo, Ning Shang, Dong Ni, Ning Gu
In this study, we propose a multi-task framework to learn the relationships among landmarks and structures jointly and automatically evaluate DDH.
1 code implementation • 6 Jun 2021 • Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, LiMin Wang
Three-dimensional face dense alignment and reconstruction in the wild is a challenging problem as partial facial information is commonly missing in occluded and large pose face images.
Ranked #1 on 3D Face Reconstruction on AFLW2000-3D
1 code implementation • 24 May 2021 • Yi Liu, LiMin Wang, Yali Wang, Xiao Ma, Yu Qiao
Temporal action localization (TAL) is an important and challenging problem in video understanding.
1 code implementation • ICCV 2021 • Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, LiMin Wang
Spatio-temporal action detection is an important and challenging problem in video understanding.
1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu
First, we present two different motion representations to enable us to efficiently distinguish the motion-salient frames from the background.
1 code implementation • 1 Apr 2021 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu
Accurate tracking is still a challenging task due to appearance variations, pose and view changes, and geometric deformations of target in videos.
Ranked #1 on Visual Object Tracking on VOT2019
2 code implementations • ICCV 2021 • Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, LiMin Wang, Zhenan Sun
Regression-based methods have recently shown promising results in reconstructing human meshes from monocular images.
Ranked #5 on 3D Human Pose Estimation on AGORA (using extra training data)
2 code implementations • ICCV 2021 • Jing Tan, Jiaqi Tang, LiMin Wang, Gangshan Wu
Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both tasks of temporal action proposal generation and temporal action detection.
no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
Ranked #17 on Action Recognition on Something-Something V1
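A toy sketch of the temporal-difference principle behind TDN: RGB differences between neighboring frames serve as a cheap motion cue that is fused with the appearance feature of a sampled frame. The conv stem is an illustrative stand-in, not TDN's exact module:

```python
import torch
import torch.nn as nn

class ShortTermTemporalDifference(nn.Module):
    """Fuse the center frame's appearance with aggregated frame differences.
    A sketch of the temporal-difference idea."""

    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 7, stride=2, padding=3)
        self.diff_stem = nn.Conv2d(3, channels, 7, stride=2, padding=3)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) consecutive frames around a sampled frame.
        center = clip[:, clip.size(1) // 2]                 # appearance frame
        diff = (clip[:, 1:] - clip[:, :-1]).mean(dim=1)     # aggregated RGB diffs
        return self.stem(center) + self.diff_stem(diff)     # fused feature map

m = ShortTermTemporalDifference()
print(m(torch.randn(2, 5, 3, 112, 112)).shape)  # torch.Size([2, 64, 56, 56])
```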
1 code implementation • CVPR 2018 • Limin Wang, Wei Li, Wen Li, Luc van Gool
Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.
Ranked #51 on Action Recognition on UCF101
11 code implementations • 8 May 2017 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool
Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Ranked #5 on Video Classification on COIN
2 code implementations • CVPR 2017 • Limin Wang, Yuanjun Xiong, Dahua Lin, Luc van Gool
We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet.
Ranked #3 on Action Classification on ActivityNet-1.2
Weakly Supervised Action Localization Weakly-Supervised Action Recognition
2 code implementations • 4 Oct 2016 • Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao
Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partly due to recent large-scale scene datasets such as Places and Places2.
no code implementations • 1 Sep 2016 • Limin Wang, Zhe Wang, Yu Qiao, Luc van Gool
These newly designed transferring techniques exploit multi-task learning frameworks to incorporate extra knowledge from other networks and additional datasets into the training procedure of event CNNs.
19 code implementations • 2 Aug 2016 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool
The other contribution is our study on a series of good practices for learning ConvNets on video data with the help of the temporal segment network.
Ranked #3 on Multimodal Activity Recognition on EV-Action
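A compact sketch of TSN's sparse sampling and segmental consensus: the video is divided into K segments, one snippet per segment is scored by a shared backbone, and the predictions are averaged. The tiny backbone below stands in for the paper's ConvNets:

```python
import torch
import torch.nn as nn

class TemporalSegmentNetwork(nn.Module):
    """Sparse segment sampling with an average segmental consensus.
    A sketch of the TSN scheme."""

    def __init__(self, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, snippets):
        # snippets: (B, K, 3, H, W), one randomly sampled snippet per segment.
        B, K = snippets.shape[:2]
        logits = self.backbone(snippets.flatten(0, 1)).view(B, K, -1)
        return logits.mean(dim=1)   # segmental consensus over the K snippets

tsn = TemporalSegmentNetwork(num_classes=101, num_segments=3)
print(tsn(torch.randn(2, 3, 3, 112, 112)).shape)  # torch.Size([2, 101])
```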
no code implementations • CVPR 2016 • Limin Wang, Yu Qiao, Xiaoou Tang, Luc van Gool
Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location.
Ranked #11 on Action Detection on J-HMDB
no code implementations • 14 Oct 2015 • Limin Wang, Zhe Wang, Sheng Guo, Yu Qiao
Event recognition from still images is one of the most important problems for image understanding.
2 code implementations • 7 Aug 2015 • Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao
We verify the performance of trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205.
5 code implementations • 8 Jul 2015 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao
However, for action recognition in videos, the improvement of deep convolutional networks is not so evident.
Ranked #66 on Action Recognition on UCF101
1 code implementation • CVPR 2015 • Limin Wang, Yu Qiao, Xiaoou Tang
Visual features are of vital importance for human action understanding in videos.
Ranked #59 on Action Recognition on HMDB-51
no code implementations • 2 May 2015 • Limin Wang, Zhe Wang, Wenbin Du, Yu Qiao
Meanwhile, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition.