no code implementations • 17 May 2024 • Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, LiMin Wang
Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, and expects it to generalize well to novel action classes.
no code implementations • 22 Apr 2024 • Chen Xu, Tianhui Song, Weixin Feng, Xubin Li, Tiezheng Ge, Bo Zheng, LiMin Wang
Diffusion models have significantly advanced the state of the art in image, audio, and video generation tasks.
no code implementations • 15 Apr 2024 • Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, LiMin Wang
First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain.
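As a rough illustration of such query-based adaptive feature sampling, here is a minimal PyTorch sketch in which each query predicts a set of normalized (t, y, x) locations over the spatio-temporal feature volume and gathers features there by trilinear sampling; the module name, shapes, and offset head are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureSampler(nn.Module):
    """Each query regresses K sampling locations over a (T, H, W) feature
    volume and gathers features there. A sketch, not the paper's design."""

    def __init__(self, dim, num_points=16):
        super().__init__()
        self.num_points = num_points
        # Each query predicts K normalized (t, y, x) coordinates in [-1, 1].
        self.offset_head = nn.Linear(dim, num_points * 3)

    def forward(self, queries, feat):
        # queries: (B, Q, C); feat: (B, C, T, H, W)
        B, Q, C = queries.shape
        grid = torch.tanh(self.offset_head(queries)).view(B, Q, self.num_points, 3)
        # grid_sample over a 5D input expects grid (B, D_out, H_out, W_out, 3)
        # with coordinates ordered (x, y, t); treat (Q, K) as (D_out, H_out).
        grid = grid.flip(-1).unsqueeze(3)            # (B, Q, K, 1, 3)
        sampled = F.grid_sample(feat, grid, align_corners=True)
        # sampled: (B, C, Q, K, 1) -> (B, Q, K, C)
        return sampled.squeeze(-1).permute(0, 2, 3, 1)

# Usage: 256-d queries mining features from a 3D backbone's output volume.
sampler = AdaptiveFeatureSampler(dim=256, num_points=16)
out = sampler(torch.randn(2, 100, 256), torch.randn(2, 256, 8, 14, 14))
print(out.shape)  # torch.Size([2, 100, 16, 256])
```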
no code implementations • 10 Apr 2024 • Chunxu Liu, Guozhen Zhang, Rui Zhao, LiMin Wang
Large motion poses a critical challenge in Video Frame Interpolation (VFI) task.
no code implementations • 6 Apr 2024 • Tao Wu, Runyu He, Gangshan Wu, LiMin Wang
We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
no code implementations • 31 Mar 2024 • Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, LiMin Wang
To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions at both the instance level and the boundary level.
1 code implementation • 25 Mar 2024 • Ruopeng Gao, Yijun Zhang, LiMin Wang
In Multiple Object Tracking (MOT), tracking-by-detection methods have long stood the test of time; by definition, they split the process into two parts: object detection and association.
Ranked #1 on Multi-Object Tracking on DanceTrack (using extra training data)
1 code implementation • 24 Mar 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Zero-Shot Video Question Answer on MVBench
no code implementations • 19 Mar 2024 • Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, LiMin Wang
With video features, text, a character bank, and contextual information as inputs, the generated ADs can refer to the characters by name and provide reasonable, contextual descriptions that help the audience understand the movie's storyline.
1 code implementation • 14 Mar 2024 • Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Ranked #1 on Temporal Action Localization on FineAction
3 code implementations • 11 Mar 2024 • Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain.
no code implementations • 7 Mar 2024 • Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, LiMin Wang
Point-based image editing has attracted remarkable attention since the emergence of DragGAN.
no code implementations • 1 Mar 2024 • Zhenpeng Huang, Chao Li, Hao Chen, Yongjian Deng, Yifeng Geng, LiMin Wang
Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images in order to reuse pre-trained image models, or directly employ paired image data for knowledge distillation to enhance the learning of event streams.
no code implementations • 26 Jan 2024 • Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, LiMin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang
Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents.
3 code implementations • 28 Dec 2023 • Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, LiMin Wang
Occupancy prediction plays a pivotal role in autonomous driving.
no code implementations • 8 Dec 2023 • Hongjie Zhang, Yi Liu, Lu Dong, Yifei HUANG, Zhen-Hua Ling, Yali Wang, LiMin Wang, Yu Qiao
While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions have not yet met the criteria for genuine long-form video understanding.
1 code implementation • 5 Dec 2023 • Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei zhang, LiMin Wang
Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks remain less explored for several reasons.
no code implementations • 4 Dec 2023 • Min Yang, Huan Gao, Ping Guo, LiMin Wang
To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels.
1 code implementation • 30 Nov 2023 • Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, LiMin Wang, Dahua Lin, Bo Dai
Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications.
1 code implementation • 29 Nov 2023 • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu
We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.
2 code implementations • 28 Nov 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
no code implementations • 6 Nov 2023 • Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, LiMin Wang
AMD achieves 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE.
Ranked #20 on Action Recognition on Something-Something V2
1 code implementation • 30 Oct 2023 • Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo
Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.
no code implementations • 26 Oct 2023 • Fengyuan Shi, LiMin Wang
Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost.
1 code implementation • 2 Oct 2023 • Xinhao Li, LiMin Wang
In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks (i.e., introducing zero extra cost to the adapted models during inference).
Ranked #5 on Action Recognition on UCF101 (using extra training data)
no code implementations • 25 Aug 2023 • Jiaming Zhang, Yutao Cui, Gangshan Wu, LiMin Wang
To overcome these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling the three elements of feature, correspondence, and a compressed memory.
1 code implementation • ICCV 2023 • Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, LiMin Wang
Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporal consistent cubes from videos.
no code implementations • 19 Aug 2023 • Chen Xu, Yuhan Zhu, Guozhen Zhang, Haocheng Shen, Yixuan Liao, Xiaoxin Chen, Gangshan Wu, LiMin Wang
Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks.
1 code implementation • ICCV 2023 • Shuai Wang, Yao Teng, LiMin Wang
More specifically, for object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement.
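A minimal sketch of what a two-step unrolled equilibrium update could look like for query refinement, assuming a standard transformer decoder layer as the fixed-point map f (the paper's actual equilibrium formulation may differ):

```python
import torch
import torch.nn as nn

class TwoStepUnrolledDecoder(nn.Module):
    """Approximates the fixed point z* = f(z*, x) by applying the same
    layer f twice with shared weights. Layer choice is an assumption."""

    def __init__(self, dim=256):
        super().__init__()
        self.f = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, queries, memory):
        # queries: (B, Q, C) object queries; memory: (B, N, C) image tokens.
        z = self.f(queries, memory)   # first unrolled step
        z = self.f(z, memory)         # second step, sharing weights with the first
        return z

dec = TwoStepUnrolledDecoder()
refined = dec(torch.randn(2, 100, 256), torch.randn(2, 600, 256))
```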
1 code implementation • ICCV 2023 • Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, LiMin Wang
Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost.
Ranked #5 on 3D Object Detection on nuScenes Camera Only
no code implementations • 17 Aug 2023 • LiMin Wang, Masatoshi Hanai, Toyotaro Suzumura, Shun Takashige, Kenjiro Taura
In this study, we propose an effective pre-training method that addresses the imbalance in input data.
no code implementations • 16 Aug 2023 • Shun Takashige, Masatoshi Hanai, Toyotaro Suzumura, LiMin Wang, Kenjiro Taura
In materials science, the prediction of unobserved values, commonly referred to as extrapolation, is particularly critical for property prediction as it enables researchers to gain insight into materials beyond the limits of available data.
1 code implementation • ICCV 2023 • Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu
Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
Ranked #1 on Action Detection on THUMOS'14
1 code implementation • ICCV 2023 • Ruopeng Gao, LiMin Wang
Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively.
Ranked #6 on Multi-Object Tracking on SportsMOT
1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, we utilize a multi-scale approach to generate video-related descriptions.
no code implementations • 9 Jun 2023 • Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, LiMin Wang
In this paper, we propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks.
no code implementations • 30 May 2023 • Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, LiMin Wang, Jianlong Fu
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks.
1 code implementation • NeurIPS 2023 • Yutao Cui, Tianhui Song, Gangshan Wu, LiMin Wang
Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas.
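A hedged sketch of this prediction-token idea: learnable tokens are concatenated with template and search tokens, mixed by a transformer, and read out for box regression. All sizes and the readout head are assumptions:

```python
import torch
import torch.nn as nn

class TokenPredictionHead(nn.Module):
    """Learnable prediction tokens are concatenated with template and
    search-region tokens and read back out as box coordinates. A sketch."""

    def __init__(self, dim=256, num_pred_tokens=4, depth=4):
        super().__init__()
        self.pred_tokens = nn.Parameter(torch.zeros(1, num_pred_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_coord = nn.Linear(dim, 1)  # one scalar per token -> (cx, cy, w, h)

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, Nt, C); search_tokens: (B, Ns, C)
        B = template_tokens.size(0)
        tokens = torch.cat(
            [self.pred_tokens.expand(B, -1, -1), template_tokens, search_tokens],
            dim=1)
        tokens = self.encoder(tokens)
        # Read the prediction tokens back out as normalized box coordinates.
        n = self.pred_tokens.size(1)
        return self.to_coord(tokens[:, :n]).squeeze(-1).sigmoid()  # (B, 4)

head = TokenPredictionHead()
print(head(torch.randn(2, 49, 256), torch.randn(2, 196, 256)).shape)  # (2, 4)
```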
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
1 code implementation • 10 May 2023 • Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao
In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, coined VideoChat.
Ranked #1 on Question Answering on NExT-QA (Open-ended VideoQA)
2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao
Different from existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions, which significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios containing more than two objects.
1 code implementation • 29 Apr 2023 • Chen Li, Zeyi Liu, LiMin Wang, Minyue Li, Xiao He
Fault diagnosis is a crucial area of research in industry.
2 code implementations • ICCV 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Our EVAD consists of two specialized designs for video action detection.
no code implementations • 17 Apr 2023 • Chen Xu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, LiMin Wang
To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models outperform previous prompt-based methods on downstream tasks.
1 code implementation • ICCV 2023 • Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gangshan Wu, LiMin Wang
We expect SportsMOT to encourage MOT trackers to improve on both motion-based association and appearance-based association.
Ranked #3 on Multi-Object Tracking on SportsMOT (using extra training data)
1 code implementation • ICCV 2023 • Yao Teng, Haisong Liu, Sheng Guo, LiMin Wang
Most of these detectors are trained with one-to-many label assignment strategies.
1 code implementation • 7 Apr 2023 • Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou
In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate humans' sparse visual recognition in an end-to-end manner.
1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)
no code implementations • 28 Mar 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Existing studies model each actor and scene relation to improve action recognition.
no code implementations • CVPR 2023 • Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, LiMin Wang
STMixer is based on two core designs.
1 code implementation • ICCV 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao
Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.
Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)
1 code implementation • CVPR 2023 • Tao Lu, Xiang Ding, Haisong Liu, Gangshan Wu, LiMin Wang
Extending the success of 2D large kernels to 3D perception is challenging due to: (1) the cubically increasing overhead of processing 3D data; and (2) optimization difficulties arising from data scarcity and sparsity.
1 code implementation • CVPR 2023 • Hanlin Wang, Yilu Wu, Sheng Guo, LiMin Wang
In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), thus transforming the planning problem into a sampling process from this distribution.
1 code implementation • 21 Mar 2023 • Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, LiMin Wang
To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM).
Ranked #1 on Scene Flow Estimation on KITTI 2015 Scene Flow Test
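To give a flavor of bidirectional camera-LiDAR fusion, here is a simplified sketch that gathers image features at (pre-computed, normalized) point projections and splats point features back onto the image grid; the gating and nearest-cell splatting are simplifying assumptions, not the paper's Bi-CLFM operator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiCLFM(nn.Module):
    """Toy bidirectional fusion of dense image features and sparse point
    features. A sketch under simplifying assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.img_gate = nn.Linear(2 * dim, dim)
        self.pts_gate = nn.Linear(2 * dim, dim)

    def forward(self, img_feat, pts_feat, pts_uv):
        # img_feat: (B, C, H, W); pts_feat: (B, N, C); pts_uv: (B, N, 2) in [-1, 1]
        B, C, H, W = img_feat.shape
        # Image -> points: bilinear sampling at each point's image projection.
        gathered = F.grid_sample(img_feat, pts_uv.unsqueeze(1), align_corners=True)
        gathered = gathered.squeeze(2).transpose(1, 2)               # (B, N, C)
        g = torch.sigmoid(self.pts_gate(torch.cat([pts_feat, gathered], -1)))
        pts_out = pts_feat + g * gathered
        # Points -> image: nearest-cell scatter-add of point features.
        u = ((pts_uv[..., 0] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)
        v = ((pts_uv[..., 1] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
        idx = (v * W + u).unsqueeze(1).expand(-1, C, -1)             # (B, C, N)
        canvas = img_feat.new_zeros(B, C, H * W)
        canvas.scatter_add_(2, idx, pts_feat.transpose(1, 2).contiguous())
        canvas = canvas.view(B, C, H, W)
        gate = torch.sigmoid(self.img_gate(
            torch.cat([img_feat, canvas], dim=1).permute(0, 2, 3, 1)))
        img_out = img_feat + gate.permute(0, 3, 1, 2) * canvas
        return img_out, pts_out

fuse = BiCLFM(dim=64)
img2, pts2 = fuse(torch.randn(2, 64, 32, 96), torch.randn(2, 500, 64),
                  torch.rand(2, 500, 2) * 2 - 1)
```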
1 code implementation • CVPR 2023 • Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, LiMin Wang
In this paper, we propose a novel module to explicitly extract motion and appearance information via a unifying operation.
Ranked #1 on Video Frame Interpolation on MSU Video Frame Interpolation (PSNR metric)
1 code implementation • 13 Feb 2023 • Jiange Yang, Sheng Guo, Gangshan Wu, LiMin Wang
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
1 code implementation • 6 Feb 2023 • Yutao Cui, Cheng Jiang, Gangshan Wu, LiMin Wang
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Ranked #1 on Visual Object Tracking on TrackingNet
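A minimal sketch of the mixed-attention idea: concatenating template and search tokens and attending over the joint sequence performs feature extraction (self-attention within each set) and target-information integration (cross-attention between the sets) in one operation; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class MixedAttentionModule(nn.Module):
    """Joint attention over concatenated template and search tokens.
    A sketch of the idea, not the paper's exact module."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, Nt, C); search_tokens: (B, Ns, C)
        Nt = template_tokens.size(1)
        x = torch.cat([template_tokens, search_tokens], dim=1)
        mixed, _ = self.attn(x, x, x)          # simultaneous self/cross attention
        x = self.norm(x + mixed)
        return x[:, :Nt], x[:, Nt:]            # updated template and search tokens

mam = MixedAttentionModule()
t, s = mam(torch.randn(2, 49, 256), torch.randn(2, 324, 256))
```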
no code implementations • ICCV 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao
The prolific performances of Vision Transformers (ViTs) in image tasks have prompted research into adapting the image ViTs for video tasks.
2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)
1 code implementation • 3 Dec 2022 • Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, LiMin Wang
Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings.
3 code implementations • 17 Nov 2022 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao
UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format.
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 16 Nov 2022 • Yin-Dong Zheng, Guo Chen, Jiahao Wang, Tong Lu, LiMin Wang
Our method achieves an accuracy of 0.796 on OSCC and an absolute temporal localization error of 0.516 on PNR.
1 code implementation • 20 Oct 2022 • Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, LiMin Wang
Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS).
Ranked #4 on Temporal Action Localization on MultiTHUMOS
no code implementations • 28 Sep 2022 • Fengyuan Shi, Ruopeng Gao, Weilin Huang, LiMin Wang
The sampling module selects these informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross attention between image features and text features.
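A hedged sketch of this sample-then-decode pipeline: offsets are predicted relative to a reference point to pick informative patches, which are then grounded against text features via cross attention. Module sizes and the attention direction are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleAndDecode(nn.Module):
    """Predict sampling offsets w.r.t. a reference point, gather image
    patches there, and cross-attend them to text features. A sketch."""

    def __init__(self, dim=256, num_points=32):
        super().__init__()
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, query, ref_point, img_feat, text_feat):
        # query: (B, C); ref_point: (B, 2) in [-1, 1];
        # img_feat: (B, C, H, W); text_feat: (B, L, C)
        B, C = query.shape
        offsets = torch.tanh(self.offset_head(query)).view(B, -1, 2)
        grid = (ref_point.unsqueeze(1) + offsets).clamp(-1, 1)    # (B, K, 2)
        patches = F.grid_sample(img_feat, grid.unsqueeze(1), align_corners=True)
        patches = patches.squeeze(2).transpose(1, 2)              # (B, K, C)
        # Cross attention: sampled image patches attend to the text features.
        grounded, _ = self.decoder(patches, text_feat, text_feat)
        return grounded.mean(dim=1)                               # (B, C)

model = SampleAndDecode()
emb = model(torch.randn(2, 256), torch.zeros(2, 2),
            torch.randn(2, 256, 32, 32), torch.randn(2, 12, 256))
```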
no code implementations • 30 Jun 2022 • Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, LiMin Wang
A local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries; it generates powerful video representations and reliable boundary confidence.
1 code implementation • CVPR 2022 • Sheng Guo, Zihua Xiong, Yujie Zhong, LiMin Wang, Xiaobo Guo, Bing Han, Weilin Huang
In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning.
2 code implementations • 5 May 2022 • Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang
Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.
Ranked #1 on Temporal Action Localization on THUMOS14
1 code implementation • 2 May 2022 • Tao Lu, Chunxu Liu, Youxin Chen, Gangshan Wu, LiMin Wang
In existing work, each point in the cloud may inevitably be selected as a neighbor of multiple aggregation centers, since all centers gather neighbor features from the whole point cloud independently.
Ranked #45 on 3D Point Cloud Classification on ScanObjectNN
2 code implementations • 25 Apr 2022 • Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, LiMin Wang
This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries.
1 code implementation • 31 Mar 2022 • Liang Zhao, Yao Teng, LiMin Wang
Real-world data exhibiting skewed distributions pose a serious challenge to existing object detectors.
2 code implementations • CVPR 2022 • Ziteng Gao, LiMin Wang, Bing Han, Sheng Guo
The recent query-based object detectors break this convention by decoding image features with a set of learnable queries.
1 code implementation • CVPR 2022 • Liang Zhao, LiMin Wang
To address this issue, in this paper, we propose Task-specific Inconsistency Alignment (TIA), by developing a new alignment mechanism in separate task spaces, improving the performance of the detector on both subtasks.
4 code implementations • 23 Mar 2022 • Zhan Tong, Yibing Song, Jue Wang, LiMin Wang
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
Ranked #5 on Self-Supervised Action Recognition on HMDB51
1 code implementation • CVPR 2022 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Ranked #7 on Visual Object Tracking on UAV123
1 code implementation • 3 Mar 2022 • Yating Tian, Hongwen Zhang, Yebin Liu, LiMin Wang
Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention.
no code implementations • 1 Mar 2022 • Jing Tan, Yuhong Wang, Gangshan Wu, LiMin Wang
Instead, in this paper, we present Temporal Perceiver, a general Transformer-based architecture offering a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level and event-level to scene-level GBDs.
no code implementations • 21 Feb 2022 • Qingsong Zhao, Yi Wang, Zhipeng Zhou, Duoqian Miao, LiMin Wang, Yu Qiao, Cairong Zhao
Flattening is essential in computer vision by converting multi-dimensional feature maps or images into one-dimensional vectors.
1 code implementation • CVPR 2022 • Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, LiMin Wang
Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance. In contrast, we present a new paradigm that learns instance-specific video condensation policies to select informative frames for representing the entire video in a single step.
3 code implementations • CVPR 2022 • Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, LiMin Wang
Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.
1 code implementation • 7 Dec 2021 • Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu
Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.
Ranked #19 on Temporal Action Localization on ActivityNet-1.3
1 code implementation • 24 Oct 2021 • Zhenxi Zhu, LiMin Wang, Sheng Guo, Gangshan Wu
In this paper, we aim to present an in-depth study on few-shot video classification by making three contributions.
no code implementations • 23 Sep 2021 • Fengyuan Shi, Weilin Huang, LiMin Wang
In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input.
no code implementations • ICCV 2021 • Ziteng Gao, LiMin Wang, Gangshan Wu
In this paper, we break the convention of using the same training samples for the two heads of dense detectors and explore a novel supervisory paradigm, termed Mutual Supervision (MuSu), which respectively and mutually assigns training samples to the classification and regression heads to ensure this consistency.
2 code implementations • 10 Sep 2021 • Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu
Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space.
Ranked #3 on Temporal Sentence Grounding on Charades-STA
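A minimal sketch of grounding as metric learning in a joint embedding space, with a symmetric (mutual) matching loss; the encoders and loss details are illustrative assumptions, not MMN's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingMatcher(nn.Module):
    """Project video moments and language queries into a joint space and
    score all pairs by cosine similarity. A sketch."""

    def __init__(self, video_dim, text_dim, joint_dim=256):
        super().__init__()
        self.v_proj = nn.Linear(video_dim, joint_dim)
        self.t_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, moment_feats, query_feats):
        # moment_feats: (M, Dv) candidate moments; query_feats: (Q, Dt) queries
        v = F.normalize(self.v_proj(moment_feats), dim=-1)
        t = F.normalize(self.t_proj(query_feats), dim=-1)
        return t @ v.t()      # (Q, M) similarity scores for all pairs

def mutual_matching_loss(sim, pos_idx, tau=0.07):
    # sim: (Q, M); pos_idx[q] is the index of query q's matched moment.
    loss_q2m = F.cross_entropy(sim / tau, pos_idx)      # query -> moment
    loss_m2q = F.cross_entropy(sim.t()[pos_idx] / tau,  # moment -> query
                               torch.arange(len(pos_idx), device=sim.device))
    return loss_q2m + loss_m2q

matcher = JointEmbeddingMatcher(video_dim=512, text_dim=300)
sim = matcher(torch.randn(40, 512), torch.randn(8, 300))
loss = mutual_matching_loss(sim, torch.randint(0, 40, (8,)))
```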
1 code implementation • ICCV 2021 • TianHao Li, LiMin Wang, Gangshan Wu
In this paper, we show that soft label can serve as a powerful solution to incorporate label correlation into a multi-stage training scheme for long-tailed recognition.
Ranked #43 on Long-tail Learning on CIFAR-100-LT (ρ=100)
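A small sketch of the soft-label idea: blending the hard label with a stage-1 model's temperature-smoothed prediction injects inter-class correlation into the stage-2 training target. The blend weight and temperature are assumptions:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(stage2_logits, stage1_logits, target, alpha=0.5, tau=2.0):
    """Cross-entropy against soft targets that mix the ground-truth label
    with the stage-1 model's smoothed prediction. A sketch."""
    hard = F.one_hot(target, stage2_logits.size(-1)).float()
    soft = F.softmax(stage1_logits / tau, dim=-1)       # encodes label correlation
    mixed = alpha * hard + (1 - alpha) * soft
    return -(mixed * F.log_softmax(stage2_logits, dim=-1)).sum(-1).mean()

loss = soft_label_loss(torch.randn(8, 100), torch.randn(8, 100),
                       torch.randint(0, 100, (8,)))
```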
1 code implementation • ICCV 2021 • Yao Teng, LiMin Wang, Zhifeng Li, Gangshan Wu
Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition.
4 code implementations • CVPR 2022 • Yao Teng, LiMin Wang
The key to our method is a set of learnable triplet queries and a structured triplet detector which could be jointly optimized from the training set in an end-to-end manner.
1 code implementation • CVPR 2021 • Tao Lu, LiMin Wang, Gangshan Wu
Previous point cloud semantic segmentation networks use the same process to aggregate features from neighbors of the same category and different categories.
Ranked #1 on Semantic Segmentation on SYNTHIA
no code implementations • 10 Jun 2021 • Xindi Hu, LiMin Wang, Xin Yang, Xu Zhou, Wufeng Xue, Yan Cao, Shengfeng Liu, Yuhao Huang, Shuangping Guo, Ning Shang, Dong Ni, Ning Gu
In this study, we propose a multi-task framework to learn the relationships among landmarks and structures jointly and automatically evaluate DDH.
1 code implementation • 6 Jun 2021 • Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, LiMin Wang
Three-dimensional face dense alignment and reconstruction in the wild is a challenging problem as partial facial information is commonly missing in occluded and large pose face images.
Ranked #1 on 3D Face Reconstruction on AFLW2000-3D
1 code implementation • 24 May 2021 • Yi Liu, LiMin Wang, Yali Wang, Xiao Ma, Yu Qiao
Temporal action localization (TAL) is an important and challenging problem in video understanding.
1 code implementation • ICCV 2021 • Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, LiMin Wang
Spatio-temporal action detection is an important and challenging problem in video understanding.
1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu
First, we present two different motion representations to enable us to efficiently distinguish the motion-salient frames from the background.
1 code implementation • 1 Apr 2021 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu
Accurate tracking is still a challenging task due to appearance variations, pose and view changes, and geometric deformations of target in videos.
Ranked #1 on Visual Object Tracking on VOT2019
2 code implementations • ICCV 2021 • Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, LiMin Wang, Zhenan Sun
Regression-based methods have recently shown promising results in reconstructing human meshes from monocular images.
Ranked #5 on 3D Human Pose Estimation on AGORA (using extra training data)
2 code implementations • ICCV 2021 • Jing Tan, Jiaqi Tang, LiMin Wang, Gangshan Wu
Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both tasks of temporal action proposal generation and temporal action detection.
no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
Ranked #17 on Action Recognition on Something-Something V1
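A toy sketch of the temporal-difference principle behind TDN: RGB differences between neighboring frames serve as a cheap motion cue that is fused with the appearance feature of a sampled frame. The conv stem is an illustrative stand-in, not TDN's exact module:

```python
import torch
import torch.nn as nn

class ShortTermTemporalDifference(nn.Module):
    """Fuse the center frame's appearance with aggregated frame differences.
    A sketch of the temporal-difference idea."""

    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 7, stride=2, padding=3)
        self.diff_stem = nn.Conv2d(3, channels, 7, stride=2, padding=3)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) consecutive frames around a sampled frame.
        center = clip[:, clip.size(1) // 2]                 # appearance frame
        diff = (clip[:, 1:] - clip[:, :-1]).mean(dim=1)     # aggregated RGB diffs
        return self.stem(center) + self.diff_stem(diff)     # fused feature map

m = ShortTermTemporalDifference()
print(m(torch.randn(2, 5, 3, 112, 112)).shape)  # torch.Size([2, 64, 56, 56])
```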
1 code implementation • CVPR 2018 • Limin Wang, Wei Li, Wen Li, Luc van Gool
Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.
Ranked #51 on Action Recognition on UCF101
11 code implementations • 8 May 2017 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool
Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Ranked #5 on Video Classification on COIN
2 code implementations • CVPR 2017 • Limin Wang, Yuanjun Xiong, Dahua Lin, Luc van Gool
We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet.
Ranked #3 on Action Classification on ActivityNet-1.2
Weakly Supervised Action Localization Weakly-Supervised Action Recognition
2 code implementations • 4 Oct 2016 • Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao
Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partly due to recent large-scale scene datasets such as Places and Places2.
no code implementations • 1 Sep 2016 • Limin Wang, Zhe Wang, Yu Qiao, Luc van Gool
These newly designed transferring techniques exploit multi-task learning frameworks to incorporate extra knowledge from other networks and additional datasets into the training procedure of event CNNs.
19 code implementations • 2 Aug 2016 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool
The other contribution is our study on a series of good practices for learning ConvNets on video data with the help of the temporal segment network.
Ranked #3 on Multimodal Activity Recognition on EV-Action
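A compact sketch of TSN's sparse sampling and segmental consensus: the video is divided into K segments, one snippet per segment is scored by a shared backbone, and the predictions are averaged. The tiny backbone below stands in for the paper's ConvNets:

```python
import torch
import torch.nn as nn

class TemporalSegmentNetwork(nn.Module):
    """Sparse segment sampling with an average segmental consensus.
    A sketch of the TSN scheme."""

    def __init__(self, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, snippets):
        # snippets: (B, K, 3, H, W), one randomly sampled snippet per segment.
        B, K = snippets.shape[:2]
        logits = self.backbone(snippets.flatten(0, 1)).view(B, K, -1)
        return logits.mean(dim=1)   # segmental consensus over the K snippets

tsn = TemporalSegmentNetwork(num_classes=101, num_segments=3)
print(tsn(torch.randn(2, 3, 3, 112, 112)).shape)  # torch.Size([2, 101])
```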
no code implementations • CVPR 2016 • Limin Wang, Yu Qiao, Xiaoou Tang, Luc van Gool
Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location.
Ranked #11 on Action Detection on J-HMDB
no code implementations • 14 Oct 2015 • Limin Wang, Zhe Wang, Sheng Guo, Yu Qiao
Event recognition from still images is one of the most important problems for image understanding.
2 code implementations • 7 Aug 2015 • Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao
We verify the performance of trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205.
5 code implementations • 8 Jul 2015 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao
However, for action recognition in videos, the improvement of deep convolutional networks is not so evident.
Ranked #66 on Action Recognition on UCF101
1 code implementation • CVPR 2015 • Limin Wang, Yu Qiao, Xiaoou Tang
Visual features are of vital importance for human action understanding in videos.
Ranked #59 on Action Recognition on HMDB-51
no code implementations • 2 May 2015 • Limin Wang, Zhe Wang, Wenbin Du, Yu Qiao
Meanwhile, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition.