no code implementations • 15 Dec 2023 • Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
During the inference stage, the learned text-3D correspondence helps ground text queries to the target 3D objects even without 2D images.
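A minimal sketch of the inference step described above, assuming a learned joint embedding space in which text queries and 3D objects can be compared by cosine similarity; the function and variable names here are hypothetical, not the authors' API.

```python
import numpy as np

def ground_text_to_3d(text_embedding: np.ndarray,
                      object_embeddings: np.ndarray) -> int:
    """Return the index of the 3D object whose embedding is closest
    to the text query embedding (cosine similarity)."""
    text = text_embedding / np.linalg.norm(text_embedding)
    objs = object_embeddings / np.linalg.norm(object_embeddings, axis=1, keepdims=True)
    similarities = objs @ text           # one score per candidate 3D object
    return int(np.argmax(similarities))  # grounded target object

# Toy usage: 5 candidate objects with 256-d embeddings, one 256-d text query.
rng = np.random.default_rng(0)
print(ground_text_to_3d(rng.standard_normal(256), rng.standard_normal((5, 256))))
```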
no code implementations • 10 Mar 2022 • Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu
New benchmarking results indicate that our proposed evaluation protocols can better monitor research progress.
1 code implementation • 2 Dec 2021 • Yitian Yuan, Lin Ma, Jingwen Wang, Wenwu Zhu
In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence.
1 code implementation • 2 Dec 2021 • Yitian Yuan, Lin Ma, Wenwu Zhu
Enhancing the diversity of the sentences generated to describe video content is an important problem in recent video captioning research.
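One common way to quantify caption diversity is the distinct-n ratio: the fraction of unique n-grams among all n-grams produced across the generated captions. The sketch below is a generic measure of this kind, not necessarily the specific one used in the paper.

```python
from collections import Counter

def distinct_n(captions: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all captions."""
    ngrams = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

captions = ["a man is cooking", "a man is cooking food", "a chef prepares a meal"]
print(f"distinct-2 = {distinct_n(captions, 2):.2f}")
```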
no code implementations • 16 Sep 2021 • Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, Wenwu Zhu
In this survey, we give a comprehensive overview of TSGV, which i) summarizes the taxonomy of existing methods, ii) provides a detailed description of the evaluation protocols (i.e., datasets and metrics) used in TSGV, and iii) discusses in depth the potential problems of current benchmark designs and research directions for further investigation.
no code implementations • 22 Jan 2021 • Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, Wenwu Zhu
All the results demonstrate that the re-organized dataset splits and new metric can better monitor the progress in TSGV.
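For context, the standard TSGV metric is "R@n, IoU@m": the fraction of queries for which at least one of the top-n predicted segments overlaps the ground-truth segment with temporal IoU >= m. A minimal sketch of that baseline metric follows (the paper's re-organized splits and discounted variant are not reproduced here):

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(predictions, ground_truths, n=1, m=0.5):
    """predictions: per-query lists of (start, end) segments, ranked by score."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

preds = [[(2.0, 7.5), (10.0, 14.0)], [(0.0, 3.0)]]
gts = [(2.5, 8.0), (5.0, 9.0)]
print(recall_at_n_iou(preds, gts, n=1, m=0.5))  # 0.5
```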
1 code implementation • NeurIPS 2019 • Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu
Temporal sentence grounding in videos aims to detect and localize the target video segment that semantically corresponds to a given sentence.
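A minimal sketch of one generic approach to this task: score sliding-window segment proposals against the sentence embedding and return the best-scoring (start, end). This illustrates the task interface only, not the authors' semantic-conditioned method; all names and the pooling rule are illustrative assumptions.

```python
import numpy as np

def localize(clip_feats: np.ndarray, sent_emb: np.ndarray,
             window: int = 4, stride: int = 2):
    """clip_feats: (T, d) per-clip features; sent_emb: (d,) sentence embedding.
    Returns (start_clip, end_clip) of the best-matching window."""
    sent = sent_emb / np.linalg.norm(sent_emb)
    best, best_score = None, -np.inf
    for start in range(0, len(clip_feats) - window + 1, stride):
        seg = clip_feats[start:start + window].mean(axis=0)  # pool the segment
        score = seg @ sent / np.linalg.norm(seg)             # cosine match
        if score > best_score:
            best, best_score = (start, start + window), score
    return best

rng = np.random.default_rng(1)
print(localize(rng.standard_normal((20, 128)), rng.standard_normal(128)))
```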
1 code implementation • 12 Aug 2019 • Yitian Yuan, Lin Ma, Wenwu Zhu
With the tremendous growth of videos over the Internet, video thumbnails, which provide previews of video content, are becoming increasingly important in shaping users' online search experience.
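A minimal sketch of sentence-specified thumbnail selection under a simple assumption: rank candidate clips by similarity to the query sentence and keep the top-k, in temporal order, as the thumbnail. The names and the plain ranking rule are illustrative, not the paper's method.

```python
import numpy as np

def select_thumbnail_clips(clip_feats: np.ndarray, query_emb: np.ndarray,
                           k: int = 3) -> list[int]:
    """Return indices of the k clips most relevant to the query, in video order."""
    clips = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    scores = clips @ query
    topk = np.argsort(scores)[-k:]   # k highest-scoring clips
    return sorted(topk.tolist())     # keep temporal order for the preview

rng = np.random.default_rng(2)
print(select_thumbnail_clips(rng.standard_normal((12, 64)), rng.standard_normal(64)))
```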
no code implementations • 19 Apr 2018 • Yitian Yuan, Tao Mei, Wenwu Zhu
Then, a multi-modal co-attention mechanism is introduced to generate not only video attention, which reflects the global video structure, but also sentence attention, which highlights the crucial details for temporal localization.
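A minimal sketch of a multi-modal co-attention step: a shared cross-modal affinity matrix between clip and word features yields both a video attention over clips and a sentence attention over words. The plain dot-product affinity and max-pooling used here are simplifying assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(video: np.ndarray, sentence: np.ndarray):
    """video: (T, d) clip features; sentence: (L, d) word features."""
    affinity = video @ sentence.T                # (T, L) cross-modal scores
    video_attn = softmax(affinity.max(axis=1))   # (T,) attention over clips
    sent_attn = softmax(affinity.max(axis=0))    # (L,) attention over words
    attended_video = video_attn @ video          # (d,) attended video summary
    attended_sent = sent_attn @ sentence         # (d,) attended sentence summary
    return attended_video, attended_sent

rng = np.random.default_rng(3)
v, s = co_attention(rng.standard_normal((10, 32)), rng.standard_normal((6, 32)))
print(v.shape, s.shape)  # (32,) (32,)
```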