2 code implementations • ACL 2021 • Zhenhai Zhu, Radu Soricut
We describe an efficient hierarchical method to compute attention in the Transformer architecture.
Ranked #1 on Language Modelling on One Billion Word (Validation perplexity metric)
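The abstract only states that attention is computed hierarchically. As an illustrative sketch of the general idea behind hierarchical attention (not the authors' exact algorithm), each query can attend to full-resolution keys within its own block and to average-pooled, coarse keys for all other blocks; the block size, mean-pooling, and two-level structure here are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_attention(q, k, v, block=4):
    """Illustrative two-level 'hierarchical' attention sketch:
    queries see full-resolution keys inside their own block and
    mean-pooled (coarse) keys for every other block, reducing the
    O(n^2) score matrix to O(n * (block + n/block)) entries."""
    n, d = q.shape
    nb = n // block  # assumes n divisible by block, for simplicity
    # coarse keys/values: mean-pool each block of `block` tokens
    k_c = k.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    v_c = v.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    out = np.zeros_like(q)
    for b in range(nb):
        s, e = b * block, (b + 1) * block
        q_b = q[s:e]                              # (block, d)
        # fine scores against the query's own block
        fine = q_b @ k[s:e].T / np.sqrt(d)        # (block, block)
        # coarse scores against all other blocks
        mask = np.arange(nb) != b
        coarse = q_b @ k_c[mask].T / np.sqrt(d)   # (block, nb-1)
        # one softmax over the mixed fine + coarse scores
        w = softmax(np.concatenate([fine, coarse], axis=1), axis=1)
        out[s:e] = w[:, :block] @ v[s:e] + w[:, block:] @ v_c[mask]
    return out
```

This captures only the flavor of hierarchy (near tokens at fine resolution, far tokens coarsened); the published method builds a multi-level hierarchy rather than the single coarse level shown here.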
1 code implementation • Asian Chapter of the Association for Computational Linguistics 2020 • Gabriel Huang, Bo Pang, Zhenhai Zhu, Clara Rivera, Radu Soricut
First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.
Ranked #1 on Dense Video Captioning on YouCook2 (ROUGE-L metric, using extra training data)
1 code implementation • EMNLP 2020 • Jack Hessel, Zhenhai Zhu, Bo Pang, Radu Soricut
Pretraining from unlabelled web videos has quickly become the de facto means of achieving high performance on many video understanding tasks.
Automatic Speech Recognition (ASR) +2
no code implementations • CoNLL 2019 • Jack Hessel, Bo Pang, Zhenhai Zhu, Radu Soricut
Instructional videos draw high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences.
Automatic Speech Recognition (ASR) +1
2 code implementations • ICCV 2017 • Si-Qi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, Kevin Murphy
Finally, we show that our PG method can optimize any of these metrics, including the proposed SPIDEr metric, which yields image captions strongly preferred by human raters over captions generated by the same model trained to optimize MLE or the COCO metrics.