no code implementations • 14 Jun 2021 • Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
In this paper, we introduce the Kaizen framework, which uses a continuously improving teacher to generate pseudo-labels for semi-supervised automatic speech recognition (ASR).
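A minimal sketch of the continuously-improving-teacher idea, assuming the teacher is maintained as an exponential moving average (EMA) of the student's weights and simplifying the ASR objective to frame-level cross-entropy; names and hyper-parameters are illustrative, not the paper's recipe:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter a small step toward the student's."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

def semi_supervised_step(student, teacher, unlabeled_batch, optimizer, decay=0.999):
    # 1) Teacher produces pseudo-labels on unlabeled audio (no gradients).
    #    Greedy frame-level labels are a simplification for this sketch.
    with torch.no_grad():
        pseudo_targets = teacher(unlabeled_batch).argmax(dim=-1)
    # 2) Student is trained to predict the teacher's pseudo-labels.
    logits = student(unlabeled_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), pseudo_targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # 3) Teacher is continuously improved by tracking the student.
    ema_update(teacher, student, decay)
    return loss.item()
```

In this sketch the teacher would start as a copy of the student (e.g. copy.deepcopy(student)), and the decay controls how quickly the teacher tracks the improving student, avoiding the separate retraining passes of classic pseudo-labeling.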
no code implementations • 9 Nov 2020 • Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig
In this work, to measure the accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations on three popular training criteria: LF-MMI, CTC, and RNN-T (a loss-level sketch of two of these follows below).
Automatic Speech Recognition (ASR) +2
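For reference, two of the three criteria compared above are available as standard losses; a minimal sketch using torch.nn.CTCLoss and torchaudio.functional.rnnt_loss with illustrative shapes (LF-MMI has no equivalent one-liner in core PyTorch and typically relies on dedicated toolkits):

```python
import torch
import torchaudio

T, U, B, V = 120, 20, 4, 32          # frames, target length, batch, vocab (blank = 0)

# CTC: per-frame log-probabilities; targets are label sequences without alignment.
log_probs = torch.randn(T, B, V).log_softmax(-1)            # (T, B, V)
targets = torch.randint(1, V, (B, U))                        # (B, U), no blank symbol
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)
ctc = torch.nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# RNN-T: a joint network scores every (frame, output-token) position.
joint_logits = torch.randn(B, T, U + 1, V)                   # (B, T, U+1, V)
rnnt = torchaudio.functional.rnnt_loss(
    joint_logits, targets.int(), input_lengths.int(), target_lengths.int(), blank=0)
```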
1 code implementation • 5 Nov 2020 • Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers.
Ranked #15 on Speech Recognition on LibriSpeech test-clean
Automatic Speech Recognition (ASR) +1
no code implementations • 28 Sep 2020 • Mandela Patrick, Yuki Asano, Polina Kuznetsova, Ruth Fong, Joao F. Henriques, Geoffrey Zweig, Andrea Vedaldi
In this paper, we show that, for videos, the answer is more complex, and that better results can be obtained by accounting for the interplay between invariance, distinctiveness, multiple modalities and time.
no code implementations • 4 Jun 2020 • Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf
By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata; a sketch of the biasing idea follows below.
Automatic Speech Recognition (ASR) +2
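A minimal sketch of attention-based contextual biasing in the spirit described above; the module, dimensions, and the way the context vector is folded back into the recognizer state are assumptions for illustration, not the authors' model:

```python
import torch
import torch.nn as nn

class ContextualBiasing(nn.Module):
    """Attend over embeddings of metadata words (e.g. a video title) and mix the
    resulting context vector back into the recognizer state before prediction."""
    def __init__(self, hidden_dim=512, ctx_dim=256, vocab=5000):
        super().__init__()
        self.ctx_embed = nn.Embedding(vocab, ctx_dim)
        self.attn = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=4,
                                          kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.combine = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, dec_state, metadata_tokens):
        # dec_state: (B, T, hidden_dim); metadata_tokens: (B, M) word ids
        ctx = self.ctx_embed(metadata_tokens)                  # (B, M, ctx_dim)
        bias, _ = self.attn(query=dec_state, key=ctx, value=ctx)
        return self.combine(torch.cat([dec_state, bias], dim=-1))

biaser = ContextualBiasing()
mixed = biaser(torch.randn(2, 50, 512), torch.randint(0, 5000, (2, 12)))
```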
no code implementations • 19 May 2020 • Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results.
Ranked #17 on Speech Recognition on LibriSpeech test-other (using extra training data)
no code implementations • 16 May 2020 • Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdel-rahman Mohamed
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems.
no code implementations • 15 May 2020 • Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig
Videos uploaded on social media are often accompanied by textual descriptions.
Automatic Speech Recognition (ASR) +2
1 code implementation • ICCV 2021 • Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning.
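As a reference point for the noise contrastive learning mentioned above, a minimal InfoNCE-style sketch in which two augmented views of the same item are positives and all other pairs in the batch are negatives (a generic single-modality formulation, not the paper's multi-modal objective):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: the two views of item i (z1[i], z2[i]) are pulled together,
    while every other pairing in the batch acts as a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# Two views of the same batch, e.g. different crops/augmentations of each clip.
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce(z_a, z_b)
```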
no code implementations • 27 Oct 2019 • Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data.
1 code implementation • 23 Oct 2019 • Andros Tjandra, Chunxi Liu, Frank Zhang, Xiaohui Zhang, Yongqiang Wang, Gabriel Synnaeve, Satoshi Nakamura, Geoffrey Zweig
As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses, we introduce intermediate model heads and loss functions.
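A minimal sketch of what intermediate model heads with an auxiliary loss could look like, assuming CTC-style heads; the layer placement, weighting, and head design here are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class EncoderWithIntermediateHead(nn.Module):
    def __init__(self, dim=256, vocab=32, n_layers=12, mid_layer=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4) for _ in range(n_layers)])
        self.mid_layer = mid_layer
        self.mid_head = nn.Linear(dim, vocab)      # intermediate head
        self.final_head = nn.Linear(dim, vocab)    # final head

    def forward(self, x):                          # x: (T, B, dim) acoustic features
        mid_logits = None
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == self.mid_layer:
                mid_logits = self.mid_head(x)      # partial hypotheses mid-way up the stack
        return mid_logits, self.final_head(x)

def iterated_ctc_loss(mid_logits, final_logits, targets, in_lens, tgt_lens, alpha=0.3):
    """Auxiliary CTC loss on the intermediate head plus the usual final-layer loss."""
    ctc = nn.CTCLoss(blank=0)
    final = ctc(final_logits.log_softmax(-1), targets, in_lens, tgt_lens)
    mid = ctc(mid_logits.log_softmax(-1), targets, in_lens, tgt_lens)
    return (1 - alpha) * final + alpha * mid
```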
no code implementations • 22 Oct 2019 • Yongqiang Wang, Abdel-rahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer
We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition; a minimal sketch of such an acoustic model follows below.
Ranked #23 on Speech Recognition on LibriSpeech test-other (using extra training data)
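For orientation, a hybrid-system acoustic model maps acoustic frames to per-frame senone (tied context-dependent HMM state) posteriors that a conventional decoder then consumes; a minimal transformer-based sketch with illustrative dimensions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TransformerAM(nn.Module):
    """Hybrid-style acoustic model: frames in, per-frame senone log-posteriors out."""
    def __init__(self, feat_dim=80, d_model=512, n_heads=8, n_layers=12, n_senones=8000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               dim_feedforward=2048)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.output = nn.Linear(d_model, n_senones)

    def forward(self, feats):                     # feats: (T, B, feat_dim), e.g. log-Mel
        h = self.encoder(self.proj(feats))
        return self.output(h).log_softmax(-1)     # per-frame senone log-posteriors

am = TransformerAM()
log_post = am(torch.randn(200, 2, 80))            # (T=200, B=2, 8000), fed to the decoder
```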
no code implementations • 2 Oct 2019 • Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fügen, Geoffrey Zweig, Michael L. Seltzer
There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to achieve competitive performance, especially on English, which has poor grapheme-phoneme correspondence.
Automatic Speech Recognition (ASR) +2
no code implementations • LREC 2020 • Chunxi Liu, Qiaochu Zhang, Xiaohui Zhang, Kritika Singh, Yatharth Saraf, Geoffrey Zweig
Towards developing high-performing ASR for low-resource languages, common approaches to address the lack of resources are to make use of data from multiple languages and to augment the training data by creating acoustic variations.
no code implementations • EACL 2017 • Baolin Peng, Michael Seltzer, Y.C. Ju, Geoffrey Zweig, Kam-Fai Wong
This is motivated by an actual system under development to assist in the order-taking process.
3 code implementations • ACL 2017 • Jason D. Williams, Kavosh Asadi, Geoffrey Zweig
End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors.
no code implementations • 3 Jun 2016 • Jason D. Williams, Geoffrey Zweig
This paper presents a model for end-to-end learning of task-oriented dialog systems.
no code implementations • 3 Jun 2016 • Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Kam-Fai Wong
Experimental results indicate that the model outperforms previously proposed neural conversation architectures, and that using specificity in the objective function significantly improves performance for both generation and retrieval.
no code implementations • 29 Oct 2015 • Kaisheng Yao, Geoffrey Zweig, Baolin Peng
The intention network is a recurrent network that models the dynamics of the intention process.
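A minimal sketch of a per-turn intention recurrence consistent with that description; the cell type, dimensions, and interface are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class IntentionRNN(nn.Module):
    """Per-turn recurrence: the intention state evolves once per dialogue turn
    and can condition whatever encoder/decoder handles the turn itself."""
    def __init__(self, turn_dim=256, intent_dim=128):
        super().__init__()
        self.cell = nn.GRUCell(turn_dim, intent_dim)

    def forward(self, turn_encodings):             # (num_turns, B, turn_dim)
        h = turn_encodings.new_zeros(turn_encodings.size(1), self.cell.hidden_size)
        states = []
        for turn in turn_encodings:                # one update per dialogue turn
            h = self.cell(turn, h)
            states.append(h)
        return torch.stack(states)                 # intention state after each turn

intent = IntentionRNN()
states = intent(torch.randn(5, 2, 256))            # 5 turns, batch of 2 dialogues
```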
no code implementations • 31 May 2015 • Kaisheng Yao, Geoffrey Zweig
We find that the simple side-conditioned generation approach is able to rival the state of the art, and we are able to significantly advance the state of the art with bi-directional long short-term memory (LSTM) neural networks that use the same alignment information that is used in conventional approaches.
no code implementations • IJCNLP 2015 • Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell
Two recent approaches have achieved state-of-the-art results in image captioning.
1 code implementation • CVPR 2015 • Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig
The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage.
Ranked #1 on Image Captioning on COCO Captions test