no code implementations • 17 Sep 2023 • Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer
Using the generated text with JAT and TTS for spoken semantic parsing improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains respectively.
no code implementations • 22 Jul 2023 • Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse directly from speech have recently become more promising.
no code implementations • 15 Nov 2022 • Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-Yi Lee, Yizhou Sun, Wei Wang
Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information.
Automatic Speech Recognition (ASR) +10
no code implementations • 31 Oct 2022 • Suyoun Kim, Ke Li, Lucas Kabela, Rongqing Huang, Jiedan Zhu, Ozlem Kalinli, Duc Le
In this work, we present our Joint Audio/Text training method for Transformer Rescorer, to leverage unpaired text-only data which is relatively cheaper than paired audio-text data.
no code implementations • 4 Apr 2022 • Duc Le, Akshat Shrivastava, Paden Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael L. Seltzer
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings.
Automatic Speech Recognition (ASR) +4
no code implementations • 11 Oct 2021 • Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications.
Automatic Speech Recognition (ASR) +3
no code implementations • 5 Apr 2021 • Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer
How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area.
no code implementations • 5 Apr 2021 • Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
We define SemDist as the distance between a reference and hypothesis pair in a sentence-level embedding space.
Automatic Speech Recognition (ASR) +14
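The SemDist idea above reduces to a simple computation: embed the reference and the hypothesis with a sentence-level encoder, then measure their distance. A minimal sketch, assuming 1 minus cosine similarity as the distance and using a hashed bag-of-words vector as a toy placeholder for the pretrained sentence encoder the work relies on:

```python
import hashlib
import numpy as np

def embed(sentence, dim=64):
    # Toy stand-in for a sentence encoder: hashed bag-of-words vector.
    # The actual SemDist work uses a pretrained sentence-level embedding
    # model; this placeholder only illustrates the metric itself.
    v = np.zeros(dim)
    for tok in sentence.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        v[bucket] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def semdist(reference, hypothesis):
    # SemDist: distance between the reference/hypothesis pair in a
    # sentence-level embedding space (here, 1 - cosine similarity).
    return 1.0 - float(np.dot(embed(reference), embed(hypothesis)))
```

Identical sentences score (near) zero, and semantically related hypotheses with different surface forms can score lower than a word-error-rate view would suggest, which is the motivation for the metric.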
1 code implementation • 5 Nov 2020 • Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers.
Ranked #15 on Speech Recognition on LibriSpeech test-clean
Automatic Speech Recognition (ASR) +1
no code implementations • 26 Oct 2020 • Suyoun Kim, Yuan Shangguan, Jay Mahadeokar, Antoine Bruguier, Christian Fuegen, Michael L. Seltzer, Duc Le
Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training.
no code implementations • 24 Jul 2019 • Suyoun Kim, Siddharth Dalmia, Florian Metze
We present an end-to-end speech recognition model that learns interaction between two speakers based on the turn-changing information.
no code implementations • ACL 2019 • Suyoun Kim, Siddharth Dalmia, Florian Metze
We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings.
no code implementations • NAACL 2019 • Suyoun Kim, Florian Metze
Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation.
no code implementations • 7 Aug 2018 • Suyoun Kim, Florian Metze
Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e.g., higher-level knowledge that spans across sentences or speakers, can help the processing of long conversations.
no code implementations • 6 Nov 2017 • Suyoun Kim, Michael L. Seltzer
Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters.
1 code implementation • 6 Nov 2017 • Suyoun Kim, Michael L. Seltzer, Jinyu Li, Rui Zhao
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training.
8 code implementations • 21 Sep 2016 • Suyoun Kim, Takaaki Hori, Shinji Watanabe
Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments.
no code implementations • 11 Jan 2016 • Suyoun Kim, Bhiksha Raj, Ian Lane
We propose a novel deep neural network architecture for speech recognition that explicitly employs knowledge of the background environmental noise within a deep neural network acoustic model.
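One common way to make an acoustic model noise-aware, consistent with the description above, is to condition every input frame on an estimate of the background environment. A minimal sketch (the paper's exact conditioning mechanism may differ; the feature dimensions and the noise embedding here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_aware_features(frames, noise_embedding):
    # Append a per-utterance noise representation to every frame so a
    # DNN acoustic model can condition on the acoustic environment.
    t = frames.shape[0]
    tiled = np.tile(noise_embedding, (t, 1))
    return np.concatenate([frames, tiled], axis=1)

frames = rng.standard_normal((100, 40))   # 100 frames of 40-dim features
noise_vec = rng.standard_normal(8)        # hypothetical noise embedding
x = noise_aware_features(frames, noise_vec)
```

The augmented features then feed the acoustic model in place of the raw frames, letting the network learn noise-dependent behavior without a separate enhancement front end.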
no code implementations • 19 Nov 2015 • Suyoun Kim, Ian Lane
Integration of multiple microphone data is one of the key ways to achieve robust speech recognition in noisy environments or when the speaker is located at some distance from the input device.
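The classical baseline for the multi-microphone integration described above is delay-and-sum beamforming: align each channel by its time delay, then average. The paper itself pursues a neural approach, but as a minimal sketch of what "integration of multiple microphone data" means:

```python
import numpy as np

def delay_and_sum(channels, delays):
    # Align each microphone channel by its integer sample delay, then
    # average the aligned signals -- the simplest multi-channel
    # integration. np.roll wraps circularly; a real implementation
    # would pad instead of wrapping.
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```

With correct delays, the target speech adds coherently across channels while uncorrelated noise partially cancels, which is what makes distant-talking recognition more robust.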
no code implementations • 9 Dec 2014 • Seungwhan Moon, Suyoun Kim, Haohan Wang
We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality.