no code implementations • 12 Oct 2023 • Ju-chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli
However, in the field of language modeling, very little effort has been made to model speech and text jointly.
no code implementations • 9 Oct 2023 • Chung-Ming Chien, Mingjiamei Zhang, Ju-chieh Chou, Karen Livescu
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space.
no code implementations • 14 Sep 2023 • Ju-chieh Chou, Chung-Ming Chien, Karen Livescu
In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data.
1 code implementation • 30 Jun 2023 • Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu
Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks.
no code implementations • 16 Feb 2022 • Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba
It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task.
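The paradigm shift described above reduces few-shot TTS to function composition: synthesize with an unmodified pre-existing TTS system, then apply a VC module adapted from the few reference samples. A minimal sketch of that composition follows; every function here is a hypothetical string-based stand-in, not the paper's actual components.

```python
# Illustrative sketch only: these functions are hypothetical stand-ins
# for the pipeline described in the abstract, not the paper's real API.

def pretrained_tts(text: str) -> str:
    """Stand-in for the pre-existing high-quality TTS system,
    which synthesizes in its own (source) voice."""
    return f"audio<{text}|source-voice>"

def few_shot_vc(audio: str, target_samples: list[str]) -> str:
    """Stand-in for a voice-conversion module adapted from a few
    reference samples of the target speaker."""
    return audio.replace("source-voice",
                         f"voice-from-{len(target_samples)}-samples")

def few_shot_tts(text: str, target_samples: list[str]) -> str:
    """Few-shot TTS framed as a VC task: the TTS system is untouched,
    and only the appended VC post-processing sees the target voice."""
    return few_shot_vc(pretrained_tts(text), target_samples)

print(few_shot_tts("hello", ["ref1", "ref2", "ref3"]))
# audio<hello|voice-from-3-samples>
```

The design point is that the TTS front-end never needs retraining for a new speaker; all speaker adaptation is confined to the VC post-processor.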
3 code implementations • 7 Apr 2021 • Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-Yi Lee
AutoVC uses d-vectors to extract speaker information, while FragmentVC uses self-supervised learning (SSL) features such as wav2vec 2.0 to extract phonetic content information.
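The recipe above rests on disentanglement: a content encoder produces one phonetic feature vector per frame of the source utterance, while a speaker encoder produces one fixed-size embedding per target utterance, and a decoder combines the two. The sketch below shows only the shapes and the combination step; both encoders are random stand-ins for the real models (a d-vector speaker encoder and a wav2vec 2.0 content encoder), and all dimensions are assumptions for illustration.

```python
import numpy as np

# Illustrative sketch only: random stand-ins for the encoders named in the
# text; dimensions and frame rates are assumptions, not the papers' values.
rng = np.random.default_rng(0)

def content_encoder(waveform: np.ndarray, dim: int = 8) -> np.ndarray:
    """Stand-in for an SSL content encoder (e.g. wav2vec 2.0):
    one feature vector per 20 ms frame (320 samples at 16 kHz)."""
    n_frames = len(waveform) // 320
    return rng.standard_normal((n_frames, dim))

def speaker_encoder(waveform: np.ndarray, dim: int = 4) -> np.ndarray:
    """Stand-in for a d-vector speaker encoder: one fixed-size
    embedding per utterance, regardless of utterance length."""
    return rng.standard_normal(dim)

def convert(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Any-to-any conversion recipe: phonetic content from the source,
    speaker identity from the target, concatenated frame by frame to
    form the input a decoder would turn back into a waveform."""
    content = content_encoder(source)              # (frames, 8)
    speaker = speaker_encoder(target)              # (4,)
    tiled = np.tile(speaker, (len(content), 1))    # repeat for every frame
    return np.concatenate([content, tiled], axis=1)  # (frames, 12)

src = rng.standard_normal(16000)   # 1 s of "source" speech at 16 kHz
tgt = rng.standard_normal(32000)   # 2 s of "target-speaker" speech
features = convert(src, tgt)
print(features.shape)  # (50, 12): frame count from source, speaker dims appended
```

Because the speaker embedding is length-independent, any target utterance can supply the voice, which is what makes the approach any-to-any rather than tied to speakers seen in training.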
1 code implementation • 6 Mar 2021 • Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-Yi Lee
The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances whose voice and speaking style match those of a reference speaker, given only a few reference samples.
1 code implementation • 12 Nov 2020 • Chung-Ming Chien, Hung-Yi Lee
Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks.
2 code implementations • 27 Oct 2020 • Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-Yi Lee, Lin-shan Lee
Any-to-any voice conversion aims to convert voices from and to arbitrary speakers, even those unseen during training. This is much more challenging than the one-to-one or many-to-many settings, but much more attractive in real-world scenarios.