Search Results for author: Dongchao Yang

Found 14 papers, 4 papers with code

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

no code implementations • 4 Apr 2024 • Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.

Language Modelling Speech Synthesis +1

Paper
Add Code

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

no code implementations • 5 Mar 2024 • Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt.

Quantization Speech Synthesis

Paper
Add Code

Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

no code implementations • 24 Dec 2023 • Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

In this paper, we present CaRE-SEP, a consistent and relevant embedding network for general sound separation to encourage a comprehensive reconsideration of query usage in audio separation.

Paper
Add Code

DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

1 code implementation • 6 Oct 2023 • Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

Common target sound extraction (TSE) approaches primarily relied on discriminative approaches in order to separate the target sound while minimizing interference from the unwanted sources, with varying success in separating the target from the background.

Target Sound Extraction

Paper
Code

PromptTTS 2: Describing and Generating Voices with Text Prompt

no code implementations • 5 Sep 2023 • Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech.

Language Modelling Large Language Model

Paper
Add Code

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

no code implementations • 30 May 2023 • Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common.

Singing Voice Synthesis Voice Conversion

Paper
Add Code

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

no code implementations • 29 May 2023 • Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao

Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data.

Ranked #7 on Audio Generation on AudioCaps

Audio Generation Denoising +2

Paper
Add Code

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

1 code implementation • 25 Apr 2023 • Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i. e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.

9,837

Paper
Code

Improving Weakly Supervised Sound Event Detection with Causal Intervention

no code implementations • 10 Mar 2023 • Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou

Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i. e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision.

Event Detection Sound Event Detection

Paper
Add Code

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

1 code implementation • 30 Jan 2023 • Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.

Ranked #11 on Audio Generation on AudioCaps

Audio Generation Text-to-Video Generation +1

702

Paper
Code

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

1 code implementation • 20 Jul 2022 • Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.

Ranked #13 on Audio Generation on AudioCaps

Audio Generation Decoder

335

Paper
Code

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

no code implementations • 15 Apr 2022 • Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

Dominant researches adopt supervised training for speaker extraction, while the scarcity of ideally clean corpus and channel mismatch problem are rarely considered.

Domain Adaptation

Paper
Add Code

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

no code implementations • 4 Apr 2022 • Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, due to the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings.

blind source separation Metric Learning +2

Paper
Add Code

Towards Data Distillation for End-to-end Spoken Conversational Question Answering

no code implementations • 18 Oct 2020 • Chenyu You, Nuo Chen, Fenglin Liu, Dongchao Yang, Yuexian Zou

In spoken question answering, QA systems are designed to answer questions from contiguous text spans within the related speech transcripts.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.