no code implementations • 15 Sep 2023 • Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions.
no code implementations • 23 May 2023 • Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari
We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interlocutor's emotion.
no code implementations • 23 May 2023 • Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue.
no code implementations • 28 Oct 2022 • Yuma Shirahata, Ryuichi Yamamoto, Eunwoo Song, Ryo Terashima, Jae-Min Kim, Kentaro Tachibana
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
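A sample-level sinusoidal source of the kind described can be sketched as follows — a minimal illustration, not the paper's actual generator: frame-level F0 is upsampled to the sample rate and the phase is accumulated so the sinusoid stays continuous across frame boundaries, with unvoiced frames (F0 = 0) mapped to silence.

```python
import math

def sinusoidal_source(f0_frames, hop_size, sample_rate):
    """Generate a sample-level sinusoidal excitation from frame-level F0.

    Unvoiced frames (F0 == 0) produce zeros. Phase is accumulated per
    sample so the sinusoid is continuous at frame boundaries.
    """
    source = []
    phase = 0.0
    for f0 in f0_frames:
        for _ in range(hop_size):
            if f0 > 0:
                phase += 2.0 * math.pi * f0 / sample_rate
                source.append(math.sin(phase))
            else:
                source.append(0.0)
    return source
```

In a neural vocoder, a source like this conditions the waveform decoder so that pitch is imposed explicitly rather than learned implicitly.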
1 code implementation • 28 Oct 2022 • Reo Yoneyama, Ryuichi Yamamoto, Kentaro Tachibana
Neural audio super-resolution models are typically trained on low- and high-resolution audio signal pairs.
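Such low/high-resolution pairs are commonly built by degrading high-resolution audio. A crude sketch (my illustration, not the paper's pipeline): a moving-average lowpass followed by decimation stands in for proper anti-aliasing filtering and resampling.

```python
def make_lowres(signal, factor):
    """Simulate a low-resolution signal from a high-resolution one by
    moving-average smoothing (a crude anti-aliasing stand-in) followed
    by decimation with the given integer factor."""
    return [
        sum(signal[i:i + factor]) / factor
        for i in range(0, len(signal) - factor + 1, factor)
    ]
```

The model is then trained to map each decimated signal back to its original, so the degradation choice directly shapes what the model learns to restore.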
1 code implementation • 28 Oct 2022 • Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform.
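The inverse short-time Fourier transform step can replace learned upsampling layers in the decoder. The sketch below shows only that final reconstruction stage, assuming the model has already predicted per-frame magnitude and phase spectra; the multi-band decomposition and the model itself are omitted.

```python
import numpy as np

def istft_synthesis(magnitude, phase, hop):
    """Reconstruct a waveform from predicted magnitude and phase spectra
    via inverse real FFT and windowed overlap-add.

    magnitude, phase: arrays of shape (n_frames, n_fft // 2 + 1).
    """
    n_frames, n_bins = magnitude.shape
    n_fft = (n_bins - 1) * 2
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    window = np.hanning(n_fft)
    for t in range(n_frames):
        spec = magnitude[t] * np.exp(1j * phase[t])
        frame = np.fft.irfft(spec) * window
        out[t * hop:t * hop + n_fft] += frame
    return out
```

Because the iSTFT is a fixed, cheap operation, moving it into the synthesis path is one way such a model stays lightweight.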
no code implementations • 16 Jun 2022 • Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling.
no code implementations • 21 Apr 2022 • Ryo Terashima, Ryuichi Yamamoto, Eunwoo Song, Yuma Shirahata, Hyun-Wook Yoon, Jae-Min Kim, Kentaro Tachibana
Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available.
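At its simplest, pitch-shift augmentation applied to the pitch feature itself scales an F0 contour by a semitone offset. This is a minimal sketch of that idea only; real systems typically shift the waveform (e.g. with a vocoder or phase-vocoder method) rather than just the contour.

```python
def pitch_shift_f0(f0_contour, semitones):
    """Scale an F0 contour by a semitone offset, leaving unvoiced
    frames (F0 == 0) untouched. One semitone is a factor of 2**(1/12)."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f0 * ratio if f0 > 0 else 0.0 for f0 in f0_contour]
```

Sweeping the offset over, say, ±6 semitones yields training data covering pitch ranges absent from the original recordings.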
no code implementations • 28 Mar 2022 • Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
We describe our methodology to construct an empathetic dialogue speech corpus and report the analysis results of the STUDIES corpus.
1 code implementation • 26 Apr 2021 • Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model (BERT) and explicit features extracted from a BiLSTM with linguistic features.
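One plausible way to combine the two feature streams is per-token concatenation before the phrase-break classifier — a hypothetical fusion sketch, not necessarily the paper's exact architecture:

```python
def fuse_features(bert_vecs, ling_vecs):
    """Concatenate per-token implicit (BERT) and explicit (BiLSTM
    linguistic) feature vectors, one fused vector per token."""
    assert len(bert_vecs) == len(ling_vecs), "one vector per token"
    return [b + l for b, l in zip(bert_vecs, ling_vecs)]
```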
no code implementations • 6 Sep 2018 • Koichi Hamada, Kentaro Tachibana, Tianqi Li, Hiroto Honda, Yusuke Uchida
Our method tackles the limitations by progressively increasing the resolution of both generated images and structural conditions during training.