Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech

ICLR 2021 · Yoonhyung Lee, Joongbo Shin, Kyomin Jung

Although early text-to-speech (TTS) models such as Tacotron 2 have succeeded in generating human-like speech, their autoregressive (AR) architectures are limited in that generating a mel-spectrogram, which consists of hundreds of steps, takes considerable time. In this paper, we propose a novel non-autoregressive TTS model called BVAE-TTS, which eliminates this architectural limitation and generates a mel-spectrogram in parallel. BVAE-TTS adopts a bidirectional-inference variational autoencoder (BVAE) that learns hierarchical latent representations using both bottom-up and top-down paths to increase its expressiveness. To apply BVAE to TTS, we design our model to utilize text information via an attention mechanism. Using the attention maps that BVAE-TTS generates, we train a duration predictor so that the model can use the predicted length of each phoneme at inference time. In experiments on the LJSpeech dataset, we show that our model generates a mel-spectrogram 27 times faster than Tacotron 2 with similar speech quality. Furthermore, BVAE-TTS outperforms Glow-TTS, one of the state-of-the-art non-autoregressive TTS models, in terms of both speech quality and inference speed while having 58% fewer parameters.
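For illustration, the following is a minimal PyTorch-style sketch of one level of a bidirectional-inference VAE as described in the abstract. The class name `TopDownBlock`, the layer sizes, and the 1x1-convolution parameterization are hypothetical and are not taken from the paper; the sketch only shows the general idea that the level-wise posterior combines a bottom-up feature with the top-down state, while the prior depends on the top-down path alone, so that at inference the latents can be sampled from the prior without a bottom-up pass.

```python
import torch
import torch.nn as nn


class TopDownBlock(nn.Module):
    """One level of a bidirectional-inference VAE (hypothetical simplification).

    Training:  sample z from q(z | bottom-up, top-down) and accumulate the KL
               term against the prior p(z | top-down).
    Inference: sample z directly from the prior p(z | top-down).
    """

    def __init__(self, channels: int, z_dim: int):
        super().__init__()
        self.prior = nn.Conv1d(channels, 2 * z_dim, kernel_size=1)
        self.posterior = nn.Conv1d(2 * channels, 2 * z_dim, kernel_size=1)
        self.out = nn.Conv1d(channels + z_dim, channels, kernel_size=1)

    def forward(self, top_down, bottom_up=None):
        # Prior parameters predicted from the top-down state only.
        p_mu, p_logvar = self.prior(top_down).chunk(2, dim=1)

        if bottom_up is not None:
            # Posterior parameters combine bottom-up and top-down information.
            q_mu, q_logvar = self.posterior(
                torch.cat([top_down, bottom_up], dim=1)).chunk(2, dim=1)
            z = q_mu + torch.randn_like(q_mu) * torch.exp(0.5 * q_logvar)
            # KL( q || p ) between the two diagonal Gaussians at this level.
            kl = 0.5 * (p_logvar - q_logvar
                        + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                        - 1).sum()
        else:
            # Inference path: no bottom-up feature, sample from the prior.
            z = p_mu + torch.randn_like(p_mu) * torch.exp(0.5 * p_logvar)
            kl = torch.zeros((), device=top_down.device)

        h = self.out(torch.cat([top_down, z], dim=1))
        return h, kl


# Toy usage (shapes are illustrative: batch, channels, mel frames).
block = TopDownBlock(channels=64, z_dim=16)
td = torch.randn(2, 64, 100)   # top-down state from the level above
bu = torch.randn(2, 64, 100)   # bottom-up feature from the mel encoder
h, kl = block(td, bu)          # training: posterior sampling + KL term
h, _ = block(td)               # inference: prior sampling only
```

In a full model, several such levels would be stacked to form the hierarchical latent representation, with text information injected via attention and the duration predictor replacing the attention alignment at inference, as described above.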
