|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #4 on Speech Synthesis on North American English
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units.
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i. e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs).
Ranked #1 on Text-To-Speech Synthesis on LJSpeech
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
Ranked #1 on Text-To-Speech Synthesis on LJSpeech (Pleasantness MOS metric)