OverFlow: Putting flows on top of neural transducers for better TTS

13 Nov 2022 · Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.
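The claim that a flow-based model can be trained with exact maximum likelihood rests on the change-of-variables formula: if an invertible map sends an observation x to a latent z with a simple density, then log p(x) = log p(z) + log|det dz/dx|. The following is a minimal sketch of that formula only, assuming a single elementwise affine transform and a standard Gaussian base density. The function names and the affine form are illustrative assumptions and are not the architecture described in the paper, which combines flows with a neural HMM.

```python
import numpy as np


def standard_normal_logpdf(z):
    """Log-density of a standard Gaussian, summed over the last axis."""
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=-1)


def affine_flow_inverse(x, log_scale, shift):
    """Map data x to latent z = (x - shift) * exp(-log_scale).

    Returns z and log|det dz/dx|. The Jacobian is diagonal, so the
    log-determinant is minus the sum of the log-scales.
    """
    z = (x - shift) * np.exp(-log_scale)
    log_det = -np.sum(log_scale)
    return z, log_det


def flow_log_likelihood(x, log_scale, shift):
    """Exact log-likelihood of x under the toy flow (hypothetical helper)."""
    z, log_det = affine_flow_inverse(x, log_scale, shift)
    return standard_normal_logpdf(z) + log_det


if __name__ == "__main__":
    x = np.array([[0.5, -1.0], [2.0, 0.3]])
    log_scale = np.log(np.array([2.0, 0.5]))
    shift = np.array([1.0, 0.0])
    print(flow_log_likelihood(x, log_scale, shift))
```

With an affine transform the result coincides with a diagonal Gaussian whose mean is `shift` and whose standard deviation is `exp(log_scale)`. Practical flows stack several nonlinear invertible layers so that the modelled density can be non-Gaussian, while the likelihood stays exact because each layer contributes its own log-determinant term.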




Results from the Paper


Ranked #11 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Task                      Dataset   Model     Metric Name            Metric Value  Global Rank
Text-To-Speech Synthesis  LJSpeech  OverFlow  Audio Quality MOS      3.37          #11
Text-To-Speech Synthesis  LJSpeech  OverFlow  Word Error Rate (WER)  2.30          #1
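For readers unfamiliar with the Word Error Rate metric listed above, it is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. The source does not say how the transcripts behind the reported value were obtained or normalised, so the function name and the whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Tokenisation is a plain whitespace split (an assumption; real
    evaluations usually also normalise case and punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1 when the hypothesis is much longer than the reference.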
