We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.
In recent years, spoken language technologies have improved considerably thanks to deep learning.
EMOTIONAL SPEECH SYNTHESIS, EXPRESSIVE SPEECH SYNTHESIS, TEXT-TO-SPEECH SYNTHESIS, TRANSFER LEARNING
The field of text-to-speech has improved greatly in recent years, benefiting from deep learning techniques.
EMOTIONAL SPEECH SYNTHESIS, EXPRESSIVE SPEECH SYNTHESIS, LATENT VARIABLE MODELS, LEARNING NETWORK REPRESENTATIONS, SPEECH EMOTION RECOGNITION, TEXT-TO-SPEECH SYNTHESIS
We propose prosody embeddings for emotional and expressive speech synthesis networks.