An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems

End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis. They allow the production of high-quality synthesized speech with little to no text preprocessing. Indeed, they can be trained directly on either graphemes or phonemes as input. However, in the case of grapheme inputs, little is known about the relation between the underlying representations learned by the model and word pronunciations. This work investigates this relation for a Tacotron model trained on French graphemes. Our analysis shows that grapheme embeddings are related to phoneme information despite no such information being present during training. Thanks to this property, we show that grapheme embeddings learned by Tacotron models can be useful for tasks such as grapheme-to-phoneme conversion and control of pronunciation in synthetic speech.
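A simple way to probe the kind of relation the abstract describes is to inspect the learned grapheme embedding table with a nearest-neighbor query: if embeddings encode phoneme information, graphemes with similar pronunciations should land close together. The sketch below is purely illustrative and uses a synthetic stand-in embedding matrix; in practice the vectors would be extracted from the trained Tacotron model's character-embedding layer, and the grapheme inventory, dimensionality, and probing method are all assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical probe: rank graphemes by cosine similarity of their embeddings.
# The embeddings here are random placeholders standing in for the vectors one
# would read out of a trained Tacotron character-embedding layer.
rng = np.random.default_rng(0)
graphemes = list("abcdefghij")  # toy inventory; a real model would use the full charset
dim = 16                        # assumed embedding size, not taken from the paper
embeddings = {g: rng.normal(size=dim) for g in graphemes}

def nearest_neighbors(query, k=3):
    """Return the k graphemes whose embeddings are most cosine-similar to `query`."""
    q = embeddings[query]
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted((g for g in graphemes if g != query),
                    key=lambda g: cos(embeddings[g]), reverse=True)
    return ranked[:k]

print(nearest_neighbors("a"))
```

With real Tacotron embeddings, one would check whether the top neighbors of a grapheme share its typical phoneme realization, or train a small classifier from embeddings to phoneme classes as a quantitative probe.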
