In this work, we propose "global style tokens" (GSTs), a bank of embeddings
that are jointly trained within Tacotron, a state-of-the-art end-to-end speech
synthesis system. The embeddings are trained with no explicit labels, yet learn
to model a large range of acoustic expressiveness...
GSTs lead to a rich set of
significant results. The soft interpretable "labels" they generate can be used
to control synthesis in novel ways, such as varying speed and speaking style,
independently of the text content. They can also be used for style transfer,
replicating the speaking style of a single audio clip across an entire
long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn
to factorize noise and speaker identity, providing a path towards highly
scalable but robust speech synthesis.
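
As a rough illustration of how a bank of embeddings with soft "labels" could drive synthesis, the sketch below forms a style embedding as a softmax-weighted sum over the token bank. The text above does not specify this mechanism, so the weighting scheme, the names `style_embedding` and `token_bank`, and the sizes (10 tokens, 256 dimensions) are all assumptions made for illustration, and the bank here is random rather than trained.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def style_embedding(token_bank, scores):
    """Combine a bank of style tokens into a single style embedding.

    token_bank: array of shape (num_tokens, dim), one row per token.
    scores:     array of shape (num_tokens,), unnormalized token scores.
    Returns the soft weights (the interpretable "labels") and the
    weighted sum of the tokens. This weighting scheme is an assumption.
    """
    weights = softmax(np.asarray(scores, dtype=float))
    return weights, weights @ token_bank


# Hypothetical sizes; a real bank would be learned jointly with the model.
rng = np.random.default_rng(0)
token_bank = rng.standard_normal((10, 256))

# Manual control: put essentially all of the weight on token 3.
scores = np.full(10, -1e9)
scores[3] = 0.0
weights, embedding = style_embedding(token_bank, scores)
```

Under these assumptions, choosing the scores by hand corresponds to controlling style independently of the text, while computing them from a reference audio clip would correspond to style transfer.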