Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices...


TASK              DATASET                 MODEL     METRIC NAME         METRIC VALUE  GLOBAL RANK
Speech Synthesis  North American English  Tacotron  Mean Opinion Score  4.001         # 4

Methods used in the Paper


METHOD                 TYPE
Griffin-Lim Algorithm  Phase Reconstruction
Sigmoid Activation     Activation Functions
Highway Layer          Miscellaneous Components
Residual GRU           Recurrent Neural Networks
BiGRU                  Bidirectional Recurrent Neural Networks
Highway Network        Feedforward Networks
Residual Connection    Skip Connections
Convolution            Convolutions
ReLU                   Activation Functions
Batch Normalization    Normalization
Dropout                Regularization
Dense Connections      Feedforward Networks
Tanh Activation        Activation Functions
Max Pooling            Pooling Operations
Additive Attention     Attention Mechanisms
GRU                    Recurrent Neural Networks
Step Decay             Learning Rate Schedules
Adam                   Stochastic Optimization
CBHG                   Speech Synthesis Blocks
Tacotron               Text-to-Speech Models