Learning Speaker Embedding from Text-to-Speech

Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification...
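
As a rough illustration of the setup the abstract describes (a sketch under assumed shapes and module names, not the paper's implementation): a speaker encoder maps a reference utterance to a fixed-size embedding, a simplified TTS decoder conditions on that embedding plus the text, and the mel-spectrogram reconstruction loss backpropagates into the speaker encoder, so the speaker representation is learned from the TTS objective. Real systems use an autoregressive, attention-based decoder (e.g. Tacotron); the toy decoder here only shows the conditioning and loss flow.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Reference mel-spectrogram -> utterance-level speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, emb_dim, batch_first=True)

    def forward(self, ref_mels):                      # (B, T_ref, n_mels)
        _, (h, _) = self.rnn(ref_mels)
        return nn.functional.normalize(h[-1], dim=-1) # unit-norm embedding

class ToyTTSDecoder(nn.Module):
    """Stand-in for a Tacotron-style decoder: text + speaker embedding -> mels."""
    def __init__(self, vocab=64, emb_dim=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.GRU(2 * emb_dim, emb_dim, batch_first=True)
        self.proj = nn.Linear(emb_dim, n_mels)

    def forward(self, text, spk_emb):                 # text: (B, T_txt)
        t = self.text_emb(text)                       # (B, T_txt, emb_dim)
        s = spk_emb.unsqueeze(1).expand(-1, t.size(1), -1)
        h, _ = self.rnn(torch.cat([t, s], dim=-1))    # condition every step
        return self.proj(h)                           # (B, T_txt, n_mels)

enc, dec = SpeakerEncoder(), ToyTTSDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

# One dummy training step: the reconstruction loss drives both the decoder
# and, crucially, the speaker encoder that produces the embedding.
ref_mels = torch.randn(4, 120, 80)                    # reference utterances
text = torch.randint(0, 64, (4, 50))                  # phoneme/character ids
target_mels = torch.randn(4, 50, 80)                  # dummy aligned targets

spk_emb = enc(ref_mels)
loss = nn.functional.l1_loss(dec(text, spk_emb), target_mels)
opt.zero_grad(); loss.backward(); opt.step()
```

After training, the speaker encoder can be detached and its embeddings scored with cosine similarity for speaker verification.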


Methods used in the Paper


METHOD                               TYPE
Sigmoid Activation                   Activation Functions
Dilated Causal Convolution           Temporal Convolutions
Highway Layer                        Miscellaneous Components
BiGRU                                Bidirectional Recurrent Neural Networks
Mixture of Logistic Distributions    Output Functions
LSTM                                 Recurrent Neural Networks
BiLSTM                               Bidirectional Recurrent Neural Networks
GRU                                  Recurrent Neural Networks
Max Pooling                          Pooling Operations
Convolution                          Convolutions
Dense Connections                    Feedforward Networks
Tanh Activation                      Activation Functions
WaveNet                              Generative Audio Models
Linear Layer                         Feedforward Networks
Highway Network                      Feedforward Networks
Zoneout                              Regularization
Dropout                              Regularization
Batch Normalization                  Normalization
Residual Connection                  Skip Connections
CBHG                                 Speech Synthesis Blocks
Residual GRU                         Recurrent Neural Networks
Griffin-Lim Algorithm                Phase Reconstruction
Additive Attention                   Attention Mechanisms
Location Sensitive Attention         Attention Mechanisms
Tacotron 2                           Text-to-Speech Models
ReLU                                 Activation Functions
Tacotron                             Text-to-Speech Models