Text-to-Speech Models

ParaNet

Introduced by Peng et al. in Non-Autoregressive Neural Text-to-Speech

ParaNet is a non-autoregressive, attention-based architecture for text-to-speech that is fully convolutional and converts text to mel spectrograms. ParaNet distills the attention from an autoregressive text-to-spectrogram model, and iteratively refines the alignment between text and spectrogram in a layer-by-layer manner. The architecture is otherwise similar to Deep Voice 3, with the changes confined to the decoder: whereas the DV3 decoder is autoregressive, with multiple attention-based layers each consisting of a causal convolution block followed by an attention block, the ParaNet decoder removes the autoregressive conditioning (so its convolution blocks need not be causal) and uses the attention block in each layer to refine the alignment produced by the previous layer.

Source: Non-Autoregressive Neural Text-to-Speech
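To make the decoder structure described above concrete, here is a minimal PyTorch sketch, not the authors' implementation: the class names, layer sizes, and the exact form of the distillation loss are illustrative assumptions. Each decoder layer applies a convolution block and then attends over the encoder keys/values, so the text-to-spectrogram alignment is re-estimated and refined layer by layer, and a simple cross-entropy term distills a teacher (autoregressive) attention into the non-autoregressive student.

```python
# Hedged sketch of a ParaNet-style non-autoregressive decoder (not the official code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderLayer(nn.Module):
    """One layer: a (non-causal) convolution block followed by an attention block."""

    def __init__(self, hidden_dim: int, kernel_size: int = 5):
        super().__init__()
        # Centered (non-causal) 1-D convolution: with no autoregressive constraint,
        # the layer may look at the whole current spectrogram estimate.
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                              padding=kernel_size // 2)
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x, keys, values):
        # x:      (batch, T_dec, hidden)  decoder hidden states
        # keys:   (batch, T_enc, hidden)  encoder keys
        # values: (batch, T_enc, hidden)  encoder values
        h = F.relu(self.conv(x.transpose(1, 2)).transpose(1, 2)) + x
        q = self.query_proj(h)
        scores = torch.bmm(q, keys.transpose(1, 2)) / math.sqrt(q.size(-1))
        attn = torch.softmax(scores, dim=-1)          # text/spectrogram alignment
        context = torch.bmm(attn, values)
        return self.out_proj(context) + h, attn


class ParaNetDecoderSketch(nn.Module):
    """Stack of layers; each layer re-attends, refining the alignment layer by layer."""

    def __init__(self, hidden_dim: int = 256, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [DecoderLayer(hidden_dim) for _ in range(num_layers)])

    def forward(self, pos_enc, keys, values):
        # pos_enc: positional encodings stand in for autoregressive inputs.
        x, alignments = pos_enc, []
        for layer in self.layers:
            x, attn = layer(x, keys, values)
            alignments.append(attn)                   # one refined alignment per layer
        return x, alignments


def attention_distillation_loss(student_attn, teacher_attn, eps: float = 1e-8):
    """Cross-entropy between teacher (autoregressive) and student alignments,
    averaged over decoder steps; one simple way to distill attention."""
    return -(teacher_attn * torch.log(student_attn + eps)).sum(-1).mean()


if __name__ == "__main__":
    B, T_enc, T_dec, H = 2, 13, 37, 256
    keys = torch.randn(B, T_enc, H)
    values = torch.randn(B, T_enc, H)
    pos = torch.randn(B, T_dec, H)                    # placeholder positional encodings
    dec = ParaNetDecoderSketch(H)
    out, aligns = dec(pos, keys, values)
    teacher = torch.softmax(torch.randn(B, T_dec, T_enc), dim=-1)
    loss = attention_distillation_loss(aligns[-1], teacher)
    print(out.shape, loss.item())
```

In practice the teacher alignment would come from a trained autoregressive text-to-spectrogram model rather than random data, and the decoder output would be projected to mel-spectrogram frames.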
