Speech Synthesis

294 papers with code • 4 benchmarks • 19 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable across studies, as they use mean opinion score (MOS) as the metric and collect ratings from different pools of Amazon Mechanical Turk workers.
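
MOS is simply the arithmetic mean of listener ratings on a 1-5 scale, so differences between systems are only meaningful relative to their confidence intervals. A minimal sketch of computing MOS with an approximate 95% confidence interval (the ratings below are invented purely for illustration):

```python
import numpy as np

# Invented listener ratings for one system (1 = bad, 5 = excellent)
ratings = np.array([4, 5, 4, 3, 5, 4, 4, 5, 3, 4], dtype=float)

mos = ratings.mean()
# Approximate 95% confidence interval via the normal approximation
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```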

(Image credit: WaveNet: A Generative Model for Raw Audio)

Most implemented papers

Deep Voice: Real-time Neural Text-to-Speech

NVIDIA/nv-wavenet ICML 2017

We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks.

Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq

NVIDIA/OpenSeq2Seq 25 May 2018

We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training.
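
OpenSeq2Seq itself is a TensorFlow toolkit; as a rough illustration of the mixed-precision idea (FP16 forward and backward passes with dynamic loss scaling so small gradients do not underflow), here is a sketch using PyTorch's torch.cuda.amp rather than the toolkit's own API:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(80, 80).cuda()   # stand-in for a real seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()                     # handles dynamic loss scaling

x = torch.randn(16, 80, device="cuda")
target = torch.randn(16, 80, device="cuda")

with autocast():                          # run the forward pass in FP16 where safe
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()             # scale the loss so FP16 gradients do not underflow
scaler.step(optimizer)                    # unscales gradients, then takes the optimizer step
scaler.update()                           # adapts the loss scale for the next iteration
```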

High Fidelity Speech Synthesis with Adversarial Networks

mbinkowski/DeepSpeechDistances ICLR 2020

However, the application of generative adversarial networks (GANs) in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech.

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

coqui-ai/TTS 23 Oct 2019

Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text.

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

NVIDIA/flowtron ICLR 2021

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
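
The core mechanism here is a normalizing flow: an invertible transformation whose exact log-determinant makes likelihood training possible. A toy sketch of a single affine flow step follows; in Flowtron the scale and shift for each mel frame would be predicted autoregressively from previous frames and the text encoding, whereas the values below are placeholders:

```python
import torch

def affine_flow_step(x_t, log_s, b):
    # Invertible scale-and-shift: maps a data frame x_t to a latent z_t
    z_t = (x_t - b) * torch.exp(-log_s)
    log_det = -log_s.sum()   # contribution to the change-of-variables log-likelihood
    return z_t, log_det

# Toy usage for a single 80-dimensional mel frame with placeholder parameters
x_t = torch.randn(80)
z_t, log_det = affine_flow_step(x_t, log_s=torch.zeros(80), b=torch.zeros(80))
```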

SpeedySpeech: Efficient Neural Speech Synthesis

janvainer/speedyspeech 9 Aug 2020

While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time.

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

maum-ai/wavegrad2 17 Jun 2021

The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.
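
Schematically, the iterative refinement is diffusion-style sampling: start from Gaussian noise and repeatedly apply a learned denoising step conditioned on features derived from the phoneme sequence. A heavily simplified sketch of that loop, where denoise_step is a hypothetical stand-in for the trained network:

```python
import torch

def refine_waveform(denoise_step, cond, num_steps=50, num_samples=24000):
    """Iterative refinement sketch: noise in, waveform out.
    `denoise_step(y, cond, t)` is a hypothetical stand-in for the trained model,
    `cond` for the phoneme-derived conditioning features."""
    y = torch.randn(1, num_samples)        # start from pure Gaussian noise
    for t in reversed(range(num_steps)):   # refine from the noisiest step to the cleanest
        y = denoise_step(y, cond, t)       # each step removes a little more noise
    return y
```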

One TTS Alignment To Rule Them All

coqui-ai/TTS 23 Aug 2021

However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words.

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

microsoft/speecht5 ACL 2022

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
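
The TTS variant of microsoft/speecht5 is also distributed through Hugging Face transformers; assuming the SpeechT5 classes and the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoints, a minimal inference sketch looks like this:

```python
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Speech synthesis with SpeechT5.", return_tensors="pt")

# 512-dimensional x-vector speaker embedding; in practice this comes from a
# speaker-verification model, zeros are only a placeholder for the sketch.
speaker_embeddings = torch.zeros(1, 512)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
```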

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

coqui-ai/TTS 4 Dec 2021

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.
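
The coqui-ai/TTS repository listed above ships a YourTTS checkpoint; assuming its Python API and the model name below, a zero-shot voice-cloning call might look like this:

```python
from TTS.api import TTS

# Model name as assumed from the coqui-ai/TTS release of YourTTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot synthesis: condition on a short reference clip of an unseen speaker
tts.tts_to_file(
    text="This sentence should come out in the reference speaker's voice.",
    speaker_wav="reference_speaker.wav",   # hypothetical path to a reference recording
    language="en",
    file_path="yourtts_output.wav",
)
```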