Text-To-Speech Synthesis

93 papers with code • 6 benchmarks • 17 datasets

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Benchmarks

These leaderboards track progress in Text-To-Speech Synthesis.

Dataset	Best Model
LJSpeech	NaturalSpeech
CMUDict 0.7b	Token-Level Ensemble Distillation
20000 utterances	Mia
HUI speech corpus	Tacotron 2
Thorsten voice 21.02 neutral	Tacotron 2
Trinity Speech-Gesture Dataset	Match-TTSG

Libraries

Use these libraries to find Text-To-Speech Synthesis models and implementations.

Library	Papers	GitHub stars
PaddlePaddle/PaddleSpeech	12	10,154
coqui-ai/TTS	10	29,314
keonlee9420/Expressive-FastSpeech2	5	259
TensorSpeech/TensorflowTTS	4	3,701

See all 12 libraries.

Most implemented papers

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

natspeech/natspeech • NeurIPS 2021

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

coqui-ai/TTS • 4 Dec 2021

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

microsoft/NeuralSpeech • 9 May 2022

The paper first defines human-level quality based on the statistical significance of a subjective measure and introduces guidelines for judging it, then develops a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Rongjiehuang/ProDiff • 13 Jul 2022

Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling.

Non-Autoregressive Neural Text-to-Speech

ksw0306/WaveVAE • ICML 2020

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram.

Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis

rhoposit/MOS_Estimation • 28 Feb 2020

Our NN predicts MOS with a high correlation to human judgments.
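
As an illustration of what "correlation to human judgments" can mean in practice, the sketch below computes the Pearson correlation between predicted and human mean opinion scores (MOS). It is not taken from the paper: the `pearson` helper and the score lists are hypothetical, and the paper may report other or additional correlation measures.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores on the 1-5 MOS scale, one pair per utterance.
human_mos = [4.2, 3.1, 2.5, 4.8, 3.9]
predicted_mos = [4.0, 3.3, 2.9, 4.5, 3.6]

r = pearson(human_mos, predicted_mos)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the predictor ranks and scales utterances much as the listeners did. It does not by itself show that the absolute predicted scores are accurate.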

End-to-End Adversarial Text-to-Speech

yanggeng1995/EATS • ICLR 2021

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest.

Rich Prosody Diversity Modelling with Phone-level Mixture Density Network

keonlee9420/Comprehensive-Transformer-TTS • 1 Feb 2021

Generating natural speech with diverse and smooth prosody patterns is a challenging task.

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

keonlee9420/Expressive-FastSpeech2 • 11 Jun 2021

State-of-the-art context modeling methods in conversational TTS model only the textual information in context, using a recurrent neural network (RNN).

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

roholazandie/ryan-tts • 15 Jun 2021

In order to meet the need for a high quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech which contains textual materials from real-world conversational settings.
