Text-To-Speech Synthesis

93 papers with code • 6 benchmarks • 17 datasets

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Most implemented papers

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

natspeech/natspeech NeurIPS 2021

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

coqui-ai/TTS 4 Dec 2021

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

microsoft/NeuralSpeech 9 May 2022

We first define human-level quality based on the statistical significance of subjective measures and introduce guidelines for judging it, then develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Rongjiehuang/ProDiff 13 Jul 2022

Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which makes it hard to accelerate sampling.

Non-Autoregressive Neural Text-to-Speech

ksw0306/WaveVAE ICML 2020

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram.

End-to-End Adversarial Text-to-Speech

yanggeng1995/EATS ICLR 2021

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest.

Rich Prosody Diversity Modelling with Phone-level Mixture Density Network

keonlee9420/Comprehensive-Transformer-TTS 1 Feb 2021

Generating natural speech with diverse and smooth prosody patterns is a challenging task.

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

keonlee9420/Expressive-FastSpeech2 11 Jun 2021

State-of-the-art context modeling methods in conversational TTS model only the textual information in context, using a recurrent neural network (RNN).

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

roholazandie/ryan-tts 15 Jun 2021

To meet the need for a high-quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech, which contains textual materials from real-world conversational settings.