Text-To-Speech Synthesis
93 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
Most implemented papers
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
In this paper, we first define human-level quality based on the statistical significance of subjective measures and introduce appropriate guidelines to judge it, and then develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech
Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling.
Non-Autoregressive Neural Text-to-Speech
In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram.
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
Our neural network (NN) predicts mean opinion score (MOS) with a high correlation to human judgments.
End-to-End Adversarial Text-to-Speech
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest.
Rich Prosody Diversity Modelling with Phone-level Mixture Density Network
Generating natural speech with diverse and smooth prosody patterns is a challenging task.
Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling
State-of-the-art context modeling methods in conversational TTS only model the textual information in context with a recurrent neural network (RNN).
RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis
To meet the need for a high-quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech, which contains textual materials from real-world conversational settings.
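One entry in the list above reports that a model predicts MOS with a high correlation to human judgments. As a rough illustration of what such a claim measures, the sketch below computes the Pearson correlation between listener ratings and model predictions. The function name `pearson` and all data values are hypothetical and not taken from any of the listed papers, which may use other correlation measures or evaluation protocols.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: averaged listener ratings on a 1-5 scale, and the
# scores an imagined MOS predictor assigned to the same utterances.
human_mos = [4.2, 3.1, 2.5, 4.8, 3.9]
predicted_mos = [4.0, 3.3, 2.9, 4.5, 3.7]

r = pearson(human_mos, predicted_mos)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the predictor ranks and spaces utterances much as listeners do. It says nothing about systematic offset, so a predictor that is consistently half a point too low can still correlate highly.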