Text-To-Speech Synthesis
93 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
Latest papers with no code
Code-Mixed Text to Speech Synthesis under Low-Resource Constraints
We further present an exhaustive evaluation of single-speaker adaptation and multi-speaker training with a Tacotron2 + WaveGlow setup to show that the former approach works better.
Guided Flows for Generative Modeling and Decision Making
Classifier-free guidance is a key component for enhancing the performance of conditional generative models across diverse tasks.
Generative Pre-training for Speech with Flow Matching
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data.
Unified speech and gesture synthesis using flow matching
As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures.
The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech.
DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis.
The FruitShell French synthesis system at the Blizzard 2023 Challenge
The evaluation results of our system showed a quality MOS score of 3.6 for the Hub task and 3.4 for the Spoke task, placing our system at an average level among all participating teams.
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversations makes speech more human-like compared to reading-style.
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis
In the SALTTS-parallel implementation, the representations from this second encoder are used for an auxiliary reconstruction loss with the SSL features.
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space.
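The remark about L1/L2 losses in the last entry can be made concrete with a small sketch. It is an illustration only, not code from any of the listed papers. Minimising L2 loss corresponds to maximum likelihood under a fixed-variance Gaussian and yields the mean of the targets; minimising L1 loss corresponds to a Laplace assumption and yields the median. The target values below are hypothetical, standing in for a prosodic feature (such as pitch in Hz) that speakers realise in two distinct ways.

```python
import statistics


def l1_loss(pred, targets):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(pred - t) for t in targets) / len(targets)


def l2_loss(pred, targets):
    """Mean squared error of a single constant prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# Hypothetical bimodal targets: the same text rendered at ~100 Hz or ~200 Hz.
targets = [100.0, 100.0, 100.0, 200.0, 200.0]

# Brute-force search over constant predictions on a 1 Hz grid.
candidates = [float(v) for v in range(90, 211)]
best_l2 = min(candidates, key=lambda p: l2_loss(p, targets))
best_l1 = min(candidates, key=lambda p: l1_loss(p, targets))

print("L2-optimal prediction:", best_l2, "| mean:", statistics.mean(targets))
print("L1-optimal prediction:", best_l1, "| median:", statistics.median(targets))
```

In this toy setting the L2-optimal prediction falls between the two modes, a value that never occurs in the data, while the L1-optimal prediction collapses onto one mode. Neither represents both modes, which is one motivation for distribution-modelling approaches such as normalizing flows and diffusion models.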