Text-To-Speech Synthesis
93 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
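Most modern systems decompose the task into an acoustic model that maps text to intermediate acoustic features (typically a mel-spectrogram) and a vocoder that converts those features into a waveform. A minimal sketch of that two-stage structure, with every component an illustrative stub (the hop length, mel-channel count, and "five frames per phoneme" heuristic are assumptions, not any particular system's values):

```python
import numpy as np

# Hypothetical two-stage TTS pipeline: acoustic model + vocoder.
# All components here are stubs for illustration, not a real model.

HOP_LENGTH = 256   # waveform samples per mel frame (assumed common default)
N_MELS = 80        # mel-spectrogram channels (assumed common default)

def text_to_phonemes(text: str) -> list:
    """Toy front end: treat each letter as one 'phoneme'."""
    return [c for c in text.lower() if c.isalpha()]

def acoustic_model(phonemes: list) -> np.ndarray:
    """Stub acoustic model: emit ~5 mel frames per phoneme."""
    n_frames = 5 * len(phonemes)
    return np.zeros((n_frames, N_MELS))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stub vocoder: each mel frame covers HOP_LENGTH waveform samples."""
    return np.zeros(mel.shape[0] * HOP_LENGTH)

def synthesize(text: str) -> np.ndarray:
    mel = acoustic_model(text_to_phonemes(text))
    return vocoder(mel)

wav = synthesize("Hello world")
print(wav.shape)  # (12800,): 10 letters * 5 frames * 256 samples
```

The point of the split is that the two stages can be trained and swapped independently, which is why most of the papers below target one stage or the other.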
Libraries
Use these libraries to find Text-To-Speech Synthesis models and implementations
Datasets
Most implemented papers
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Singing voice synthesis (SVS) systems are built to synthesize high-quality, expressive singing voices, in which the acoustic model generates the acoustic features (e.g., a mel-spectrogram) given a music score.
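A mel-spectrogram of the kind such acoustic models predict can be computed directly from a waveform: a short-time Fourier transform followed by projection through a triangular mel filterbank. A self-contained numpy sketch (the sample rate, FFT size, hop length, and mel count are common defaults, not DiffSinger's settings):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                    # rising edge
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wav) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(wav[start:start + n_fft] * window)) ** 2
        frames.append(spec)
    power = np.array(frames)                           # (T, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (T, n_mels)
    return np.log(mel + 1e-6)                          # log compression

wav = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)  # 1 s, 440 Hz tone
M = mel_spectrogram(wav)
print(M.shape)  # (83, 80): 83 frames of 80 mel channels
```

In practice a library routine (e.g. a maintained audio-processing package) would be used instead; the sketch just shows why the representation is compact: one second of audio becomes a few dozen frames.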
Neural Speech Synthesis with Transformer Network
Although end-to-end neural text-to-speech (TTS) methods such as Tacotron 2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs).
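The Transformer alternative replaces that recurrence with scaled dot-product attention, in which every output position attends to every input position in a single matrix product, so nothing has to be unrolled step by step. A numpy sketch of the core operation (the sequence lengths and dimensions are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    All pairwise interactions are computed at once, which is what
    removes the sequential bottleneck of an RNN decoder.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (T_q, T_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (T_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(50, 64))    # e.g. 50 decoder positions
K = rng.normal(size=(120, 64))   # e.g. 120 encoder (phoneme) positions
V = rng.normal(size=(120, 64))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (50, 64)
```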
Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Recently, denoising diffusion probabilistic models and generative score matching have shown high potential for modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes.
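The forward (noising) half of such a diffusion model has a simple closed form: after t steps of a variance schedule, a clean sample x0 is distributed as a Gaussian whose signal scale shrinks as the cumulative product of (1 - beta) decays. A numpy sketch (the linear schedule values are illustrative defaults, not Grad-TTS's score-based formulation):

```python
import numpy as np

# Sketch of the DDPM-style forward (noising) process that diffusion
# TTS models learn to invert. Schedule values are illustrative only.

T = 100
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal retention

def q_sample(x0, t, rng):
    """Closed-form draw from q(x_t | x_0) = N(sqrt(ab_t) x0, (1 - ab_t) I)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = np.ones(80)                  # stand-in for one mel frame
xT = q_sample(x0, T - 1, rng)     # heavily noised, close to pure Gaussian
print(xT.shape)
```

Generation runs this process in reverse: a network trained to predict the injected noise (or the score) is applied iteratively, starting from Gaussian noise.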
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.
Exploring Transfer Learning for Low Resource Emotional TTS
Over the last few years, spoken language technologies have improved considerably thanks to deep learning.
MelNet: A Generative Model for Audio in the Frequency Domain
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps.
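The scale mismatch is easy to quantify: one second of raw audio spans tens of thousands of samples, while a frequency-domain representation of the same second spans only tens of frames. A quick back-of-envelope comparison (a hop length of 256 samples per frame is an assumed common value):

```python
# Samples per second of raw audio vs. spectrogram frames per second,
# assuming a hop length of 256 samples between frames.
HOP = 256
for sr in (16000, 22050, 44100):
    print(sr, "samples/s ->", sr // HOP, "frames/s")
```

This two-orders-of-magnitude reduction in sequence length is what makes frequency-domain generative modelling attractive for long-range structure.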
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
By leveraging the properties of flows, MAS searches for the most probable monotonic alignment between text and the latent representation of speech.
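The search itself is a small dynamic program over a grid of log-likelihoods: Q[i, j] holds the best score of a monotonic path that assigns mel frame j to text token i, where the token index can only stay the same or advance by one per frame. A simplified illustration (the toy likelihood table is made up; the real algorithm runs on the model's latent likelihoods):

```python
import numpy as np

def monotonic_alignment_search(log_lik):
    """Viterbi-style DP over a (text_len, mel_len) log-likelihood grid.

    Returns a 0/1 matrix assigning each mel frame to exactly one text
    token, with the token index never decreasing. A simplified
    illustration of the MAS idea used in Glow-TTS.
    """
    n_text, n_mel = log_lik.shape
    Q = np.full((n_text, n_mel), -np.inf)
    Q[0, 0] = log_lik[0, 0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = log_lik[i, j] + max(stay, advance)
    # Backtrack from the bottom-right corner.
    align = np.zeros_like(log_lik, dtype=int)
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        align[i, j] = 1
        if j > 0 and (i == 0 or Q[i, j - 1] >= Q[i - 1, j - 1]):
            pass          # best path stayed on the same token
        elif j > 0:
            i -= 1        # best path advanced from the previous token
    return align

# Toy example: 3 tokens, 6 frames; likelihoods favor a 2-2-2 split.
L = np.log(np.array([
    [.8, .8, .1, .1, .1, .1],
    [.1, .1, .8, .8, .1, .1],
    [.1, .1, .1, .1, .8, .8],
]))
A = monotonic_alignment_search(L)
print(A.sum(axis=0))  # [1 1 1 1 1 1]: one token per frame
```

The recovered alignment doubles as a duration target, which is what lets Glow-TTS train without an external aligner.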
Tools and resources for Romanian text-to-speech and speech-to-text applications
In this paper we introduce a set of resources and tools aimed at providing support for natural language processing, text-to-speech synthesis and speech recognition for Romanian.
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
In this paper, we propose Flowtron, an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
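A normalizing flow's defining property is closed-form invertibility: each step is a bijection, so samples can be mapped to latent space and back exactly. A minimal (non-autoregressive) affine flow step, purely for illustration; Flowtron's actual flows are autoregressive and conditioned on text and speaker:

```python
import numpy as np

# Minimal affine flow step: z = (x - mu) * exp(-log_s), inverted exactly.
# Illustrative parameters only; in a real flow mu and log_s are
# predicted by a network.

def forward(x, mu, log_s):
    return (x - mu) * np.exp(-log_s)

def inverse(z, mu, log_s):
    return z * np.exp(log_s) + mu

x = np.array([1.0, 2.0, 3.0])
mu, log_s = 0.5, np.log(2.0)
z = forward(x, mu, log_s)
print(np.allclose(inverse(z, mu, log_s), x))  # True: exact invertibility
```

This exact invertibility is what gives flow-based TTS models tractable likelihoods and controllable latent variables.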
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.
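That refinement loop can be caricatured as repeatedly denoising from pure noise toward a waveform. In the sketch below the learned network is replaced by a hypothetical toy denoiser that simply pulls the sample toward a fixed target, just to show the shape of the loop:

```python
import numpy as np

# Iterative refinement caricature: start from Gaussian noise and apply
# a denoising step repeatedly. 'denoise_step' is a toy stand-in for the
# learned noise-prediction network, not WaveGrad 2's model.

def denoise_step(x, target):
    """Toy denoiser: move 10% of the way toward a fixed target."""
    return x + 0.1 * (target - x)

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 2 * np.pi, 512))  # stand-in 'waveform'
x = rng.normal(size=512)                          # start from pure noise
for _ in range(100):                              # refine iteratively
    x = denoise_step(x, target)
err = float(np.abs(x - target).mean())
print(err < 0.01)  # True: the iterates converged toward the target
```

In the real model each step is conditioned on the phoneme sequence and a noise-level embedding, and fewer, larger steps trade quality for speed.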