Speech Synthesis
294 papers with code • 4 benchmarks • 19 datasets
Speech synthesis is the task of generating speech from another modality, such as text or lip movements.
Please note that the leaderboards here are not directly comparable between studies, because they use mean opinion score (MOS) as the metric and each study collects its own samples from Amazon Mechanical Turk.
( Image credit: WaveNet: A generative model for raw audio )
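To illustrate why mean opinion scores from different studies are hard to compare, the following is a minimal sketch of how an MOS and an approximate 95% confidence interval might be computed from listener ratings. The function name, the 1 to 5 rating scale, the normal-approximation interval, and both sets of ratings are assumptions made up for illustration; they do not come from any study listed here.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mean, CI half-width) for a list of listener ratings.

    Assumes ratings on a 1-5 scale and a normal-approximation
    confidence interval (z=1.96 for roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * stderr


# Hypothetical ratings of the same audio by two different listener pools
pool_a = [4, 5, 4, 4, 5, 3, 4, 5]
pool_b = [3, 4, 3, 4, 4, 3, 3, 4]

mos_a, ci_a = mean_opinion_score(pool_a)
mos_b, ci_b = mean_opinion_score(pool_b)
print(f"pool A: MOS {mos_a:.2f} +/- {ci_a:.2f}")
print(f"pool B: MOS {mos_b:.2f} +/- {ci_b:.2f}")
```

In this made-up case the same audio receives a noticeably different score from each pool, which is the kind of rater and sample variation that makes cross-study MOS leaderboards unreliable.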
Libraries
Use these libraries to find Speech Synthesis models and implementations
Datasets
Subtasks
- Expressive Speech Synthesis
- Emotional Speech Synthesis
- Text-to-Speech Translation
- Speech Synthesis - Tamil
- Speech Synthesis - Kannada
- Speech Synthesis - Malayalam
- Speech Synthesis - Telugu
- Speech Synthesis - Assamese
- Speech Synthesis - Bengali
- Speech Synthesis - Bodo
- Speech Synthesis - Gujarati
- Speech Synthesis - Hindi
- Speech Synthesis - Manipuri
- Speech Synthesis - Marathi
- Speech Synthesis - Rajasthani
Most implemented papers
Deep Voice: Real-time Neural Text-to-Speech
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks.
Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq
We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training.
High Fidelity Speech Synthesis with Adversarial Networks
The application of generative adversarial networks in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech.
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text.
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
SpeedySpeech: Efficient Neural Speech Synthesis
While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time.
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform.
One TTS Alignment To Rule Them All
The speech-to-text alignments that autoregressive TTS models learn with attention tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.