Speech Synthesis
290 papers with code • 4 benchmarks • 19 datasets
Speech synthesis is the task of generating speech from another modality, such as text or lip movements.
Please note that the leaderboards here are not directly comparable across studies, as they use mean opinion score (MOS) as the metric and collect ratings from different samples of Amazon Mechanical Turk workers.
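As a quick illustration of why MOS-based leaderboards are hard to compare, here is a minimal sketch of how a mean opinion score is typically computed: listeners rate each sample on a 1-5 scale, and the reported number is the mean, usually with a 95% confidence interval. The function name and the example ratings below are illustrative, not taken from any specific study.

```python
import math

def mean_opinion_score(ratings):
    """Return (mean, 95% confidence half-width) for a list of 1-5 ratings.

    Uses the sample standard deviation and a normal approximation,
    which is the common convention in TTS papers.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    ci = 1.96 * math.sqrt(var / n)  # normal-approximation 95% interval
    return mean, ci

# Hypothetical ratings from one listening test
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Because the mean depends entirely on who the raters are and which samples they hear, two systems evaluated in different studies can report similar MOS values that are not actually comparable.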
(Image credit: WaveNet: A Generative Model for Raw Audio)
Libraries
Use these libraries to find Speech Synthesis models and implementations
Datasets
Subtasks
- Expressive Speech Synthesis
- Emotional Speech Synthesis
- text-to-speech translation
- Speech Synthesis - Tamil
- Speech Synthesis - Kannada
- Speech Synthesis - Malayalam
- Speech Synthesis - Telugu
- Speech Synthesis - Assamese
- Speech Synthesis - Bengali
- Speech Synthesis - Bodo
- Speech Synthesis - Gujarati
- Speech Synthesis - Hindi
- Speech Synthesis - Manipuri
- Speech Synthesis - Marathi
- Speech Synthesis - Rajasthani
Latest papers
HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks
In this work, we present HyperTTS, which comprises a small learnable network, "hypernetwork", that generates parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and making them dynamic.
KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis
This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications.
CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models
The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis.
Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation
Contemporary conversational systems often present a significant limitation: their responses lack the emotional depth and disfluent characteristics of human interactions.
Towards Decoding Brain Activity During Passive Listening of Speech
The aim of the study is to investigate the complex mechanisms of speech perception and ultimately decode the electrical changes occurring in the brain while listening to speech.
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection
The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms.
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism
Our method, which we call NEUral Text to ARticulate Talk (NEUTART), is a talking face generator that uses a joint audiovisual feature space, as well as speech-informed 3D facial reconstructions and a lip-reading loss for visual supervision.
Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech
In this work, we propose to learn the AV representation from categorical emotion labels of speech.
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis
Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.