Text-To-Speech Synthesis
92 papers with code • 6 benchmarks • 17 datasets
Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
Libraries
Use these libraries to find Text-To-Speech Synthesis models and implementationsDatasets
Latest papers with no code
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.
PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders
Neural speech codec has recently gained widespread attention in generative speech modeling domains, like voice conversion, text-to-speech synthesis, etc.
Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting
Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and using the Kronecker factored approximations produces a better preservation of the pre-training knowledge than the diagonal ones.
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters
The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately.
Boosting Large Language Model for Speech Synthesis: An Empirical Study
In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.
Normalization of Lithuanian Text Using Regular Expressions
The taxonomy of semiotic classes adapted to the Lithuanian language is presented in the work.
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis
The challenges of modeling such a multi-modal style controllable TTS mainly lie in two aspects:1)aligning the multi-modal information into a unified style space to enable the input of arbitrary modality as the style prompt in a single system, and 2)efficiently transferring the unified style representation into the given text content, thereby empowering the ability to generate prompt style-related voice.
An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis
We propose a new model architecture specifically suited for text-to-speech (TTS) models.
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process.
Code-Mixed Text to Speech Synthesis under Low-Resource Constraints
We further present an exhaustive evaluation of single-speaker adaptation and multi-speaker training with Tacotron2 + Waveglow setup to show that the former approach works better.