Text-To-Speech Synthesis

92 papers with code • 6 benchmarks • 17 datasets

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.
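
At a high level, a TTS system first maps text to an acoustic representation (typically a mel-spectrogram) and then converts it to a waveform with a vocoder. A minimal inference sketch using torchaudio's pretrained Tacotron 2 + WaveRNN bundle (pretrained weights download on first use):

```python
# Minimal TTS inference sketch: text -> mel-spectrogram -> waveform,
# using torchaudio's pretrained character-based Tacotron 2 + WaveRNN bundle.
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()    # text -> character token IDs
tacotron2 = bundle.get_tacotron2().eval()  # acoustic model
vocoder = bundle.get_vocoder().eval()      # neural vocoder

text = "Text to speech synthesis converts written text into speech."
with torch.inference_mode():
    tokens, lengths = processor(text)
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, wave_lengths = vocoder(spec, spec_lengths)

torchaudio.save("tts_output.wav", waveforms[0:1].cpu(),
                sample_rate=vocoder.sample_rate)
```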

Latest papers with no code

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

no code yet • 4 Apr 2024

Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from 68% to 4%.
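
RALL-E's chain-of-thought idea is to decode intermediate prosody tokens (e.g., duration and pitch) before the speech codec tokens, so the harder prediction is conditioned on the easier one. A schematic sketch, with `lm`, `encode_text`, and the special tokens all as hypothetical stand-ins:

```python
# Schematic sketch of chain-of-thought-style TTS decoding: predict
# prosody tokens first, then condition codec-token prediction on them.
# `lm`, `encode_text`, and the special tokens are hypothetical stand-ins.
def synthesize(lm, encode_text, text, max_len=2048):
    prompt = encode_text(text)                      # text/phoneme tokens

    # Stage 1 (CoT step): autoregressively decode prosody tokens.
    prosody = lm.generate(prompt + ["<prosody>"], stop="</prosody>",
                          max_new_tokens=max_len)

    # Stage 2: decode speech codec tokens, conditioned on text + prosody.
    codec = lm.generate(prompt + prosody + ["<speech>"], stop="</speech>",
                        max_new_tokens=max_len)
    return codec            # feed to a codec decoder to obtain audio
```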

PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders

no code yet • 3 Apr 2024

Neural speech codecs have recently gained widespread attention in generative speech modeling domains such as voice conversion and text-to-speech synthesis.
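
For context, a neural speech codec compresses a waveform into discrete tokens via an encoder, a (typically residual) vector quantizer, and a decoder. A toy sketch of that pipeline, with illustrative layer sizes rather than any specific codec's architecture:

```python
import torch
import torch.nn as nn

# Toy neural speech codec: encoder -> residual vector quantization ->
# decoder. Real codecs use deep convolutional stacks and extra losses
# (plus a straight-through estimator for training); sizes are illustrative.
class ToyCodec(nn.Module):
    def __init__(self, dim=64, codebook_size=1024, n_quantizers=4):
        super().__init__()
        # ~50 frames/s at 16 kHz input
        self.encoder = nn.Conv1d(1, dim, kernel_size=320, stride=320)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=320)
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_quantizers))

    def quantize(self, z):                        # z: (batch, dim, frames)
        x = z.transpose(1, 2)                     # (batch, frames, dim)
        residual, quantized = x, torch.zeros_like(x)
        for cb in self.codebooks:                 # each stage quantizes the residual
            dists = torch.cdist(
                residual, cb.weight.unsqueeze(0).expand(x.size(0), -1, -1))
            q = cb(dists.argmin(-1))              # nearest codeword per frame
            quantized, residual = quantized + q, residual - q
        return quantized.transpose(1, 2)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        return self.decoder(self.quantize(self.encoder(wav)))
```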

Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting

no code yet • 19 Feb 2024

Our results demonstrate that our methods overcome catastrophic forgetting without degrading fine-tuning performance, and that Kronecker-factored approximations preserve pre-training knowledge better than diagonal ones.
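
The underlying idea can be illustrated in its simplest (diagonal) form: a Laplace-style penalty that discourages fine-tuned parameters from drifting away from their pre-trained values, weighted by an estimated curvature (Fisher information). A schematic sketch; the paper's stronger variant replaces the diagonal estimate with Kronecker-factored curvature:

```python
import torch

# Schematic diagonal-Laplace regularizer: penalize drift of fine-tuned
# parameters from their pre-trained values, weighted by a (diagonal)
# Fisher-information estimate from pre-training. `pretrained_params` and
# `fisher_diag` are dicts keyed by parameter name, prepared beforehand.
def laplace_penalty(model, pretrained_params, fisher_diag, scale=1.0):
    penalty = 0.0
    for name, p in model.named_parameters():
        if p.requires_grad:                    # e.g., only the PEFT parameters
            p0 = pretrained_params[name]       # snapshot of initial values
            f = fisher_diag[name]              # per-parameter curvature estimate
            penalty = penalty + (f * (p - p0) ** 2).sum()
    return 0.5 * scale * penalty

# Training step (sketch): loss = task_loss + laplace_penalty(model, ...)
```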

Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

no code yet • 10 Jan 2024

The zero-shot text-to-speech (TTS) method, which extracts speaker embeddings from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately.
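
A common recipe for such SSL-based speaker embeddings is to freeze the SSL encoder, insert small trainable adapters, and pool frame-level features into an utterance-level vector. A sketch with a hypothetical `ssl_model` standing in for a WavLM/HuBERT-style encoder:

```python
import torch
import torch.nn as nn

# Sketch: derive a speaker embedding from a frozen SSL speech encoder,
# with a small trainable adapter inserted so the embedding can be made
# robust (e.g., to noisy reference speech). `ssl_model` is a stand-in
# that returns frame-level features of shape (batch, frames, feat_dim).
class AdapterSpeakerEncoder(nn.Module):
    def __init__(self, ssl_model, feat_dim=768, bottleneck=64, emb_dim=256):
        super().__init__()
        self.ssl = ssl_model
        for p in self.ssl.parameters():            # keep SSL weights frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(              # lightweight trainable adapter
            nn.Linear(feat_dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, feat_dim))
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, wav):                        # wav: (batch, samples)
        feats = self.ssl(wav)                      # (batch, frames, feat_dim)
        feats = feats + self.adapter(feats)        # residual adapter
        return self.proj(feats.mean(dim=1))        # mean-pool -> speaker embedding
```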

Boosting Large Language Model for Speech Synthesis: An Empirical Study

no code yet • 30 Dec 2023

In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech by combining the pre-trained LLMs LLaMA/OPT with the text-to-speech synthesis model VALL-E. We compare three integration methods: directly fine-tuning the LLM, superposing layers of the LLM and VALL-E, and coupling the LLM with VALL-E by using the LLM as a powerful text encoder.
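
A sketch of the coupled variant, in which the LLM acts purely as a text encoder whose hidden states condition a VALL-E-style codec language model; `llm`, `codec_lm`, and the `encoder_hidden_states` hook are hypothetical stand-ins, not the paper's exact interface:

```python
import torch.nn as nn

# Sketch of the "coupled" integration: the LLM encodes the text, and a
# VALL-E-style codec language model attends to those hidden states while
# autoregressively predicting speech codec tokens. All names are stand-ins.
class LLMConditionedTTS(nn.Module):
    def __init__(self, llm, codec_lm, llm_dim=4096, tts_dim=1024):
        super().__init__()
        self.llm = llm                              # frozen or fine-tuned text encoder
        self.bridge = nn.Linear(llm_dim, tts_dim)   # map LLM states to TTS width
        self.codec_lm = codec_lm                    # autoregressive codec-token decoder

    def forward(self, text_ids, codec_tokens):
        h = self.llm(text_ids).last_hidden_state    # (batch, text_len, llm_dim)
        cond = self.bridge(h)
        # the codec LM cross-attends to `cond` while predicting codec tokens
        return self.codec_lm(codec_tokens, encoder_hidden_states=cond)
```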

Normalization of Lithuanian Text Using Regular Expressions

no code yet • 29 Dec 2023

The work presents a taxonomy of semiotic classes adapted to the Lithuanian language.
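
Regex-driven normalization replaces written-form tokens (digits, abbreviations, etc.) with their spoken forms before synthesis. An illustrative Python sketch with a tiny subset of mappings, not the paper's full taxonomy:

```python
import re

# Illustrative regex-based text normalization for Lithuanian TTS:
# expand a few digits and abbreviations into their spoken forms.
# These mappings are a tiny illustrative subset of the semiotic classes.
DIGITS = {"1": "vienas", "2": "du", "3": "trys"}
ABBREVIATIONS = {"pvz.": "pavyzdžiui", "nr.": "numeris"}

def normalize(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(re.escape(abbr), full, text, flags=re.IGNORECASE)
    text = re.sub(r"\d", lambda m: DIGITS.get(m.group(), m.group()), text)
    return text

print(normalize("Pvz. 3 knygos, nr. 2"))
# -> "pavyzdžiui trys knygos, numeris du"
```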

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

no code yet • 17 Dec 2023

The challenges of modeling such multi-modal style-controllable TTS mainly lie in two aspects: 1) aligning the multi-modal information into a unified style space so that an arbitrary modality can serve as the style prompt in a single system, and 2) efficiently transferring the unified style representation to the given text content, thereby enabling the generation of prompt-style-related speech.
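
One common way to realize aspect 1) is to train per-modality style encoders with a CLIP-style contrastive objective so that paired prompts land near each other in the shared style space. A sketch of that loss (an assumption about the general approach, not MM-TTS's exact objective):

```python
import torch
import torch.nn.functional as F

# CLIP-style contrastive alignment of two modality encoders' outputs
# (e.g., text and audio style prompts) into one shared style space,
# so either modality can act as the style prompt at inference.
def contrastive_align(style_a, style_b, temperature=0.07):
    # style_a, style_b: (batch, dim) embeddings of paired prompts
    a = F.normalize(style_a, dim=-1)
    b = F.normalize(style_b, dim=-1)
    logits = a @ b.t() / temperature             # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```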

An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis

no code yet • 8 Dec 2023

We propose a new model architecture specifically suited for text-to-speech (TTS) models.
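
For reference, BEST-RQ derives discrete training targets with a frozen random projection and a frozen random codebook, so no quantizer is ever learned. A sketch of that target computation, with illustrative sizes:

```python
import torch
import torch.nn.functional as F

# Sketch of a BEST-RQ-style random-projection quantizer: a frozen random
# projection plus a frozen random codebook turn speech frames into
# discrete targets for masked prediction. Sizes are illustrative.
class RandomProjectionQuantizer:
    def __init__(self, feat_dim=80, code_dim=16, codebook_size=8192):
        g = torch.Generator().manual_seed(0)       # fixed randomness, never trained
        self.proj = torch.randn(feat_dim, code_dim, generator=g)
        self.codebook = F.normalize(
            torch.randn(codebook_size, code_dim, generator=g), dim=-1)

    def __call__(self, frames):                    # frames: (batch, T, feat_dim)
        z = F.normalize(frames @ self.proj, dim=-1)
        # nearest (cosine) codeword index per frame = training target
        return (z @ self.codebook.t()).argmax(-1)  # (batch, T) integer targets
```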

Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

no code yet • 6 Dec 2023

Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process.
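
In the simplest (Brownian-bridge) special case, the tractable marginals between the paired endpoints can be sampled as follows; this is a simplified illustration of the data-to-data idea, not the paper's exact parameterization:

```python
import torch

# Brownian-bridge special case of a Schrodinger bridge between paired
# endpoints x0 (latent prior computed from the text) and x1 (the
# ground-truth mel-spectrogram), with Gaussian marginals:
#     x_t ~ N((1 - t) * x0 + t * x1,  sigma^2 * t * (1 - t) * I)
def bridge_sample(x0, x1, t, sigma=1.0):
    # t: tensor in (0, 1), broadcastable against x0/x1
    mean = (1 - t) * x0 + t * x1
    std = sigma * torch.sqrt(t * (1 - t))
    return mean + std * torch.randn_like(x0)

# A network then learns to predict x1 (or the bridge drift) from (x_t, t),
# giving a data-to-data process rather than a noise-to-data one.
```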

Code-Mixed Text to Speech Synthesis under Low-Resource Constraints

no code yet • 2 Dec 2023

We further present an exhaustive evaluation of single-speaker adaptation and multi-speaker training with a Tacotron2 + WaveGlow setup, showing that the former approach works better.
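
Single-speaker adaptation here means fine-tuning a pre-trained acoustic model on a small amount of target-speaker code-mixed data while keeping the vocoder frozen. A schematic sketch, with the acoustic model and data loader as stand-ins supplied by the caller:

```python
import torch
import torch.nn.functional as F

# Sketch of single-speaker adaptation: fine-tune a pre-trained
# Tacotron 2-style acoustic model on a small code-mixed corpus from one
# target speaker; the WaveGlow vocoder stays frozen and only converts
# the adapted mel-spectrograms to audio afterwards.
def adapt_single_speaker(acoustic_model, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(acoustic_model.parameters(), lr=lr)
    for _ in range(epochs):
        for text_ids, mel_target in loader:     # (tokens, mel) pairs
            mel_pred = acoustic_model(text_ids) # predicted mel-spectrogram
            loss = F.mse_loss(mel_pred, mel_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return acoustic_model
```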