Speech Synthesis

291 papers with code • 4 benchmarks • 19 datasets

Speech synthesis is the task of generating speech from another modality, such as text or lip movements.

Please note that the leaderboards here are not directly comparable across studies: they use mean opinion score as a metric, and each study collects ratings from a different sample of Amazon Mechanical Turk workers.
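To make the metric concrete, here is a minimal sketch of how a mean opinion score is typically computed: the average of listener ratings on a 1-5 scale, often reported with a 95% confidence interval. The ratings below are illustrative, not real data.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Mean opinion score: average of 1-5 listener ratings, with a
    normal-approximation 95% confidence interval."""
    n = len(ratings)
    mos = statistics.mean(ratings)
    ci = 1.96 * statistics.stdev(ratings) / math.sqrt(n) if n > 1 else 0.0
    return mos, ci

# Illustrative ratings from 8 hypothetical listeners.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

Because both the listener pool and the sample of utterances differ between studies, two systems with the same MOS from different evaluations are not necessarily equally good.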

(Image credit: WaveNet: A Generative Model for Raw Audio)

Most implemented papers

Exploring Transfer Learning for Low Resource Emotional TTS

Emotional-Text-to-Speech/dl-for-emo-tts Advances in Intelligent Systems and Computing 2019

Over the last few years, spoken language technologies have improved substantially thanks to deep learning.

MelNet: A Generative Model for Audio in the Frequency Domain

fatchord/MelNet 4 Jun 2019

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps.
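The scale problem MelNet's abstract points to can be made concrete with a bit of arithmetic. The sample rate and spectrogram settings below are common defaults for illustration, not MelNet's exact configuration.

```python
# Illustrative arithmetic: why raw waveforms are very long sequences.
# Sample rate and STFT settings are common defaults, not MelNet's values.
sample_rate = 22050  # audio samples per second
hop_length = 256     # samples between successive spectrogram frames

waveform_steps = sample_rate * 1                 # timesteps in 1 s of raw audio
spectrogram_frames = sample_rate // hop_length   # frames in 1 s after the STFT

print(waveform_steps)      # tens of thousands of autoregressive steps
print(spectrogram_frames)  # far fewer steps along the time axis
```

Modeling in the frequency domain trades a sequence of tens of thousands of scalar timesteps for a few dozen frames per second, each carrying many frequency bins, which makes long-range structure easier to capture.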

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

r9y9/gantts 23 Sep 2017

In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
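The weighted-sum objective described above can be sketched in a few lines. Everything here is a toy stand-in: the function names and the weight value are hypothetical, and a real implementation such as r9y9/gantts trains neural networks rather than these stubs.

```python
import math

def generation_loss(generated, target):
    """Conventional minimum generation loss (mean squared error here)."""
    return sum((g - t) ** 2 for g, t in zip(generated, target)) / len(target)

def adversarial_loss(discriminator_score):
    """Loss for deceiving the discriminator: small when the discriminator
    scores the generated parameters as natural (score near 1)."""
    return -math.log(max(discriminator_score, 1e-12))

def acoustic_model_loss(generated, target, discriminator_score, weight=0.25):
    """Weighted sum the acoustic model minimizes (the weight is illustrative)."""
    return (generation_loss(generated, target)
            + weight * adversarial_loss(discriminator_score))
```

The discriminator pushes the generated speech parameters toward the natural-parameter distribution, while the conventional generation loss keeps them close to the target trajectory.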

Tools and resources for Romanian text-to-speech and speech-to-text applications

racai-ai/TEPROLIN 15 Feb 2018

In this paper we introduce a set of resources and tools aimed at providing support for natural language processing, text-to-speech synthesis and speech recognition for Romanian.

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

PaddlePaddle/DeepSpeech 9 Jul 2019

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages.

DurIAN: Duration Informed Attention Network For Multimodal Synthesis

ivanvovk/durian-pytorch 4 Sep 2019

In this paper, we present a generic and robust multimodal synthesis system that produces highly natural speech and facial expression simultaneously.

WaveFlow: A Compact Flow-based Model for Raw Audio

PaddlePaddle/Parakeet ICML 2020

WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases.

fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit

pytorch/fairseq 14 Sep 2021

This paper presents fairseq S^2, a fairseq extension for speech synthesis.

Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

huawei-noah/Speech-Backbones ICLR 2022

Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario.

A Critical Review of Recurrent Neural Networks for Sequence Learning

junwang23/deepdirtycodes 29 May 2015

Recurrent neural networks (RNNs) are connectionist models that capture the dynamics of sequences via cycles in the network of nodes.