FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

21 Apr 2022 · Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherent iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies with adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff demonstrates state-of-the-art results with higher-quality (MOS 4.28) speech samples. FastDiff also enables sampling 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalizes well to the mel-spectrogram inversion of unseen speakers, and that FastDiff-TTS outperforms other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.
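
To illustrate why the number of sampling steps dominates inference cost, the following is a minimal sketch of generic DDPM ancestral sampling, not FastDiff's actual sampler. Everything in it is an assumption made for illustration: `linear_betas` is a hypothetical hand-picked schedule (FastDiff predicts its schedule with a trained noise schedule predictor), `dummy_denoiser` is a placeholder for the trained conditional network, and the variance choice sigma_t^2 = beta_t is just one common DDPM option.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.5):
    """Hypothetical linear noise schedule (illustrative values only)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def dummy_denoiser(x, t, condition):
    """Placeholder for a trained noise-prediction network; returns zeros."""
    return [0.0] * len(x)


def ddpm_sample(denoiser, condition, length, betas, rng):
    """Run the DDPM reverse process; returns (samples, network_calls)."""
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a  # cumulative product of alphas
        alpha_bars.append(running)

    # Start from standard Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    calls = 0
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t, condition)  # one network evaluation per step
        calls += 1
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # assumes sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x, calls


if __name__ == "__main__":
    samples, calls = ddpm_sample(
        dummy_denoiser, None, length=8, betas=linear_betas(4), rng=random.Random(0)
    )
    print(f"{calls} network calls for {len(samples)} samples")
```

Each reverse step costs one full network evaluation, so a 4-step schedule needs 4 evaluations where a 1000-step schedule would need 1000. For scale, the reported 58x real-time speed corresponds to roughly 1/58 ≈ 0.017 seconds of compute per second of generated audio.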

Code

Rongjiehuang/FastDiff (official) · 390 stars

Rongjiehuang/ProDiff · 423 stars

Tasks

Denoising

Speech Synthesis

Text-To-Speech Synthesis

Vocal Bursts Intensity Prediction

Datasets

VCTK

LJSpeech

Results from the Paper

Ranked #7 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Text-To-Speech Synthesis	LJSpeech	FastDiff (4 steps)	Audio Quality MOS	4.28	# 7
Text-To-Speech Synthesis	LJSpeech	FastDiff-TTS	Audio Quality MOS	4.03	# 8

Methods

Diffusion • SPEED
