Voice Conversion
149 papers with code • 2 benchmarks • 5 datasets
Voice Conversion is a technology that modifies the speech of a source speaker so that it sounds like the speech of a target speaker, without changing the linguistic information.
Libraries
Use these libraries to find Voice Conversion models and implementations.

Most implemented papers
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
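The combined objective described above can be sketched as a weighted sum of a conventional minimum-generation loss and an adversarial term that rewards fooling the discriminator. This is a minimal illustrative sketch, not the paper's implementation: the function name, the MSE stand-in for the generation loss, and the weight `w_adv` are assumptions.

```python
import numpy as np

def acoustic_model_loss(y_nat, y_gen, d_gen_prob, w_adv=0.5):
    """Illustrative sketch of the combined objective: a conventional
    minimum-generation loss (MSE here, as a stand-in) plus a weighted
    adversarial loss that is low when the discriminator is deceived.

    y_nat, y_gen: natural and generated speech parameters (arrays)
    d_gen_prob:   discriminator's probability that y_gen is natural
    w_adv:        weight on the adversarial term (assumed value)
    """
    l_mg = np.mean((y_nat - y_gen) ** 2)            # generation loss
    l_adv = -np.mean(np.log(d_gen_prob + 1e-8))     # fool the discriminator
    return l_mg + w_adv * l_adv
```

When the discriminator is fully deceived (`d_gen_prob` near 1) the adversarial term vanishes and only the generation loss remains.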
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.
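The conditioning step described above, where the decoder consumes a speaker-independent content latent together with a target speaker embedding, can be sketched as a simple concatenation followed by a projection. This is a hypothetical sketch: the single linear layer stands in for the full decoder, and all names and dimensions are assumptions.

```python
import numpy as np

def decode(content_latent, speaker_emb, w, b):
    """Sketch of conditioning a decoder on both inputs: concatenate the
    speaker-independent content latent with the target speaker embedding,
    then project. A real decoder would be a deep network; one linear
    layer is used here purely for illustration."""
    z = np.concatenate([content_latent, speaker_emb])
    return w @ z + b  # output stands in for generated acoustic features

# Illustrative shapes (assumed): 8-dim content, 4-dim speaker, 80-dim output
rng = np.random.default_rng(0)
w = rng.normal(size=(80, 12))
b = np.zeros(80)
features = decode(np.ones(8), np.zeros(4), w, b)
```

Swapping in a different speaker embedding while keeping the content latent fixed is what converts the voice without altering the linguistic content.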
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario.
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2.
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
One-class learning towards generalized voice spoofing detection
Human voices can be used to authenticate the identity of a speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks such as impersonation, replay, text-to-speech, and voice conversion.
MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames
With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames.
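The temporal-masking step described above can be sketched as zeroing a contiguous span of frames in the mel-spectrogram, which the converter is then trained to fill in from the surrounding frames. This is a minimal sketch of the masking operation only; the function name, fill value, and shapes are assumptions, not the paper's code.

```python
import numpy as np

def apply_temporal_mask(mel, start, width, fill=0.0):
    """Sketch of the Filling-in-Frames (FIF) mask: blank out a contiguous
    run of frames so the converter must reconstruct them from context.

    mel:   mel-spectrogram of shape (n_mels, n_frames)
    start: index of the first masked frame
    width: number of consecutive frames to mask
    """
    masked = mel.copy()                      # leave the input untouched
    masked[:, start:start + width] = fill    # zero the masked frames
    return masked
```

During training the converter would receive the masked spectrogram and be penalized for mismatches against the original in the masked region.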
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
AUTOVC used a d-vector to extract speaker information, while FragmentVC used self-supervised learning (SSL) features such as wav2vec 2.0 to extract the phonetic content information.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.