Voice Conversion

149 papers with code • 2 benchmarks • 5 datasets

Voice Conversion is a technology that modifies the speech of a source speaker so that it sounds like the speech of a target speaker, without changing the linguistic information.

Source: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
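The definition above can be sketched as a three-part pipeline: a content encoder extracts the linguistic information from the source utterance, a speaker encoder extracts an identity embedding from any utterance by the target speaker, and a decoder recombines the two. The sketch below uses random linear maps purely for illustration; the component names, dimensions, and shapes are assumptions, not any specific model from this page.

```python
import numpy as np

# Toy sketch of the generic voice-conversion pipeline. All "models" here are
# random linear maps chosen only to make the data flow concrete (assumptions).

rng = np.random.default_rng(0)
N_FRAMES, N_MEL, D_CONTENT, D_SPK = 100, 80, 64, 16

W_content = rng.normal(size=(N_MEL, D_CONTENT)) * 0.1      # "content encoder"
W_spk = rng.normal(size=(N_MEL, D_SPK)) * 0.1              # "speaker encoder"
W_dec = rng.normal(size=(D_CONTENT + D_SPK, N_MEL)) * 0.1  # "decoder"

def content_encoder(mel):
    # Frame-wise linguistic representation of the utterance.
    return mel @ W_content

def speaker_encoder(mel):
    # One identity embedding per utterance (utterance-level mean).
    return (mel @ W_spk).mean(axis=0)

def decoder(content, spk_emb):
    # Recombine source content with the target speaker's identity.
    spk = np.broadcast_to(spk_emb, (content.shape[0], spk_emb.shape[0]))
    return np.concatenate([content, spk], axis=1) @ W_dec

src_mel = rng.normal(size=(N_FRAMES, N_MEL))  # source speaker's utterance
tgt_mel = rng.normal(size=(120, N_MEL))       # any utterance by the target

converted = decoder(content_encoder(src_mel), speaker_encoder(tgt_mel))
print(converted.shape)  # frame count follows the source: (100, 80)
```

The key property is that the output keeps the source's frame count (content) while the speaker embedding comes entirely from the target utterance.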


Most implemented papers

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

r9y9/gantts 23 Sep 2017

In the proposed framework incorporating GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
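The combined objective described above can be written as L = L_gen + w · L_adv: a conventional generation loss (MSE here) plus a weighted adversarial term that rewards fooling the discriminator. The sketch below is a minimal numpy illustration; the logistic-regression discriminator, the weight w, and all shapes are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 60))  # natural speech parameters (toy data)
generated = natural + rng.normal(scale=0.3, size=natural.shape)

def discriminator(x, w, b):
    """Toy discriminator: P(x is natural) via logistic regression."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

w_d = rng.normal(size=60) * 0.1
b_d = 0.0

# Conventional minimum-generation loss: MSE between generated and natural.
l_gen = np.mean((generated - natural) ** 2)

# Adversarial loss: the acoustic model wants D(generated) -> 1 ("natural").
p_fake = discriminator(generated, w_d, b_d)
l_adv = -np.mean(np.log(p_fake + 1e-8))

w_adv = 0.5  # weighting hyperparameter (illustrative value)
l_total = l_gen + w_adv * l_adv
print(f"L_gen={l_gen:.3f}  L_adv={l_adv:.3f}  L_total={l_total:.3f}")
```

In training, the discriminator's own loss would be updated in alternation with this combined acoustic-model loss.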

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

jjery2243542/voice_conversion 9 Apr 2018

The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.
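For the decoder to work as described, the latent representation must actually be speaker-independent; the paper's title points to adversarial learning as the mechanism. A common form of that idea, sketched below with numpy: a speaker classifier is trained on the latent codes, and the encoder is penalised when the classifier succeeds, pushing speaker identity out of the latent. The classifier, shapes, and losses here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
N_UTTS, N_SPEAKERS, D_LATENT = 8, 4, 32
latent = rng.normal(size=(N_UTTS, D_LATENT))           # encoder outputs (toy)
speaker_ids = rng.integers(0, N_SPEAKERS, size=N_UTTS)  # true speaker labels

W_cls = rng.normal(size=(D_LATENT, N_SPEAKERS)) * 0.1  # speaker classifier

def speaker_probs(z):
    # Softmax over speaker logits.
    logits = z @ W_cls
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

p = speaker_probs(latent)
# Classifier objective: cross-entropy on the true speaker labels.
l_cls = -np.mean(np.log(p[np.arange(N_UTTS), speaker_ids] + 1e-8))
# Adversarial encoder term: the negated classifier loss, so minimising the
# encoder objective makes the latent uninformative about speaker identity.
l_enc_adv = -l_cls
print(f"classifier loss {l_cls:.3f}, encoder adversarial term {l_enc_adv:.3f}")
```

The classifier and encoder are updated in alternation; at convergence the latent carries linguistic content but (ideally) no speaker information, which is what lets the decoder swap in any target speaker embedding.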

Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

huawei-noah/Speech-Backbones ICLR 2022

Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario.

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

liusongxiang/StarGAN-Voice-Conversion NeurIPS 2019

End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.

StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

SamuelBroughton/StarGAN-Voice-Conversion-2 29 Jul 2019

To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2.

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

espnet/espnet 6 Oct 2020

This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
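The cascading recipe named in the title can be made concrete with a two-stage sketch: the source utterance is transcribed by ASR, and a TTS model conditioned on the target speaker resynthesises the transcript. Both stages below are placeholder functions (assumptions); in the actual ESPnet baseline they would be trained seq2seq networks, and the speaker label is illustrative.

```python
def asr(waveform):
    """Placeholder ASR stage: returns a transcript of the source utterance."""
    return "hello world"  # stand-in for a real recogniser's output

def tts(text, target_speaker):
    """Placeholder TTS stage: synthesises text in the target speaker's voice."""
    return f"<waveform of '{text}' spoken by {target_speaker}>"

def voice_conversion(source_waveform, target_speaker):
    # Conversion = recognition followed by speaker-conditioned synthesis;
    # the linguistic content survives the cascade as text.
    return tts(asr(source_waveform), target_speaker)

print(voice_conversion(b"...", "target-speaker-1"))
```

A design consequence of the cascade is that any information not captured in the transcript (prosody, emphasis) is discarded at the ASR stage and must be regenerated by the TTS model.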

One-class learning towards generalized voice spoofing detection

yzyouzhang/AIR-ASVspoof 27 Oct 2020

Human voices can be used to authenticate the identity of a speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks, such as impersonation, replay, text-to-speech, and voice conversion.

MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

GANtastic3/MaskCycleGAN-VC 25 Feb 2021

With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames.
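The filling-in-frames (FIF) mask described above can be sketched as follows: a contiguous span of time frames in the mel-spectrogram is zeroed, and the converter is trained to reconstruct those frames from their context. The mask width, placement, and shapes below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MEL, N_FRAMES = 80, 100
mel = rng.normal(size=(N_MEL, N_FRAMES))  # toy mel-spectrogram (mels x frames)

def temporal_mask(spec, max_width=25, rng=rng):
    # Hide one contiguous span of time frames across all mel bins.
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, spec.shape[1] - width + 1))
    mask = np.ones_like(spec)
    mask[:, start:start + width] = 0.0
    return spec * mask, mask  # masked input plus the mask itself

masked, mask = temporal_mask(mel)
print(int((mask[0] == 0).sum()), "frames hidden")
# The converter receives (masked, mask) and is trained so its output matches
# the original spectrogram on the hidden frames as well as the visible ones.
```

Passing the mask alongside the masked spectrogram lets the model know which frames are missing rather than merely silent.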

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

s3prl/s3prl 7 Apr 2021

AUTOVC used d-vectors to extract speaker information, while self-supervised learning (SSL) features such as wav2vec 2.0 are used in FragmentVC to extract the phonetic content information.

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

microsoft/speecht5 ACL 2022

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.