Voice Conversion
149 papers with code • 2 benchmarks • 5 datasets
Voice Conversion is a technology that modifies the speech of a source speaker so that it sounds like the speech of a target speaker, without changing the linguistic information.
Libraries
Use these libraries to find Voice Conversion models and implementations.

Most implemented papers
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
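The combined objective described above can be sketched as a weighted sum of a conventional minimum-generation loss and an adversarial term that rewards fooling the discriminator. This is a minimal illustrative sketch, not the paper's implementation: the function name, the MSE stand-in for the generation loss, and the weight `w_adv` are assumptions.

```python
import numpy as np

def acoustic_model_loss(y_nat, y_gen, d_gen_prob, w_adv=0.5):
    """Illustrative sketch of the combined objective: a conventional
    minimum-generation loss (MSE here, as a stand-in) plus a weighted
    adversarial loss that is low when the discriminator is deceived.

    y_nat, y_gen: natural and generated speech parameters (arrays)
    d_gen_prob:   discriminator's probability that y_gen is natural
    w_adv:        weight on the adversarial term (assumed value)
    """
    l_mg = np.mean((y_nat - y_gen) ** 2)            # generation loss
    l_adv = -np.mean(np.log(d_gen_prob + 1e-8))     # fool the discriminator
    return l_mg + w_adv * l_adv
```

When the discriminator is fully deceived (`d_gen_prob` near 1) the adversarial term vanishes and only the generation loss remains.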
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.
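The conditioning step described above, where the decoder consumes a speaker-independent content latent together with a target speaker embedding, can be sketched as a simple concatenation followed by a projection. This is a hypothetical sketch: the single linear layer stands in for the full decoder, and all names and dimensions are assumptions.

```python
import numpy as np

def decode(content_latent, speaker_emb, w, b):
    """Sketch of conditioning a decoder on both inputs: concatenate the
    speaker-independent content latent with the target speaker embedding,
    then project. A real decoder would be a deep network; one linear
    layer is used here purely for illustration."""
    z = np.concatenate([content_latent, speaker_emb])
    return w @ z + b  # output stands in for generated acoustic features

# Illustrative shapes (assumed): 8-dim content, 4-dim speaker, 80-dim output
rng = np.random.default_rng(0)
w = rng.normal(size=(80, 12))
b = np.zeros(80)
features = decode(np.ones(8), np.zeros(4), w, b)
```

Swapping in a different speaker embedding while keeping the content latent fixed is what converts the voice without altering the linguistic content.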
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario.
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2.
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
One-class learning towards generalized voice spoofing detection
Human voices can be used to authenticate the identity of a speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks such as impersonation, replay, text-to-speech, and voice conversion.
MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames
With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames.
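The temporal-masking step described above can be sketched as zeroing a contiguous span of frames in the mel-spectrogram, which the converter is then trained to fill in from the surrounding frames. This is a minimal sketch of the masking operation only; the function name, fill value, and shapes are assumptions, not the paper's code.

```python
import numpy as np

def apply_temporal_mask(mel, start, width, fill=0.0):
    """Sketch of the Filling-in-Frames (FIF) mask: blank out a contiguous
    run of frames so the converter must reconstruct them from context.

    mel:   mel-spectrogram of shape (n_mels, n_frames)
    start: index of the first masked frame
    width: number of consecutive frames to mask
    """
    masked = mel.copy()                      # leave the input untouched
    masked[:, start:start + width] = fill    # zero the masked frames
    return masked
```

During training the converter would receive the masked spectrogram and be penalized for mismatches against the original in the masked region.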
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
AUTOVC used a d-vector to extract speaker information, while FragmentVC used self-supervised learning (SSL) features such as wav2vec 2.0 to extract the phonetic content information.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.