Voice Conversion
149 papers with code • 2 benchmarks • 5 datasets
Voice Conversion is a technology that modifies the speech of a source speaker so that it sounds like the speech of a target speaker, while leaving the linguistic content unchanged.
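Most neural voice conversion systems factor speech into a speaker-independent content representation and a speaker representation, then resynthesize the content with the target speaker's identity. A minimal sketch of that data flow, with random linear maps standing in for learned networks (all dimensions and function names here are illustrative assumptions, not taken from any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned networks: random linear projections.
D_FEAT, D_CONTENT, D_SPK = 80, 64, 32   # mel bins, content dim, speaker dim
W_content = rng.standard_normal((D_FEAT, D_CONTENT)) * 0.1
W_speaker = rng.standard_normal((D_FEAT, D_SPK)) * 0.1
W_decode = rng.standard_normal((D_CONTENT + D_SPK, D_FEAT)) * 0.1

def content_encoder(mel):
    # (T, 80) -> (T, 64): frame-level, ideally speaker-independent features
    return mel @ W_content

def speaker_encoder(mel):
    # (T, 80) -> (32,): one utterance-level speaker embedding (mean pooling)
    return (mel @ W_speaker).mean(axis=0)

def decoder(content, spk):
    # Fuse every content frame with the single speaker vector, then project
    spk_tiled = np.tile(spk, (content.shape[0], 1))
    return np.concatenate([content, spk_tiled], axis=1) @ W_decode

source_mel = rng.standard_normal((120, D_FEAT))   # source utterance, 120 frames
target_mel = rng.standard_normal((200, D_FEAT))   # reference from target speaker

# Source content + target speaker identity = converted features
converted = decoder(content_encoder(source_mel), speaker_encoder(target_mel))
print(converted.shape)  # (120, 80): source timing and content, target identity
```

In a real system the converted features would be passed to a vocoder to produce a waveform; the point of the sketch is only the content/speaker factorization.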
Latest papers with no code
Transfer the linguistic representations from TTS to accent conversion with non-parallel data
This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech.
StreamVC: Real-Time Low-Latency Voice Conversion
We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech.
CoMoSVC: Consistency Model-based Singing Voice Conversion
Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre.
Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion
We introduce a two-stage pipeline to effectively train our network: Stage I utilizes inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion and content.
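Contrastive objectives of the kind mentioned above pull embeddings of positive pairs (e.g. same emotion) together and push negatives apart; a common instantiation is the InfoNCE loss. A minimal numpy sketch (the pairing scheme and dimensions are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE: -log softmax of the anchor-positive similarity
    against the anchor-negative similarities, at temperature tau."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(1)
emb = rng.standard_normal(16)                   # anchor embedding
pos = emb + 0.05 * rng.standard_normal(16)      # positive: slightly perturbed
negs = [rng.standard_normal(16) for _ in range(8)]  # unrelated negatives

loss_good = info_nce(emb, pos, negs)            # positive is aligned: low loss
loss_bad = info_nce(emb, negs[0], [pos] + negs[1:])  # mislabeled pair: high loss
print(loss_good < loss_bad)  # True
```

Minimizing this loss over many anchors is what shapes an embedding space in which, here, emotion is clustered while other factors are pushed toward separate representations.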
AE-Flow: AutoEncoder Normalizing Flow
The results show that the proposed training paradigm systematically improves speaker similarity and naturalness when compared to regular training methods of normalizing flows.
Exploring data augmentation in bias mitigation against non-native-accented speech
We aim to mitigate the bias against non-native-accented Flemish in a Flemish ASR system.
Creating New Voices using Normalizing Flows
As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities.
SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention
Zero-shot voice conversion (VC) aims to convert source speech to the timbre of an arbitrary unseen target speaker while keeping the linguistic content unchanged.
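A speaker-embedding-free design of the kind the title describes replaces the single speaker vector with cross attention: each source frame queries the target reference frames directly. A minimal scaled dot-product sketch (the tensor shapes and function names are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product cross attention: every source frame (query)
    gathers timbre information from the reference frames (keys=values)."""
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (T_src, T_ref)
    weights = softmax(scores, axis=-1)                # rows sum to 1
    return weights @ keys_values                      # (T_src, d)

rng = np.random.default_rng(2)
source_content = rng.standard_normal((100, 64))  # content frames of source speech
target_frames = rng.standard_normal((250, 64))   # reference frames, target speaker

timbre = cross_attention(source_content, target_frames)
print(timbre.shape)  # (100, 64): per-frame timbre pulled from the reference
```

Compared with a pooled speaker embedding, attending to the full reference lets the model use time-varying cues from the target utterance rather than a single averaged vector.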
PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models
Perceptual modification of voice is an elusive goal.
Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes
From the publicly available speech dataset LibriTTS, we also created a separate database of audio-only deepfakes, LibriTTS-DF, using several recent text-to-speech methods: YourTTS, Adaspeech, and TorToiSe.