Expressive Speech Synthesis
11 papers with code • 0 benchmarks • 0 datasets
Latest papers with no code
Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning
This paper aims to build a multi-speaker expressive TTS system that synthesizes a target speaker's speech with multiple styles and emotions.
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversation makes speech more human-like than reading-style speech.
Cross-lingual Prosody Transfer for Expressive Machine Dubbing
Prosody transfer is well-studied in the context of expressive speech synthesis.
Ensemble prosody prediction for expressive speech synthesis
Generating expressive speech with rich and varied prosody continues to be a challenge for Text-to-Speech.
On granularity of prosodic representations in expressive text-to-speech
In expressive speech synthesis, it is common practice to use latent prosody representations to handle the variability of the data during training.
Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling
This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers.
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis
A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference.
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis
We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases to capture diverse prosodic representations.
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis
In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner.
Fine-grained Noise Control for Multispeaker Speech Synthesis
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker, and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e., linguistic content, prosody, and timbre, from any residual factors such as recording conditions and background noise. This paper proposes unsupervised, interpretable, and fine-grained noise and prosody modeling.
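The factorization described above can be sketched roughly as follows. This is a minimal illustration, not any specific paper's method: the three encoders are stubbed with random projections, and all names and dimensions are hypothetical; a real model would learn these encoders and decode the combined features to a mel-spectrogram.

```python
# Hypothetical sketch of a TTS acoustic model conditioned on disentangled
# factors (linguistic content, speaker timbre, prosody). Encoders are
# stand-ins: real systems would learn them from data.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (arbitrary choice for illustration)

def content_encoder(phonemes):
    # Linguistic content -> one feature vector per phoneme.
    return rng.standard_normal((len(phonemes), D))

def speaker_embedding(speaker_id):
    # Global timbre vector for the given speaker.
    return rng.standard_normal(D)

def prosody_encoder(reference_audio):
    # Utterance-level prosody/style vector from a reference.
    return rng.standard_normal(D)

def decode(content, speaker, prosody):
    # Broadcast the global factors over the phoneme axis and combine.
    # Because the factors are separate inputs, each can be swapped
    # independently (e.g. keep the speaker, change the prosody).
    return content + speaker[None, :] + prosody[None, :]

phonemes = ["HH", "AH", "L", "OW"]
frames = decode(content_encoder(phonemes),
                speaker_embedding(3),
                prosody_encoder("reference.wav"))
print(frames.shape)  # one conditioned feature row per input phoneme
```

Swapping only the `prosody_encoder` input while keeping the speaker embedding fixed is the disentanglement property these works aim for: style transfer without changing timbre.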