Expressive Speech Synthesis

11 papers with code • 0 benchmarks • 0 datasets

Expressive speech synthesis extends text-to-speech beyond neutral, reading-style output, aiming to generate speech with controllable and natural prosody, speaking style, and emotion.

Latest papers with no code

Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

no code yet • 26 Oct 2023

This paper aims to build a multi-speaker expressive TTS system that synthesizes a target speaker's speech in multiple styles and emotions.
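
For a rough illustration of the contrastive ingredient such systems build on, here is a minimal PyTorch sketch of a generic supervised-contrastive loss over style embeddings. The function name, dimensions, and the choice to apply it only to the labeled part of a semi-supervised batch are illustrative assumptions, not details of this paper.

```python
import torch
import torch.nn.functional as F

def contrastive_style_loss(emb: torch.Tensor, labels: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Pull same-label style embeddings together, push others apart."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t() / temperature                    # (N, N) similarities
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))         # drop self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    n_pos = pos.sum(dim=1)
    keep = n_pos > 0                                     # anchors with positives
    return -((log_prob * pos).sum(dim=1)[keep] / n_pos[keep]).mean()

# Toy usage on the labeled part of a semi-supervised batch:
emb = torch.randn(8, 64)                                 # style embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])          # emotion/style IDs
loss = contrastive_style_loss(emb, labels)
```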

Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis

no code yet • 31 Aug 2023

The spontaneous behavior that often occurs in conversations makes speech more human-like than reading-style speech.

Cross-lingual Prosody Transfer for Expressive Machine Dubbing

no code yet • 20 Jun 2023

Prosody transfer is well-studied in the context of expressive speech synthesis.

Ensemble prosody prediction for expressive speech synthesis

no code yet • 3 Apr 2023

Generating expressive speech with rich and varied prosody continues to be a challenge for Text-to-Speech.
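
For a flavor of what an ensemble over prosody predictors can look like, the sketch below averages per-phoneme (pitch, energy, duration) predictions from several independently initialized members. The member architecture and the averaging rule are assumptions for illustration, not this paper's method.

```python
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """One ensemble member: phoneme encodings -> (pitch, energy, duration)."""
    def __init__(self, d_in: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def ensemble_prosody(members, phon: torch.Tensor) -> torch.Tensor:
    # Average the members' per-phoneme predictions; a variant could instead
    # sample one member per utterance to increase prosodic variety.
    preds = torch.stack([m(phon) for m in members])  # (M, batch, T, 3)
    return preds.mean(dim=0)

members = [ProsodyPredictor() for _ in range(3)]
out = ensemble_prosody(members, torch.randn(2, 50, 256))  # -> (2, 50, 3)
```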

On granularity of prosodic representations in expressive text-to-speech

no code yet • 26 Jan 2023

In expressive speech synthesis, latent prosody representations are widely used to handle the variability of the data during training.
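
A common realization of such latent prosody representations is a VAE-style reference encoder; the following is a minimal PyTorch sketch, where the single-GRU design and layer sizes are illustrative assumptions rather than this paper's model.

```python
import torch
import torch.nn as nn

class ProsodyReferenceEncoder(nn.Module):
    """GRU over reference mel frames -> Gaussian prosody latent (VAE-style)."""
    def __init__(self, n_mels: int = 80, hidden: int = 128, z_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels) reference spectrogram
        _, h = self.rnn(mel)                 # final state summarizes prosody
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

enc = ProsodyReferenceEncoder()
z, mu, logvar = enc(torch.randn(4, 200, 80))  # 4 utterances, 200 frames each
```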

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

no code yet • 19 Nov 2022

This paper aims to synthesize the target speaker's speech with a desired speaking style and emotion by transferring them from reference speech recorded by other speakers.
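
One widely used device for this kind of factor decoupling is a gradient reversal layer that pushes speaker identity out of a style/emotion embedding, so the factors can be recombined across speakers. The sketch below is a generic implementation of that device, not necessarily the mechanism used in this paper.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # no gradient for lam

def decouple_from_speaker(style_emb: torch.Tensor, lam: float = 1.0):
    # Route the reversed embedding into a speaker classifier during training;
    # the reversed gradients strip speaker identity out of the style embedding.
    return GradReverse.apply(style_emb, lam)
```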

Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

no code yet • 2 Nov 2022

A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal, which are then modeled by a prior distribution during inference.
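
The pattern described here, sampling prosody latents from a learned prior at inference, can be sketched with a small autoregressive Gaussian prior over per-phoneme latents. Everything below (the GRU cell, dimensions, Gaussian outputs) is an illustrative assumption, not this paper's exact model.

```python
import torch
import torch.nn as nn

class ARProsodyPrior(nn.Module):
    """Autoregressive Gaussian prior over per-phoneme prosody latents."""
    def __init__(self, d_phon: int = 256, z_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.cell = nn.GRUCell(d_phon + z_dim, hidden)
        self.out = nn.Linear(hidden, 2 * z_dim)
        self.z_dim = z_dim

    @torch.no_grad()
    def sample(self, phon: torch.Tensor) -> torch.Tensor:
        # phon: (batch, T, d_phon) phoneme encodings -> z: (batch, T, z_dim)
        B, T, _ = phon.shape
        h = phon.new_zeros(B, self.cell.hidden_size)
        z = phon.new_zeros(B, self.z_dim)
        zs = []
        for t in range(T):  # each latent is conditioned on the previous one
            h = self.cell(torch.cat([phon[:, t], z], dim=-1), h)
            mu, logvar = self.out(h).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            zs.append(z)
        return torch.stack(zs, dim=1)

prior = ARProsodyPrior()
z = prior.sample(torch.randn(2, 30, 256))  # -> (2, 30, 16)
```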

Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis

no code yet • 1 Nov 2022

We show that the fine-grained latent space also captures coarse-grained information, an effect that becomes more evident as the dimensionality of the latent space increases to capture diverse prosodic representations.
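
One simple way to read an utterance-level representation out of token-level latents is length-masked mean pooling, as in the sketch below; this aggregation is an assumption for illustration, not the paper's prediction mechanism.

```python
import torch

def utterance_from_tokens(token_z: torch.Tensor,
                          lengths: torch.Tensor) -> torch.Tensor:
    """Length-masked mean pool of token-level latents -> utterance vector."""
    B, T, _ = token_z.shape
    mask = torch.arange(T)[None, :] < lengths[:, None]   # (B, T) valid tokens
    pooled = (token_z * mask.unsqueeze(-1)).sum(dim=1)
    return pooled / lengths[:, None].clamp(min=1)

z_utt = utterance_from_tokens(torch.randn(2, 10, 16), torch.tensor([10, 7]))
```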

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

no code yet • 25 Jun 2022

In this paper, we propose a novel framework for learning style representation from abundant plain text in a self-supervised manner.
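
To make the idea concrete, a context-aware style vector can be derived from plain text with a pre-trained masked language model; in the sketch below, the bert-base-uncased checkpoint and the mean-pooling projection into a style space are illustrative assumptions, not this paper's setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")
proj = torch.nn.Linear(lm.config.hidden_size, 128)  # style-space projection

def style_vector(text: str) -> torch.Tensor:
    # Mean-pool the contextual token states into one style vector.
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, tokens, hidden)
    return proj(hidden.mean(dim=1))              # (1, 128)

style = style_vector("I can't believe we actually won!")
```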

Fine-grained Noise Control for Multispeaker Speech Synthesis

no code yet • 11 Apr 2022

A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre, from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling.
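
A minimal version of fine-grained (frame-level) noise modeling is a small convolutional encoder whose per-frame latents are kept separate from the content/speaker/prosody streams, so they can be zeroed at inference to produce clean speech. The sketch below is an assumed interface, not this paper's exact architecture.

```python
import torch
import torch.nn as nn

class FrameNoiseEncoder(nn.Module):
    """Per-frame noise latents, kept separate from content/speaker/prosody."""
    def __init__(self, n_mels: int = 80, z_dim: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, z_dim, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> noise latents: (batch, frames, z_dim)
        return self.conv(mel.transpose(1, 2)).transpose(1, 2)

enc = FrameNoiseEncoder()
noise_z = enc(torch.randn(2, 120, 80))
clean_z = torch.zeros_like(noise_z)  # zero at inference for denoised output
```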