Search Results for author: RJ Skerry-Ryan

Found 16 papers, 7 papers with code

Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

no code implementations • 24 May 2023 • Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, Michelle Tadmor Ramanovich

Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a `cross-modal' chain-of-thought within a single decoding pass.

Language Modelling Question Answering +3

Paper
Add Code

Learning the joint distribution of two sequences using little or no paired data

no code implementations • 6 Dec 2022 • Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan

We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available.

Variational Inference

Paper
Add Code

Speaker Generation

no code implementations • 7 Nov 2021 • Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task.

Transfer Learning

Paper
Add Code

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

no code implementations • 6 Nov 2020 • Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma

We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.

Speech Synthesis Text-To-Speech Synthesis

Paper
Add Code

Non-saturating GAN training as divergence minimization

no code implementations • 15 Oct 2020 • Matt Shannon, Ben Poole, Soroosh Mariooryad, Tom Bagby, Eric Battenberg, David Kao, Daisy Stanton, RJ Skerry-Ryan

Non-saturating generative adversarial network (GAN) training is widely used and has continued to obtain groundbreaking results.

Generative Adversarial Network

Paper
Add Code

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

3 code implementations • 23 Oct 2019 • Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text.

Speech Synthesis

29,183

Paper
Code

Semi-Supervised Generative Modeling for Controllable Speech Synthesis

no code implementations • ICLR 2020 • Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models.

Speech Synthesis

Paper
Add Code

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

4 code implementations • 9 Jul 2019 • Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages.

Speech Synthesis Voice Cloning

10,134

Paper
Code

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

1 code implementation • 8 Jun 2019 • Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods.

Expressive Speech Synthesis Style Transfer

Paper
Code

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations • 30 Aug 2018 • Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Speech Synthesis

Paper
Add Code

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

no code implementations • 4 Aug 2018 • Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.

Speech Synthesis Text-To-Speech Synthesis

Paper
Add Code

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

2 code implementations • ICML 2018 • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.

Expressive Speech Synthesis

368

Paper
Code

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

11 code implementations • ICML 2018 • Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.

Speech Synthesis Style Transfer +1

10,131

Paper
Code

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

30 code implementations • 16 Dec 2017 • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.

Ranked #2 on Speech Synthesis on North American English

Speech Synthesis

29,183

Paper
Code

Uncovering Latent Style Factors for Expressive Speech Synthesis

no code implementations • 1 Nov 2017 • Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

Prosodic modeling is a core problem in speech synthesis.

Expressive Speech Synthesis

Paper
Add Code

Tacotron: Towards End-to-End Speech Synthesis

29 code implementations • 29 Mar 2017 • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Ranked #5 on Speech Synthesis on North American English

Audio Synthesis Speech Synthesis +1

50,743

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.