no code implementations • 6 Dec 2022 • Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available.
1 code implementation • 26 May 2022 • Ehsan Variani, Ke Wu, Michael Riley, David Rybach, Matt Shannon, Cyril Allauzen
We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition.
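Label bias arises because a locally normalized model applies a softmax over successors at every step, so a state with few outgoing options keeps high probability regardless of how badly its continuation fits the input; a globally normalized model scores complete sequences and normalizes once. A toy numerical sketch of this contrast (purely illustrative numbers, not the GNAT model itself):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

# Toy lattice: from the start state we can emit 'a' or 'b'; after 'a' the only
# continuation is 'x', so a locally normalized model is forced to assign
# P('x' | 'a') = 1 no matter how poorly 'x' matches the input.
local_step1 = softmax([2.0, 1.0])      # P('a'), P('b') at step 1
p_ax_local = local_step1[0] * 1.0      # forced P('x' | 'a') = 1

# Globally normalized: score complete paths and normalize once, so a bad
# score for 'x' can still penalize the whole path 'ax'.
path_scores = {"ax": 2.0 + (-3.0), "by": 1.0 + 0.5}
z = sum(math.exp(s) for s in path_scores.values())
p_ax_global = math.exp(path_scores["ax"]) / z

print(p_ax_local > p_ax_global)  # the local model cannot down-weight 'ax'
```

The global model can use the poor continuation score to demote the whole path, which is the effect a per-step softmax cannot achieve.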
no code implementations • 7 Nov 2021 • Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao
We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task.
no code implementations • 15 Oct 2020 • Matt Shannon, Ben Poole, Soroosh Mariooryad, Tom Bagby, Eric Battenberg, David Kao, Daisy Stanton, RJ Skerry-Ryan
Non-saturating generative adversarial network (GAN) training is widely used and continues to obtain groundbreaking results.
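As background for this entry: the non-saturating generator objective maximizes log D(G(z)) instead of minimizing the minimax loss log(1 - D(G(z))), which keeps generator gradients large early in training when the discriminator confidently rejects samples. A minimal numerical sketch of that gradient difference (toy values, not the paper's analysis):

```python
import math

def saturating_gen_loss(d_fake):
    # Minimax generator loss: log(1 - D(G(z))); its gradient vanishes as D(G(z)) -> 0.
    return math.log(1.0 - d_fake)

def non_saturating_gen_loss(d_fake):
    # Non-saturating loss: -log D(G(z)); its gradient stays large when D rejects samples.
    return -math.log(d_fake)

# Compare gradient magnitudes w.r.t. d_fake when the discriminator is
# confident that the sample is fake (d_fake close to 0).
d_fake = 1e-3
grad_sat = -1.0 / (1.0 - d_fake)   # d/dd log(1 - d)  ~ -1
grad_nonsat = -1.0 / d_fake        # d/dd (-log d)    ~ -1000

loss_sat = saturating_gen_loss(d_fake)
loss_nonsat = non_saturating_gen_loss(d_fake)
print(abs(grad_nonsat) > abs(grad_sat))
```

The roughly thousand-fold larger gradient is why the non-saturating variant is the default in practice, even though the minimax variant is easier to analyze.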
no code implementations • 2 Sep 2020 • Matt Shannon
In this technical report we describe some properties of f-divergences and f-GAN training.
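For context, an f-divergence is defined as D_f(P || Q) = Σ_x q(x) f(p(x)/q(x)) for a convex f with f(1) = 0; different choices of f recover KL, Jensen-Shannon, and other divergences. A small self-contained sketch verifying that f(t) = t log t recovers KL divergence (toy distributions, hypothetical helper names):

```python
import math

def f_divergence(p, q, f):
    # D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), with f convex and f(1) = 0.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

kl_f = lambda t: t * math.log(t)  # the generator f for KL(P || Q)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_via_f = f_divergence(p, q, kl_f)
kl_direct = sum(px * math.log(px / qx) for px, qx in zip(p, q))
print(abs(kl_via_f - kl_direct) < 1e-12)  # the two computations agree
```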
3 code implementations • 23 Oct 2019 • Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text.
no code implementations • ICLR 2020 • Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby
We present a novel generative model that combines state-of-the-art neural text-to-speech (TTS) with semi-supervised probabilistic latent variable models.
no code implementations • 25 Sep 2019 • Matt Shannon
The original variant is theoretically easier to study, but for GANs the alternative variant performs better in practice.
1 code implementation • 8 Jun 2019 • Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby
Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods.
no code implementations • 8 Jun 2017 • Matt Shannon
State-level minimum Bayes risk (sMBR) training has become the de facto standard for sequence-level training of speech recognition acoustic models.
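sMBR training minimizes the expected state-level error over competing hypotheses, each weighted by its posterior probability under the model. A toy sketch of that expected-risk computation over an explicit hypothesis list (real systems sum over lattices with forward-backward; all names and scores here are hypothetical):

```python
import math

def smbr_risk(hyp_scores, ref, risk_fn):
    # Expected risk: sum over hypotheses of posterior(hyp) * risk(hyp, ref),
    # with posteriors obtained by normalizing exponentiated scores.
    z = sum(math.exp(s) for s in hyp_scores.values())
    return sum(math.exp(s) / z * risk_fn(hyp, ref)
               for hyp, s in hyp_scores.items())

def state_errors(hyp, ref):
    # Toy state-level risk: number of mismatched states (equal-length sequences).
    return sum(h != r for h, r in zip(hyp, ref))

# Three competing state sequences with unnormalized log scores.
hyps = {"aab": 3.0, "abb": 1.0, "bbb": 0.0}
risk = smbr_risk(hyps, ref="aab", risk_fn=state_errors)
print(0.0 < risk < 3.0)
```

Training lowers this expected risk by shifting posterior mass toward low-error hypotheses, which is what distinguishes sMBR from pure likelihood-based objectives.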