Search Results for author: Gary Wang

Found 12 papers, 0 papers with code

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

no code implementations • 29 Feb 2024 • Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth).

Representation Learning Speech Synthesis

High-precision Voice Search Query Correction via Retrievable Speech-text Embedings

no code implementations • 8 Jan 2024 • Christopher Li, Gary Wang, Kyle Kastner, Heng Su, Allen Chen, Andrew Rosenberg, Zhehuai Chen, Zelin Wu, Leonid Velikovich, Pat Rondon, Diamantino Caseiro, Petar Aleksic

In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio; the embeddings of the utterance audio and candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of the audio of an utterance and the embedding of its corresponding textual transcript close together.
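The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings, the function names `cosine_similarity` and `retrieve_correction`, and the 0.9 acceptance threshold are all assumptions made for illustration. It only shows the general idea of querying a database of text-side embeddings with an audio-side embedding that lives in the same space.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_correction(audio_embedding, correction_db, threshold=0.9):
    """Return (text, score) for the closest candidate correction.

    The text is None when the best score falls below `threshold`.
    The threshold is a hypothetical stand-in for a precision-oriented
    acceptance rule; the paper's actual criterion is not given here.
    """
    best_text, best_score = None, -1.0
    for text, text_embedding in correction_db.items():
        score = cosine_similarity(audio_embedding, text_embedding)
        if score > best_score:
            best_text, best_score = text, score
    if best_score >= threshold:
        return best_text, best_score
    return None, best_score


# Toy database: candidate corrections mapped to made-up text embeddings.
correction_db = {
    "call mom": [0.9, 0.1, 0.0],
    "play jazz": [0.0, 0.2, 0.95],
}

# Made-up audio embedding that lies close to "call mom".
audio_embedding = [0.88, 0.15, 0.05]
match, score = retrieve_correction(audio_embedding, correction_db)

# A query that is not close to any candidate is rejected.
no_match, low_score = retrieve_correction([0.5, 0.5, 0.5], correction_db)
```

In this sketch the lookup never touches an ASR hypothesis, which is the point made in the abstract: the query is the audio embedding itself, so a misrecognized transcript cannot cause the lookup to miss.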

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Using Text Injection to Improve Recognition of Personal Identifiers in Speech

no code implementations • 14 Aug 2023 • Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran

We use text injection to improve the recognition of PII categories by including fake textual substitutes of those categories in the training data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Understanding Shared Speech-Text Representations

no code implementations • 27 Apr 2023 • Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

no code implementations • 2 Mar 2023 • Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Modular Hybrid Autoregressive Transducer

no code implementations • 31 Oct 2022 • Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno

In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder.

Decoder Language Modelling +2

Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

no code implementations • 27 Oct 2022 • Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran

This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

no code implementations • 19 Oct 2022 • Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training.

Ranked #1 on Speech Recognition on CHiME-6 eval

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Non-Parallel Voice Conversion for ASR Augmentation

no code implementations • 15 Sep 2022 • Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yinghui Huang, Jesse Emond, Pedro Moreno Mengibar

For ASR augmentation, it is necessary that the VC model be robust to a wide range of input speech.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

no code implementations • 16 May 2022 • Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition.

Accented Speech Recognition Benchmarking +1

Injecting Text in Self-Supervised Speech Pretraining

no code implementations • 27 Aug 2021 • Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text.

Contrastive Learning Language Modelling +2

Deep Text-to-Speech System with Seq2Seq Model

no code implementations • 11 Mar 2019 • Gary Wang

Recent trends in neural network based text-to-speech/speech synthesis pipelines have employed recurrent Seq2seq architectures that can synthesize realistic sounding speech directly from text characters.

Speech Synthesis
