no code implementations • 19 Sep 2023 • Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
Overall, we demonstrate that by only adding a handful number of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
no code implementations • 22 Jul 2023 • Suyoun Kim, Akshat Shrivastava, Duc Le, Ju Lin, Ozlem Kalinli, Michael L. Seltzer
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently.
no code implementations • 15 Dec 2022 • Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le
Experiments on Librispeech and in-house data show relative WER reductions (WERRs) from 3% to 5% with a slight increase in model size and negligible extra token emission latency compared with fast-slow encoder based transducer.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 10 Nov 2022 • Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer
Later, we use our optimal tokenization strategy to train multiple embedding and output model to further improve our result.
no code implementations • 2 Nov 2022 • Duc Le, Frank Seide, Yuhao Wang, Yang Li, Kjell Schubert, Ozlem Kalinli, Michael L. Seltzer
We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy.
no code implementations • 4 Apr 2022 • Duc Le, Akshat Shrivastava, Paden Tomasello, Suyoun Kim, Aleksandr Livshits, Ozlem Kalinli, Michael L. Seltzer
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU), where a streaming automatic speech recognition (ASR) model produces the first-pass hypothesis and a second-pass natural language understanding (NLU) component generates the semantic parse by conditioning on both ASR's text and audio embeddings.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
no code implementations • 28 Jan 2022 • Antoine Bruguier, Duc Le, Rohit Prabhavalkar, Dangna Li, Zhe Liu, Bo wang, Eun Chang, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer
We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition, a novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework.
no code implementations • 11 Oct 2021 • Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
no code implementations • 16 Jun 2021 • Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra
On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets.
no code implementations • 6 Apr 2021 • Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer
In order to achieve flexible and better accuracy and latency trade-offs, the following techniques are used.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 6 Apr 2021 • Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer
As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 5 Apr 2021 • Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer
How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area.
no code implementations • 5 Apr 2021 • Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
DET gets similar accuracy as a baseline model with better latency on a large in-house data set by assigning a lightweight encoder for the beginning part of one utterance and a full-size encoder for the rest.
no code implementations • 5 Apr 2021 • Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
We define SemDist as the distance between a reference and hypothesis pair in a sentence-level embedding space.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +14
no code implementations • 23 Feb 2021 • Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra
Recurrent transducer models have emerged as a promising solution for speech recognition on the current and next generation smart devices.
no code implementations • 16 Nov 2020 • Duc Le, Gil Keren, Julian Chan, Jay Mahadeokar, Christian Fuegen, Michael L. Seltzer
End-to-end models in general, and Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 5 Nov 2020 • Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer
There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 3 Nov 2020 • Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer
Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
no code implementations • 26 Oct 2020 • Suyoun Kim, Yuan Shangguan, Jay Mahadeokar, Antoine Bruguier, Christian Fuegen, Michael L. Seltzer, Duc Le
Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training.
no code implementations • 18 May 2020 • Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer
Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR).
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 27 Nov 2019 • Yi-Chen Chen, Zhaojun Yang, Ching-Feng Yeh, Mahaveer Jain, Michael L. Seltzer
As one of the major sources in speech variability, accents have posed a grand challenge to the robustness of speech recognition systems.
no code implementations • 5 Nov 2019 • Mahaveer Jain, Kjell Schubert, Jay Mahadeokar, Ching-Feng Yeh, Kaustubh Kalgaonkar, Anuroop Sriram, Christian Fuegen, Michael L. Seltzer
Neural transducer-based systems such as RNN Transducers (RNN-T) for automatic speech recognition (ASR) blend the individual components of a traditional hybrid ASR systems (acoustic model, language model, punctuation model, inverse text normalization) into one single model.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
1 code implementation • 28 Oct 2019 • Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer
We explore options to use Transformer networks in neural transducer for end-to-end speech recognition.
no code implementations • 22 Oct 2019 • Duc Le, Thilo Koehler, Christian Fuegen, Michael L. Seltzer
Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 22 Oct 2019 • Yongqiang Wang, Abdel-rahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer
We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition.
Ranked #23 on Speech Recognition on LibriSpeech test-other (using extra training data)
no code implementations • 2 Oct 2019 • Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fügen, Geoffrey Zweig, Michael L. Seltzer
There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance, especially on English which has poor grapheme-phoneme correspondence.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 5 Dec 2018 • Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L. Seltzer, Christian Fuegen
In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because it introduces significant mismatch between training and test data.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
1 code implementation • 6 Nov 2017 • Suyoun Kim, Michael L. Seltzer, Jinyu Li, Rui Zhao
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training.
no code implementations • 6 Nov 2017 • Suyoun Kim, Michael L. Seltzer
Building speech recognizers in multiple languages typically involves replicating a monolingual training recipe for each language, or utilizing a multi-task learning approach where models for different languages have separate output labels but share some internal parameters.
no code implementations • 17 Aug 2017 • Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, Yifan Gong
High accuracy speech recognition requires a large amount of transcribed data for supervised training.