1 code implementation • 27 Dec 2023 • Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg
We also showed that training a model with multiple latencies achieves better accuracy than single-latency models, while enabling a single model to support multiple latencies.
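A minimal sketch of what multi-latency training could look like: at each step, one of the supported right-context sizes is sampled and turned into an attention mask so that a single model learns to run at several latencies. The context sizes and function names here are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Assumed set of supported future-context sizes, in frames; smaller
# right context means lower streaming latency.
SUPPORTED_CONTEXTS = [0, 4, 16]

def attention_mask(num_frames, right_context):
    """Frame t may attend to frames j <= t + right_context (1 = allowed)."""
    return [[1 if j <= t + right_context else 0 for j in range(num_frames)]
            for t in range(num_frames)]

def sample_training_mask(num_frames, rng=random):
    """Sample one latency setting for this training step and build its mask."""
    ctx = rng.choice(SUPPORTED_CONTEXTS)
    return ctx, attention_mask(num_frames, ctx)
```

At inference time the same model would be paired with whichever mask matches the deployment's latency budget.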
no code implementations • 18 Oct 2023 • Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg
We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays.
no code implementations • 18 Oct 2023 • Tae Jin Park, He Huang, Coleman Hooper, Nithin Koluguri, Kunal Dhawan, Ante Jukic, Jagadeesh Balam, Boris Ginsburg
This capability offers a tailored training environment for developing neural models suited for speaker diarization and voice activity detection.
1 code implementation • 13 Oct 2023 • Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities.
Automatic Speech Recognition (ASR) +3
no code implementations • 19 Sep 2023 • Krishna C. Puvvada, Nithin Rao Koluguri, Kunal Dhawan, Jagadeesh Balam, Boris Ginsburg
Discrete audio representation, a.k.a. audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language modeling approaches in the audio domain.
no code implementations • 18 Sep 2023 • Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg
This paper presents an overview and evaluation of some of the end-to-end ASR models on long-form audio.
Automatic Speech Recognition (ASR) +1
no code implementations • 11 Sep 2023 • Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam
In addition, these findings point to the potential of using LLMs to improve speaker diarization and other speech processing tasks by capturing semantic and contextual cues.
no code implementations • 13 Jul 2023 • He Huang, Jagadeesh Balam, Boris Ginsburg
We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1.
no code implementations • 8 May 2023 • Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg
Conformer-based models have become the dominant end-to-end architecture for speech processing tasks.
Ranked #1 on Speech Recognition on LibriSpeech test-other
no code implementations • 27 Oct 2022 • Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture.
no code implementations • 30 Mar 2022 • Tae Jin Park, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg
First, we use multi-scale clustering as an initialization to estimate the number of speakers and obtain the average speaker representation vector for each speaker and each scale.
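The initialization step above can be sketched as follows: given per-scale segment embeddings and cluster labels from an initial clustering, average the embeddings per speaker at each scale. This is an illustrative sketch under assumed data shapes, not the authors' code.

```python
import numpy as np

def average_speaker_vectors(embeddings_per_scale, labels_per_scale):
    """Average speaker representation per (scale, speaker).

    embeddings_per_scale: list of (num_segments, dim) arrays, one per scale.
    labels_per_scale: list of int arrays mapping each segment to a speaker.
    Returns one {speaker_id: mean_vector} dict per scale.
    """
    averages = []
    for emb, labels in zip(embeddings_per_scale, labels_per_scale):
        speakers = sorted(set(labels.tolist()))
        # Mean of all segment embeddings assigned to each speaker at this scale.
        averages.append({s: emb[labels == s].mean(axis=0) for s in speakers})
    return averages
```

The estimated number of speakers simply falls out as the number of distinct cluster labels per scale.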
no code implementations • 22 Jul 2021 • Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg
The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network.
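The idea can be illustrated with a toy sketch: several shallow branches process the input in parallel and their outputs are combined, here by a plain uniform average (the actual mixture weighting in the paper may differ; names and shapes are assumptions).

```python
import numpy as np

def shallow_branch(x, w):
    """One shallow branch: a single linear layer followed by ReLU."""
    return np.maximum(0.0, x @ w)

def parallel_mixture(x, branch_weights):
    """Run all shallow branches in parallel and average their outputs,
    instead of stacking many layers into one very deep network."""
    outputs = [shallow_branch(x, w) for w in branch_weights]
    return np.mean(outputs, axis=0)
```

Because the branches are independent, they can be evaluated concurrently, which is part of the appeal over a sequential deep stack.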
Automatic Speech Recognition (ASR) +1
no code implementations • 5 Apr 2021 • Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, Boris Ginsburg
We propose Citrinet, a new end-to-end convolutional Connectionist Temporal Classification (CTC)-based automatic speech recognition (ASR) model.
Automatic Speech Recognition (ASR) +1
1 code implementation • 5 Apr 2021 • Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models.
Ranked #3 on Speech Recognition on SPGISpeech