1 code implementation • 1 Nov 2023 • Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers.
no code implementations • 4 Aug 2023 • Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language.
no code implementations • 15 Jun 2023 • Rohit Paturi, Sundararajan Srinivasan, Xiang Li
Speaker diarization (SD) is typically used with an automatic speech recognition (ASR) system to ascribe speaker labels to recognized words.
no code implementations • 23 Nov 2022 • Dhanush Bekal, Sundararajan Srinivasan, Sravan Bodapati, Srikanth Ronanki, Katrin Kirchhoff
In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins.
no code implementations • 10 Dec 2021 • Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff, Daniel Garcia-Romero
Also, most of these models are trained with synthetic mixtures and do not generalize to real conversational data.
no code implementations • 30 Nov 2021 • Sundararajan Srinivasan, Zhaocheng Huang, Katrin Kirchhoff
To improve the efficacy of our approach, we propose a novel estimate of the quality of the emotion predictions, to condition teacher-student training.
no code implementations • 10 Jun 2021 • Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff
Determining the cause of diarization errors is difficult because speaker voice acoustics and conversation structure co-vary, and the interactions between acoustics, conversational structure, and diarization accuracy are complex.
no code implementations • 10 Mar 2021 • Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau
Training deep neural networks for automatic speech recognition (ASR) requires large amounts of transcribed speech.