no code implementations • 31 Mar 2024 • Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei
In this work, we introduce WavLLM, a robust and adaptive speech large language model with dual encoders and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach.
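As a rough illustration of the prompt-aware adapter idea, the sketch below gates a standard LoRA low-rank update with a weight predicted from a prompt embedding. This is a hypothetical minimal sketch, not the paper's implementation; the class, parameter names, and gating mechanism are assumptions.

```python
import torch
import torch.nn as nn

class PromptAwareLoRALinear(nn.Module):
    """Hypothetical sketch of a prompt-aware LoRA layer.

    A frozen base linear layer is combined with a low-rank update (B @ A)
    whose contribution is scaled by a gate predicted from a prompt embedding.
    The actual WavLLM adapter may differ.
    """

    def __init__(self, d_in=512, d_out=512, rank=8, d_prompt=256):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)  # stands in for a frozen pretrained weight
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
        # small network mapping the prompt embedding to a gate in [0, 1]
        self.gate = nn.Sequential(nn.Linear(d_prompt, 1), nn.Sigmoid())

    def forward(self, x, prompt_emb):
        g = self.gate(prompt_emb)                      # (batch, 1)
        update = (x @ self.lora_A.T) @ self.lora_B.T   # low-rank path, (batch, d_out)
        return self.base(x) + g * update

# usage: a batch of encoder features plus a pooled prompt embedding
layer = PromptAwareLoRALinear()
x = torch.randn(4, 512)
p = torch.randn(4, 256)
y = layer(x, p)  # (4, 512)
```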
no code implementations • 16 Jan 2024 • Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka
The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First, a benchmarking dataset of 315 meetings, averaging 6 minutes each, capturing a broad spectrum of real-world acoustic conditions and conversational dynamics.
no code implementations • 3 Nov 2023 • Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li
With fewer than 20M trainable parameters and as little as 450 hours of English speech data for SQA generation, COSMIC exhibits emergent instruction-following and in-context learning capabilities in speech-to-text tasks.
no code implementations • 9 Nov 2022 • Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez
Compared with a supervised baseline and the WavLM-based SS model using feature embeddings obtained with the previously released 94K hours trained WavLM, our proposed model obtains 15.9% and 11.2% relative word error rate (WER) reductions, respectively, on a simulated far-field speech mixture test set.
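For reference, the relative WER reduction quoted here follows the usual definition; a minimal check with made-up baseline numbers (not taken from the paper):

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction = (baseline - new) / baseline, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# hypothetical WER values, used only to illustrate the formula
print(relative_wer_reduction(10.0, 8.41))  # ~15.9% relative reduction
```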
no code implementations • 27 Oct 2022 • Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka
Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers.
Automatic Speech Recognition (ASR) +2
no code implementations • 6 Nov 2019 • Md Sahidullah, Jose Patino, Samuele Cornell, Ruiqing Yin, Sunit Sivasankaran, Hervé Bredin, Pavel Korshunov, Alessio Brutti, Romain Serizel, Emmanuel Vincent, Nicholas Evans, Sébastien Marcel, Stefano Squartini, Claude Barras
This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team.
no code implementations • WS 2018 • Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury
Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
Automatic Speech Recognition (ASR) +1