no code implementations • 14 Dec 2023 • Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan
Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data.
no code implementations • 13 Dec 2023 • Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan
Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model.
no code implementations • 11 May 2022 • Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 11 May 2022 • Otavio Braga, Olivier Siohan
As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 10 May 2022 • Otavio Braga, Olivier Siohan
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +1
no code implementations • 25 Jan 2022 • Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
We achieve the state of the art performance of the audio-visual recognition on the LRS3-TED after fine-tuning our model (1.6% WER).
Audio-Visual Speech Recognition Automatic Speech Recognition +4
no code implementations • 20 Sep 2021 • Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end.
Audio-Visual Speech Recognition Automatic Speech Recognition +5
1 code implementation • 8 Nov 2019 • Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.
Ranked #5 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)