Search Results for author: Olivier Siohan

Found 12 papers, 1 papers with code

Audio-visual fine-tuning of audio-only ASR models

no code implementations • 14 Dec 2023 • Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data.

Automatic Speech Recognition Self-Supervised Learning +2

Paper
Add Code

Revisiting the Entropy Semiring for Neural Speech Recognition

no code implementations • 13 Dec 2023 • Oscar Chang, Dongseong Hwang, Olivier Siohan

In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation.

speech-recognition Speech Recognition

Paper
Add Code

On Robustness to Missing Video for Audiovisual Speech Recognition

no code implementations • 13 Dec 2023 • Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model.

speech-recognition Speech Recognition

Paper
Add Code

Conformers are All You Need for Visual Speech Recognition

no code implementations • 17 Feb 2023 • Oscar Chang, Hank Liao, Dmitriy Serdyuk, Ankit Shah, Olivier Siohan

We achieve a new state-of-the-art of 12. 8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.

speech-recognition Visual Speech Recognition

Paper
Add Code

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

no code implementations • 11 May 2022 • Otavio Braga, Olivier Siohan

As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

no code implementations • 11 May 2022 • Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao

Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

no code implementations • 10 May 2022 • Otavio Braga, Olivier Siohan

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

End-to-end multi-talker audio-visual ASR using an active speaker attention module

no code implementations • 1 Apr 2022 • Richard Rose, Olivier Siohan

This paper presents a new approach for end-to-end audio-visual multi-talker speech recognition.

speech-recognition Speech Recognition

Paper
Add Code

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

no code implementations • 25 Jan 2022 • Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

We achieve the state of the art performance of the audio-visual recognition on the LRS3-TED after fine-tuning our model (1. 6% WER).

Audio-Visual Speech Recognition Automatic Speech Recognition +4

Paper
Add Code

Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

no code implementations • 20 Sep 2021 • Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end.

Audio-Visual Speech Recognition Automatic Speech Recognition +5

Paper
Add Code

Bridging the gap between streaming and non-streaming ASR systems bydistilling ensembles of CTC and RNN-T models

no code implementations • 25 Apr 2021 • Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teachers' predictions.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

1 code implementation • 8 Nov 2019 • Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture.

Ranked #5 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)

Audio-Visual Speech Recognition Lipreading +2

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.