Visual Speech Recognition
40 papers with code • 2 benchmarks • 5 datasets
Latest papers
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
Audio-visual speech recognition (AVSR) has been attracting increasing attention from researchers as an important part of human-computer interaction.
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
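A minimal sketch of where a prompt enters a Whisper transcription call, assuming the open-source openai-whisper package; the prompt text and task here are illustrative only, not the prompts used in the paper.

```python
# Sketch: zero-shot task steering via prompt text, assuming openai-whisper.
import whisper

model = whisper.load_model("base")

# `initial_prompt` conditions the decoder on extra text before transcription
# begins; prompt engineering for an unseen task amounts to choosing this text
# (and the task/language special tokens) without updating any model weights.
result = model.transcribe(
    "speech.wav",  # hypothetical input file
    initial_prompt="Transcribe the following with proper punctuation.",
)
print(result["text"])
```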
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task.
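A minimal PyTorch sketch contrasting concatenation fusion with an explicit cross-modal interaction block; module names, dimensions, and the attention layout are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Baseline: fuse audio and visual features by simple concatenation."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, visual):           # each (B, T, D)
        return self.proj(torch.cat([audio, visual], dim=-1))

class CrossModalFusion(nn.Module):
    """Illustrative alternative: each stream attends to the other before
    fusion, so cross-modal correlations are modelled explicitly."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, visual):
        a, _ = self.a2v(audio, visual, visual)   # audio queries attend to visual
        v, _ = self.v2a(visual, audio, audio)    # visual queries attend to audio
        return self.proj(torch.cat([a, v], dim=-1))

audio = torch.randn(2, 100, 256)                # (batch, frames, feature dim)
visual = torch.randn(2, 100, 256)
fused = CrossModalFusion(256)(audio, visual)    # (2, 100, 256)
```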
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, i.e., the audio and visual inputs, compared to uni-modal models.
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
However, although researchers have explored cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, cross-lingual studies on visual speech remain scarce.
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation, providing 1200 hours of audio-visual speech in 9 languages.
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
Inspired by the way humans comprehend speech in a multi-modal manner, various audio-visual datasets have been constructed.
Jointly Learning Visual and Auditory Speech Representations from Raw Data
We observe strong results in both low- and high-resource labelled-data settings when fine-tuning the visual and auditory encoders obtained from a single pre-training stage in which the encoders are jointly trained.
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
The enhanced audio features are fused with the visual features and fed into an encoder-decoder model composed of Conformer and Transformer modules for speech recognition.
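A minimal sketch of such a back-end, assuming PyTorch and torchaudio's Conformer: already-enhanced audio features are fused with visual features, encoded by a Conformer, and decoded by a Transformer. All layer sizes, the linear fusion, and the vocabulary are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class AVSRBackend(nn.Module):
    """Sketch: fuse enhanced audio with visual features, then run a
    Conformer encoder and a Transformer decoder for recognition."""
    def __init__(self, feat_dim=256, vocab_size=1000):
        super().__init__()
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)
        self.encoder = torchaudio.models.Conformer(
            input_dim=feat_dim, num_heads=4, ffn_dim=1024,
            num_layers=6, depthwise_conv_kernel_size=31)
        dec_layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, audio, visual, lengths, tokens):
        fused = self.fusion(torch.cat([audio, visual], dim=-1))   # (B, T, D)
        memory, _ = self.encoder(fused, lengths)                  # Conformer encoding
        tgt = self.embed(tokens)                                  # (B, U, D)
        dec = self.decoder(tgt, memory)                           # Transformer decoding
        return self.out(dec)                                      # (B, U, vocab)

model = AVSRBackend()
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 100, 256)
lengths = torch.full((2,), 100)
tokens = torch.randint(0, 1000, (2, 20))
logits = model(audio, visual, lengths, tokens)                    # (2, 20, 1000)
```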