Lip Reading

46 papers with code • 3 benchmarks • 5 datasets

Lip reading is the task of inferring speech content from a video using only visual information, especially the lip movements. It has many crucial practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.

Source: Mutual Information Maximization for Effective Lip Reading

GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

yerfor/geneface 31 Jan 2023

Generating photo-realistic video portraits with arbitrary speech audio is a crucial problem in film-making and virtual reality.

OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset

iip-sogang/olkavs-avspeech 16 Jan 2023

Inspired by how humans comprehend speech in a multi-modal manner, various audio-visual datasets have been constructed.

Audio-Visual Efficient Conformer for Robust Speech Recognition

burchim/avec 4 Jan 2023

We improve previous lip reading methods using an Efficient Conformer back-end on top of a ResNet-18 visual front-end and by adding intermediate CTC losses between blocks.
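Intermediate CTC losses attach auxiliary prediction heads between encoder blocks so that lower layers also receive a direct supervision signal. A minimal sketch of the idea in PyTorch follows; the module and parameter names are illustrative, and plain Transformer layers stand in for the paper's Efficient Conformer blocks:

```python
# Hedged sketch of intermediate CTC taps between encoder blocks.
# Names and the use of TransformerEncoderLayer (instead of Conformer
# blocks) are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class InterCTCEncoder(nn.Module):
    """Encoder stack with a CTC projection tapped after intermediate blocks."""

    def __init__(self, d_model, num_blocks, vocab_size, inter_layers=(2,)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(num_blocks)]
        )
        self.inter_layers = set(inter_layers)
        self.ctc_head = nn.Linear(d_model, vocab_size)  # shared across taps

    def forward(self, x):  # x: (batch, time, d_model) visual features
        inter_logits = []
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if i in self.inter_layers:
                inter_logits.append(self.ctc_head(x))  # auxiliary tap
        return self.ctc_head(x), inter_logits  # final logits + intermediates
```

The training objective would then combine a CTC loss on the final logits with the averaged CTC losses on the intermediate taps.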

Lip Sync Matters: A Novel Multimodal Forgery Detector

sahibzadaadil/Lip-Sync-Matters-A-Novel-Multimodal-Forgery-Detector APSIPA ASC 2022

Deepfake technology has advanced considerably, but it is a double-edged sword for the community.

07 Nov 2022

Relaxed Attention for Transformer Models

Oguzhanercan/Vision-Transformers 20 Sep 2022

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and, for natural language processing tasks, lead to an implicitly learned internal language model in the autoregressive transformer decoder, complicating the integration of external language models.

Training Strategies for Improved Lip-reading

mpc001/Lipreading_using_Temporal_Convolutional_Networks 3 Sep 2022

In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, like self-distillation and using word boundary indicators.
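Self-distillation, one of the training strategies investigated, typically combines the ground-truth cross-entropy with a temperature-scaled KL term toward a teacher copy of the same model. A minimal sketch under that common formulation (not necessarily the paper's exact recipe; `T` and `alpha` are assumed hyperparameters):

```python
# Minimal self-distillation objective: CE on labels plus temperature-scaled
# KL divergence to a (frozen) teacher's logits. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, targets,
                           T=2.0, alpha=0.5):
    """alpha * CE(student, targets) + (1 - alpha) * T^2 * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    return alpha * ce + (1.0 - alpha) * kd
```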

Speaker-adaptive Lip Reading with User-dependent Padding

ms-dot-k/User-dependent-Padding 9 Aug 2022

In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding.

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

ms-dot-k/Multi-head-Visual-Audio-Memory AAAI 2022

With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider which pronunciations the input lip movement could represent.
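The core mechanism is addressing a learned value memory through several key memories in parallel, so that one visual query can recall multiple candidate audio features. A hedged sketch of that retrieval step (module names and per-head value slots are illustrative simplifications, not MVM's exact architecture):

```python
# Illustrative multi-head key/value memory lookup: a visual feature queries
# per-head key memories and recalls stored audio-like value slots.
# All names, sizes, and the per-head value memory are assumptions.
import torch
import torch.nn as nn

class MultiHeadKeyMemory(nn.Module):
    """Soft-attention recall from learned memory slots, one key memory per head."""

    def __init__(self, dim, slots=96, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        self.keys = nn.Parameter(torch.randn(heads, slots, self.head_dim))
        self.values = nn.Parameter(torch.randn(heads, slots, self.head_dim))

    def forward(self, visual_feat):  # visual_feat: (batch, dim)
        q = visual_feat.view(-1, self.heads, self.head_dim)          # (B, H, D)
        attn = torch.einsum("bhd,hsd->bhs", q, self.keys).softmax(dim=-1)
        recalled = torch.einsum("bhs,hsd->bhd", attn, self.values)   # soft recall
        return recalled.reshape(-1, self.heads * self.head_dim)     # (B, dim)
```

Each head can attend to different slots, which is what lets the model keep several pronunciation hypotheses alive for homophenes.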

04 Apr 2022

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

ms-dot-k/Visual-Audio-Memory ICCV 2021

By learning the interrelationship through the associative bridge, the proposed bridging framework is able to obtain the target modal representations inside the memory network, even with the source modal input only, and it provides rich information for its downstream tasks.

04 Apr 2022

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

lumia-group/leveraging-self-supervised-learning-for-avsr ACL 2022

In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding.
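The combination of CTC and seq2seq objectives is usually realized as a weighted sum of the two losses over the same transcript. A minimal sketch of that hybrid objective (the weight `lam` and the reuse of one target tensor for both branches, without sos/eos handling, are simplifying assumptions):

```python
# Hedged sketch of a hybrid CTC + attention (seq2seq) training objective.
# The interpolation weight and tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def hybrid_loss(ctc_logits, att_logits, targets, input_lens, target_lens,
                lam=0.3):
    """lam * CTC + (1 - lam) * seq2seq cross-entropy on the same transcript."""
    # CTC branch expects log-probs shaped (time, batch, vocab)
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    # Attention branch: per-token cross-entropy over decoder outputs
    att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                          targets.reshape(-1))
    return lam * ctc + (1.0 - lam) * att
```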

24 Feb 2022