Lip Reading

46 papers with code • 3 benchmarks • 5 datasets

Lip Reading is the task of inferring the speech content of a video using only visual information, especially the lip movements. It has many important practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.

Source: Mutual Information Maximization for Effective Lip Reading

Most implemented papers

Seeing wake words: Audio-visual Keyword Spotting

lilianemomeni/KWS-Net 2 Sep 2020

The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.

Lip-reading with Densely Connected Temporal Convolutional Networks

mpc001/Lipreading_using_Temporal_Convolutional_Networks 29 Sep 2020

In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words.
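A minimal sketch, assuming a PyTorch setting, of the dense connectivity idea behind a temporal convolutional block of this kind: each layer consumes the channel-wise concatenation of the block input and all previous layer outputs. The channel counts, kernel size, and number of layers below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenseTemporalBlock(nn.Module):
    """Illustrative densely connected temporal convolution block.

    Each layer receives the concatenation of the block input and all
    previous layer outputs along the channel dimension (the 'dense'
    connectivity pattern); channel counts and kernel size are placeholders.
    """

    def __init__(self, in_channels: int, growth: int = 64, num_layers: int = 3,
                 kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(channels, growth, kernel_size,
                          padding=dilation * (kernel_size - 1) // 2,
                          dilation=dilation),
                nn.BatchNorm1d(growth),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # dense concatenation widens the next layer's input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -- a sequence of per-frame lip features
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

# Example: 512-dim per-frame visual features over 29 frames
block = DenseTemporalBlock(in_channels=512)
y = block(torch.randn(2, 512, 29))
print(y.shape)  # (2, 704, 29): 512 input channels + 3 layers x 64 growth
```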

Learn an Effective Lip Reading Model without Pains

Fengdalu/learn-an-effective-lip-reading-model-without-pains 15 Nov 2020

Considering the non-negligible effects of these strategies and the difficulty of training an effective lip reading model, we perform, for the first time, a comprehensive quantitative study and comparative analysis to show the effects of several different choices for lip reading.

Contrastive Learning of Global-Local Video Representations

yunyikristy/global_local 7 Apr 2021

In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).
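As a hedged sketch of the kind of contrastive objective such global-local representation learning typically builds on, the InfoNCE-style loss below treats paired views (e.g., a clip-level "global" embedding and a pooled "local" spatio-temporal embedding of the same video) as positives and all other pairs in the batch as negatives. The temperature and feature dimensions are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE loss: each query is pulled toward its positive key
    (same index in the batch) and pushed away from all other keys.

    queries, keys: (batch, dim) projected features, e.g. a global clip-level
    embedding paired with a pooled local spatio-temporal embedding.
    """
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)   # positives sit on the diagonal

# Example: contrast a clip-level ("global") view with a pooled local view
global_feats = torch.randn(8, 128)
local_feats = torch.randn(8, 128)
loss = info_nce(global_feats, local_feats)
```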

Multi-Perspective LSTM for Joint Visual Representation Learning

arsm/MPLSTM CVPR 2021

We validate the performance of our proposed architecture in the context of two multi-perspective visual recognition tasks, namely lip reading and face recognition.

Selective Listening by Synchronizing Speech with Lips

zexupan/reentry 14 Jun 2021

A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech utterance, or an accompanying video track.

Visual Keyword Spotting with Attention

prajwalkr/transpotter 29 Oct 2021

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting.

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

lumia-group/leveraging-self-supervised-learning-for-avsr ACL 2022

In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding.
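A rough sketch of the hybrid CTC/seq2seq training objective mentioned above, assuming a PyTorch setup; the interpolation weight, the blank/padding index, and the tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                              input_lengths, target_lengths, ctc_weight=0.3):
    """Weighted combination of a CTC loss over the encoder output and a
    cross-entropy (seq2seq) loss over the attention decoder output.

    ctc_log_probs: (time, batch, vocab) log-probabilities from the encoder head
    att_logits:    (batch, target_len, vocab) logits from the attention decoder
    targets:       (batch, target_len) character indices (0 = blank/pad here)
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    ce = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=0)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Example shapes: 75 encoder frames, batch of 2, 40-character vocabulary, 20-token targets
ctc_log_probs = torch.randn(75, 2, 40).log_softmax(dim=-1)
att_logits = torch.randn(2, 20, 40)
targets = torch.randint(1, 40, (2, 20))
loss = hybrid_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                                 input_lengths=torch.full((2,), 75, dtype=torch.long),
                                 target_lengths=torch.full((2,), 20, dtype=torch.long))
```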

Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

ms-dot-k/Visual-Audio-Memory ICCV 2021

By learning the interrelationship through the associative bridge, the proposed bridging framework can obtain the target modal representations inside the memory network even when only the source modal input is available, providing rich information for downstream tasks.

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

ms-dot-k/Multi-head-Visual-Audio-Memory AAAI 2022

With the multi-head key memories, MVM extracts candidate audio features from the memory, which allows the lip reading model to consider which pronunciations the input lip movement could represent.
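A simplified sketch, under assumed dimensions and head/slot counts, of how a visual query could address multi-head key memories and read out audio-like features from a shared value memory; this illustrates the general lookup mechanism behind the two memory-based papers above, not the authors' exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadKeyValueMemory(nn.Module):
    """Illustrative multi-head key/value memory: a visual query attends over
    several key memories, and the resulting weights read out a shared value
    memory holding audio-like representations.
    """

    def __init__(self, dim: int = 512, slots: int = 112, heads: int = 4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(heads, slots, dim))   # per-head key memory
        self.values = nn.Parameter(torch.randn(slots, dim))        # shared value memory

    def forward(self, visual_query: torch.Tensor) -> torch.Tensor:
        # visual_query: (batch, time, dim) per-frame visual features
        # Addressing: scaled similarity of the query to each head's key slots
        attn = torch.einsum('btd,hsd->bhts', visual_query, self.keys)
        attn = F.softmax(attn / visual_query.size(-1) ** 0.5, dim=-1)
        # Read-out: weighted sum over value slots, then average across heads
        recalled = torch.einsum('bhts,sd->bhtd', attn, self.values)
        return recalled.mean(dim=1)

# Example: recall audio-like features for 29 frames of visual input
memory = MultiHeadKeyValueMemory()
audio_like = memory(torch.randn(2, 29, 512))  # (2, 29, 512)
```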