Lip Reading
46 papers with code • 3 benchmarks • 5 datasets
Lip reading is the task of inferring the speech content of a video from visual information alone, especially the lip movements. It has many important practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.
Source: Mutual Information Maximization for Effective Lip Reading
Most implemented papers
Seeing wake words: Audio-visual Keyword Spotting
The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.
Lip-reading with Densely Connected Temporal Convolutional Networks
In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words.
Learn an Effective Lip Reading Model without Pains
Given the non-negligible effects of these training strategies and the difficulty of training an effective lip reading model, we perform the first comprehensive quantitative study and comparative analysis of several different design choices for lip reading.
Contrastive Learning of Global-Local Video Representations
In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization).
Multi-Perspective LSTM for Joint Visual Representation Learning
We validate the performance of our proposed architecture on two multi-perspective visual recognition tasks, namely lip reading and face recognition.
Selective Listening by Synchronizing Speech with Lips
A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech utterance, or an accompanying video track.
Visual Keyword Spotting with Attention
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting.
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
In particular, the audio and visual front-ends are first trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding.
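The hybrid objective mentioned here is typically an interpolation of a CTC loss on the encoder's frame-level outputs and a cross-entropy loss on the attention decoder's per-character predictions. A minimal PyTorch sketch, assuming random dummy tensors in place of a real audio-visual front-end and a hypothetical interpolation weight `lam`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, N, C, S = 12, 2, 10, 4  # frames, batch size, classes (incl. blank=0), target length

# Frame-level log-probabilities, standing in for the fused audio-visual encoder output.
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# Dummy character targets (labels 1..C-1; index 0 is reserved as the CTC blank).
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC branch: alignment-free loss over the encoder's frame-level distribution.
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# Seq2seq branch: per-step decoder logits scored with cross-entropy.
dec_logits = torch.randn(N, S, C)  # stands in for attention-decoder outputs
att_loss = nn.CrossEntropyLoss()(dec_logits.reshape(N * S, C), targets.reshape(N * S))

lam = 0.3  # hypothetical interpolation weight between the two objectives
loss = lam * ctc_loss + (1 - lam) * att_loss
print(float(loss))
```

This is a sketch of the general hybrid CTC/attention recipe, not the paper's exact implementation; the weight, blank index, and decoder shapes are illustrative assumptions.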
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video
By learning the interrelationship through the associative bridge, the proposed bridging framework can obtain the target-modality representations inside the memory network even when given only the source-modality input, providing rich information for its downstream tasks.
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading
With the multi-head key memories, MVM extracts candidate audio features from the memory, allowing the lip reading model to consider which pronunciations the input lip movements could represent.