Lip reading is the task of inferring the speech content of a video using only visual information, especially lip movements. It has many practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech.

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker.

LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

The benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.

Lipreading using Temporal Convolutional Networks

We present results on the largest publicly available datasets for isolated word recognition in English and Mandarin, LRW and LRW-1000, respectively.

Combining Residual Networks with LSTMs for Lipreading

We propose an end-to-end deep learning architecture for word-level visual speech recognition.

Learn an Effective Lip Reading Model without Pains

Given the non-negligible effects of these strategies and the current difficulty of training an effective lip reading model, we perform, for the first time, a comprehensive quantitative study and comparative analysis of the effects of several different choices for lip reading.

Lip2AudSpec: Speech reconstruction from silent lip movements video

In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos.

Seeing wake words: Audio-visual Keyword Spotting

The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.

XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification

Our work improves on existing multimodal deep learning algorithms in two essential ways: (1) it presents a novel method for performing cross-modality (before features are learned from individual modalities) and (2) it extends the previously proposed cross-connections, which only transfer information between streams that process compatible data.

Mutual Information Maximization for Effective Lip Reading

By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.

