Lip Reading
43 papers with code • 3 benchmarks • 5 datasets
Lip Reading is the task of inferring speech content from a video using only visual information, especially lip movements. It has many important practical applications, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.
Source: Mutual Information Maximization for Effective Lip Reading
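At its core, a lip-reading model maps a sequence of mouth crops to text. Below is a minimal sketch of such a pipeline, assuming a PyTorch-style model with a 3D-convolutional front-end, a recurrent temporal model, and a CTC head; all layer choices and sizes are illustrative and not taken from any specific paper.

```python
# Minimal visual speech recognition sketch (illustrative, not any paper's exact model):
# a 3D-conv front-end over lip crops, a recurrent temporal model, and a CTC head.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # Spatio-temporal front-end: (B, 1, T, H, W) -> per-frame features
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool away spatial dims, keep time
        )
        self.temporal = nn.GRU(64, hidden, num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, 1, T, H, W) grayscale mouth crops
        feats = self.frontend(clips).squeeze(-1).squeeze(-1)  # (B, 64, T)
        feats, _ = self.temporal(feats.transpose(1, 2))       # (B, T, 2*hidden)
        return self.head(feats).log_softmax(-1)               # CTC log-probabilities

model = LipReader(vocab_size=40)
log_probs = model(torch.randn(2, 1, 75, 88, 88))  # 75 frames of 88x88 crops
print(log_probs.shape)  # (2, 75, 41)
```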
Latest papers
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism
Our method, which we call NEUral Text to ARticulate Talk (NEUTART), is a talking face generator that uses a joint audiovisual feature space, as well as speech-informed 3D facial reconstructions and a lip-reading loss for visual supervision.
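The lip-reading loss mentioned above can be sketched as follows: pass the generator's rendered mouth region through a frozen pretrained lip reader and penalize disagreement with the target transcript. The sketch below reuses the illustrative `LipReader` interface from the task description and scores the mismatch with a CTC loss; it is a generic illustration, not NEUTART's exact formulation.

```python
import torch
import torch.nn.functional as F

def lip_reading_loss(generated_mouth_crops, target_tokens, target_lengths, lip_reader):
    """Penalize generated frames whose lip movements a frozen lip reader cannot
    decode into the target text. All shapes and names are illustrative.

    generated_mouth_crops: (B, 1, T, H, W) mouth crops rendered from the generator
    target_tokens:         (B, L) token ids of the ground-truth transcript
    """
    lip_reader.eval()                       # keep the expert frozen
    for p in lip_reader.parameters():
        p.requires_grad_(False)
    log_probs = lip_reader(generated_mouth_crops)      # (B, T, V) CTC log-probs
    B, T, _ = log_probs.shape
    input_lengths = torch.full((B,), T, dtype=torch.long)
    # CTC aligns frame-level predictions with the transcript without frame labels
    return F.ctc_loss(log_probs.transpose(0, 1), target_tokens,
                      input_lengths, target_lengths, blank=0)
```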
Do VSR Models Generalize Beyond LRS3?
The Lip Reading Sentences-3 (LRS3) benchmark has been the primary focus of intense visual speech recognition (VSR) research in recent years.
Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
For deep layers, where both speaker characteristics and speech-content features are well expressed, we introduce speaker-adaptive features that learn to suppress speech-content-irrelevant noise for robust lip reading.
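A common way to realize such speaker-adaptive, per-unit contributions is an LHUC-style gate that learns one rescaling vector per speaker; the sketch below is a generic illustration under that assumption, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class SpeakerAdaptiveGate(nn.Module):
    """LHUC-style adaptation sketch: each speaker owns a vector of per-unit
    amplitudes in (0, 2) that rescales deep-layer features, letting the model
    suppress speaker-specific variation irrelevant to the speech content.
    (Illustrative; not the exact mechanism of the cited paper.)"""
    def __init__(self, num_speakers: int, num_units: int):
        super().__init__()
        # One learnable rescaling vector per speaker, initialized near identity
        self.r = nn.Embedding(num_speakers, num_units)
        nn.init.zeros_(self.r.weight)  # 2*sigmoid(0) = 1 -> starts as a no-op

    def forward(self, feats: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, num_units) deep-layer features; speaker_ids: (B,)
        amp = 2.0 * torch.sigmoid(self.r(speaker_ids)).unsqueeze(1)  # (B, 1, U)
        return feats * amp
```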
SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces
To enhance the visual accuracy of generated lip movements while reducing the dependence on labeled data, we propose SelfTalk, a novel framework that incorporates self-supervision into a cross-modal network system to learn 3D talking faces.
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
We demonstrate that OpenSR enables modality transfer from one modality to any other in three different settings (zero-, few-, and full-shot), and achieves highly competitive zero-shot performance compared to existing few-shot and full-shot lip-reading methods.
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
We then condition a diffusion model on the video and use the extracted text through a classifier-guidance mechanism, in which a pre-trained ASR serves as the classifier.
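Classifier guidance here means nudging each reverse-diffusion step with the gradient of the ASR's log-likelihood of the lip-read transcript with respect to the current noisy audio estimate. The sketch below assumes hypothetical `denoiser` and `asr_log_prob` interfaces and is not LipVoicer's exact update rule.

```python
import torch

def guided_denoise_step(x_t, t, video, text, denoiser, asr_log_prob, scale=1.0):
    """One classifier-guided reverse-diffusion step (generic sketch).
    `denoiser` predicts the mean/scale of x_{t-1} conditioned on the silent video;
    `asr_log_prob` is a hypothetical frozen ASR scoring log p(text | audio).
    """
    with torch.enable_grad():
        x_t = x_t.detach().requires_grad_(True)
        # Gradient of the ASR's log-likelihood of the lip-read transcript
        # with respect to the current noisy audio estimate.
        grad = torch.autograd.grad(asr_log_prob(x_t, text).sum(), x_t)[0]
    mean, sigma = denoiser(x_t, t, video)  # video-conditioned diffusion model
    # Shift the predicted mean toward audio that the ASR decodes as `text`.
    return mean + scale * sigma**2 * grad + sigma * torch.randn_like(mean)
```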
A Novel Interpretable and Generalizable Re-synchronization Model for Cued Speech based on a Multi-Cuer Corpus
Cued Speech (CS) is a multi-modal visual coding system combining lip reading with several hand cues at the phonetic level to make spoken language visible to the hearing-impaired.
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing the incorrect generation results.
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
However, although researchers have explored cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, cross-lingual studies of visual speech remain scarce.
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis
Generating photo-realistic video portraits driven by arbitrary speech audio is a crucial problem in film-making and virtual reality.