Audio-Visual Speech Recognition
27 papers with code • 3 benchmarks • 6 datasets
Audio-visual speech recognition (AVSR) is the task of transcribing a paired audio and visual stream, typically speech audio together with video of the speaker's lip movements, into text.
Most implemented papers
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
The enhanced audio features are fused with the visual features and taken to an encoder-decoder model composed of Conformer and Transformer for speech recognition.
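The fusion step described above can be sketched minimally. This is an illustrative example, not the paper's implementation: it assumes the two streams are frame-aligned and simply concatenates them along the feature dimension before they would be passed to an encoder-decoder (e.g. Conformer encoder plus Transformer decoder).

```python
import numpy as np

def fuse_av_features(audio_feats, visual_feats):
    """Fuse frame-aligned audio and visual features by concatenation.

    audio_feats:  (T, Da) array of (enhanced) audio features
    visual_feats: (T, Dv) array of visual/lip features
    Returns a (T, Da + Dv) fused sequence, ready for an
    encoder-decoder speech recognition model.
    """
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=-1)

# Hypothetical shapes: 100 frames of 80-dim audio and 512-dim visual features
fused = fuse_av_features(np.zeros((100, 80)), np.zeros((100, 512)))
print(fused.shape)  # (100, 592)
```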
Jointly Learning Visual and Auditory Speech Representations from Raw Data
We observe strong results in both low- and high-resource labelled-data settings when fine-tuning the visual and auditory encoders produced by a single pre-training stage in which the two encoders are trained jointly.
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
Inspired by how humans comprehend speech in a multi-modal manner, various audio-visual datasets have been constructed.
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation, providing 1200 hours of audio-visual speech in 9 languages.
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
We first show that, compared to uni-modal models, previous AVSR models are in fact not robust to corruption of the multimodal input streams (the audio and visual inputs).
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, resulting in sub-optimal multimodal representations for the downstream speech recognition task.
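The contrast drawn above, between plain concatenation and explicit cross-modal interaction, can be illustrated with a single-head cross-attention sketch. This is a generic illustration under assumed shapes, not the paper's method: audio frames act as queries attending over visual frames (keys/values), so each output frame mixes in the visual context most correlated with it rather than a fixed same-index frame.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, visual):
    """Single-head cross-attention from audio queries to visual keys/values.

    audio:  (Ta, D) audio features, visual: (Tv, D) visual features
    (a shared feature dimension D is assumed for simplicity).
    Returns (Ta, D) visually-informed audio features.
    """
    d = audio.shape[-1]
    scores = audio @ visual.T / np.sqrt(d)  # (Ta, Tv) scaled similarities
    weights = softmax(scores, axis=-1)      # attention over visual frames
    return weights @ visual                 # (Ta, D) weighted visual context

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 64))  # 100 audio frames
v = rng.standard_normal((25, 64))   # 25 visual frames (lower frame rate)
out = cross_modal_attention(a, v)
print(out.shape)  # (100, 64)
```

Note that, unlike concatenation, this formulation does not require the two streams to have the same length, which is convenient given the differing frame rates of audio and video.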
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
Audio-visual speech recognition (AVSR) gains increasing attention from researchers as an important part of human-computer interaction.
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
We demonstrate that OpenSR enables modality transfer from one to any in three different settings (zero-, few- and full-shot), and achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.