Speech Recognition
1079 papers with code • 314 benchmarks • 86 datasets
Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio, taking into account factors such as accents, speaking speed, and background noise.
(Image credit: SpecAugment)
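SpecAugment, credited above, augments training data by masking random frequency bands and time spans of the input spectrogram. A minimal NumPy sketch of that masking step (mask counts and widths here are illustrative defaults, not the paper's exact policy):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=1, freq_mask_width=8,
                 num_time_masks=1, time_mask_width=10, rng=None):
    """SpecAugment-style masking on a (freq_bins, time_steps) spectrogram.

    Zeroes out `num_freq_masks` random horizontal bands and
    `num_time_masks` random vertical spans. Returns a new array.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))   # band height
        f0 = int(rng.integers(0, max(1, n_freq - w)))   # band start bin
        spec[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))   # span length
        t0 = int(rng.integers(0, max(1, n_time - w)))   # span start frame
        spec[:, t0:t0 + w] = 0.0
    return spec
```

The full SpecAugment policy also includes time warping, omitted here for brevity.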
Latest papers
FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer
The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution.
SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages
Benchmarking plays a pivotal role in assessing and enhancing the performance of compact deep learning models designed for execution on resource-constrained devices, such as microcontrollers.
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation
In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation.
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason.
Language and Speech Technology for Central Kurdish Varieties
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties.
A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition
To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR).
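The 15% WER threshold above refers to word error rate: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the reference length. A self-contained sketch of the standard computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` is one deletion over six reference words, about 0.167.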
Multilingual Speech Models for Automatic Speech Recognition Exhibit Gender Performance Gaps
However, the advantaged group varies between languages.
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements.
HINT: High-quality INPainting Transformer with Mask-Aware Encoding and Enhanced Attention
In this paper, we propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, which consists of a novel mask-aware pixel-shuffle downsampling module (MPD) to preserve the visible information extracted from the corrupted image while maintaining the integrity of the information available for high-level inferences made within the model.
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity.
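The quadratic complexity noted above comes from the attention score matrix, whose size grows as the square of the sequence length n. A minimal NumPy sketch of scaled dot-product attention that makes the (n, n) intermediate explicit:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over (n, d) query/key/value matrices.

    The (n, n) score matrix below is the source of the O(n^2) time and
    memory cost in sequence length n that subquadratic alternatives avoid.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])          # shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v
```

Doubling n quadruples the score matrix, which is why long audio sequences motivate subquadratic operators such as Hyena.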