Speech Recognition
1096 papers with code • 234 benchmarks • 87 datasets
Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio, taking into account factors such as accents, speaking speed, and background noise.
( Image credit: SpecAugment )
Libraries
Use these libraries to find Speech Recognition models and implementationsDatasets
Subtasks
Latest papers
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity.
Careless Whisper: Speech-to-Text Hallucination Harms
We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group.
DeepCover: Advancing RNN Test Coverage and Online Error Prediction using State Machine Extraction
The proposed methodology along with its assessment metrics contribute to increasing explainability in RNN models by providing a clear representation of their internal decision making process through the extracted SM.
Streaming Sequence Transduction through Dynamic Compression
We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams.
On Speaker Attribution with SURT
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR).
Towards Event Extraction from Speech with Contextual Clues
While text-based event extraction has been an active research area and has seen successful application in many domains, extracting semantic events from speech directly is an under-explored problem.
TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion
TDANet serves as the architectural foundation for the auditory and visual networks within TDFNet, offering an efficient model with fewer parameters.
Word-Level ASR Quality Estimation for Efficient Corpus Sampling and Post-Editing through Analyzing Attentions of a Reference-Free Metric
The findings suggest that NoRefER is not merely a tool for error detection but also a comprehensive framework for enhancing ASR systems' transparency, efficiency, and effectiveness.
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER.
Cascaded Cross-Modal Transformer for Audio-Textual Classification
Subsequently, we combine language-specific Bidirectional Encoder Representations from Transformers (BERT) with Wav2Vec2. 0 audio features via a novel cascaded cross-modal transformer (CCMT).