Speaker Diarization
74 papers with code • 12 benchmarks • 11 datasets
Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. As the task is commonly defined, the goal is not to identify known speakers but to co-index segments attributed to the same speaker: diarization means finding speaker boundaries and grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. Combined with speech recognition, diarization enables speaker-attributed speech-to-text transcription.
Source: Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm
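The definition above can be illustrated with a toy sketch: given one embedding per speech segment, co-index segments by greedily merging each embedding into the first cluster whose running centroid is similar enough, opening a new cluster otherwise. This is only a minimal illustration, not any published system's method; the function name `diarize_greedy`, the 2-D embeddings, and the 0.8 cosine threshold are all illustrative assumptions (real pipelines use learned embeddings such as x-vectors and stronger clustering, e.g. agglomerative or spectral).

```python
# Toy greedy clustering of per-segment speaker embeddings.
# All names and numbers here are illustrative, not a real system's API.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diarize_greedy(embeddings, threshold=0.8):
    """Co-index segments: assign each embedding to the closest existing
    cluster whose running centroid exceeds `threshold` in cosine
    similarity, else open a new cluster. Returns one integer speaker
    label per segment; the number of distinct labels is the inferred
    number of speakers (the 'by-product' mentioned in the definition)."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(c, emb)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))   # new speaker cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1             # running-mean centroid update
            centroids[best] = [(c * (counts[best] - 1) + e) / counts[best]
                               for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Four segments from two speakers (clearly separated toy embeddings).
embeddings = [[1.0, 0.05], [0.02, 1.0], [0.98, 0.1], [0.0, 0.97]]
labels = diarize_greedy(embeddings)  # → [0, 1, 0, 1], i.e. 2 speakers
```

Real diarization systems replace both pieces: a neural encoder produces the embeddings, and the greedy thresholding is swapped for agglomerative or spectral clustering, as several of the papers listed below investigate.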
Libraries
Use these libraries to find Speaker Diarization models and implementations.
Latest papers with no code
Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning
In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates.
Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components.
Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data.
PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization.
Listening to Multi-talker Conversations: Modular and End-to-end Perspectives
For this, we describe the Streaming Unmixing and Recognition Transducer (SURT).
Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection
A channel-number invariant loss is proposed to learn a unique feature representation regardless of the number of available microphones.
The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models
In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount.
Spatial-Temporal Activity-Informed Diarization and Separation
The global spatial activity functions are computed from the global spatial coherence functions based on frequency-averaged local spatial activity functions.
Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization
Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications.
Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization
The proposed method can take audio-visual input and leverage the speaker's acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework.