Speaker Diarization
74 papers with code • 12 benchmarks • 11 datasets
Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. The way the task is commonly defined, the goal is not to identify known speakers, but to co-index segments that are attributed to the same speaker; in other words, diarization implies finding speaker boundaries and grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. In combination with speech recognition, diarization enables speaker-attributed speech-to-text transcription.
Source: Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm
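As a rough illustration of the task definition above, the sketch below co-indexes segments by greedily grouping per-segment speaker embeddings, so the number of speakers falls out as a by-product. It is a toy example under stated assumptions, not the method of any paper listed here: the 2-D embeddings, the segment times, the `co_index` helper, and the 0.8 cosine threshold are all hypothetical, and real systems use learned embeddings and more robust clustering.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def co_index(embeddings, threshold=0.8):
    """Give each segment an anonymous speaker index (0, 1, ...).

    Greedy online grouping: a segment joins the most similar existing
    speaker if the similarity reaches `threshold`, otherwise it opens
    a new speaker. The threshold is a hypothetical value.
    """
    centroids = []  # running sum of embeddings per speaker
    labels = []
    for emb in embeddings:
        scores = [cosine(emb, c) for c in centroids]
        if scores and max(scores) >= threshold:
            k = scores.index(max(scores))
            centroids[k] = [c + e for c, e in zip(centroids[k], emb)]
        else:
            k = len(centroids)
            centroids.append(list(emb))
        labels.append(k)
    return labels, len(centroids)


# Hypothetical segment boundaries (seconds) and toy 2-D embeddings.
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.0), (4.0, 6.0), (6.0, 7.0)]
embeddings = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9), (1.0, 0.0)]

labels, num_speakers = co_index(embeddings)
for (start, end), label in zip(segments, labels):
    print(f"{start:4.1f}-{end:4.1f}s  speaker_{label}")
print("distinct speakers:", num_speakers)
```

The labels are arbitrary indices rather than identities, which matches the definition: the output says which segments share a speaker, not who that speaker is.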
Latest papers
Long-term Conversation Analysis: Exploring Utility and Privacy
The analysis of conversations recorded in everyday life requires privacy protection.
Speech Emotion Diarization: Which Emotion Appears When?
Speech Emotion Recognition (SER) typically relies on utterance-level solutions.
Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks
In order to tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively.
Neural Diarization with Non-autoregressive Intermediate Attractors
The experiments with the two-speaker CALLHOME dataset show that the intermediate labels with the proposed non-autoregressive intermediate attractors boost the diarization performance.
TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization
Recently, end-to-end neural diarization (EEND) has been introduced and has achieved promising results in speaker-overlapped scenarios.
A Light Weight Model for Active Speaker Detection
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while the resource costs are significantly lower than the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x) and FLOPs (0.6G vs. 2.6G, about 4x).
VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.
BER: Balanced Error Rate For Speaker Diarization
DER is the primary metric for evaluating diarization performance, but it faces a dilemma: errors in short utterances or segments tend to be overwhelmed by those in longer ones.
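To make the dilemma concrete, here is a minimal frame-level sketch of diarization error rate (DER), computed as missed speech plus false alarm plus speaker confusion, divided by total reference speech. It is an illustration under simplifying assumptions, not the scoring used by any paper or toolkit listed here: frames are labeled with a single speaker or `None` for non-speech, there is no overlap and no collar, the speaker mapping is found by brute force, and the `frame_der` helper and the toy labels are hypothetical.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER for equal-length label sequences.

    Each frame holds a speaker label or None (non-speech).
    Assumes the reference contains at least one speech frame.
    """
    assert len(reference) == len(hypothesis)
    pairs = list(zip(reference, hypothesis))
    total_speech = sum(r is not None for r, _ in pairs)
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    # Pad with None so surplus hypothesis speakers can stay unmatched.
    padded = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))

    # Brute-force search for the one-to-one mapping with least confusion.
    best_confusion = None
    for perm in permutations(padded, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        confusion = sum(
            r is not None and h is not None and mapping[h] != r
            for r, h in pairs
        )
        if best_confusion is None or confusion < best_confusion:
            best_confusion = confusion

    return (missed + false_alarm + best_confusion) / total_speech


# Toy case: speaker B talks for only 10 of 100 frames, and the
# hypothesis attributes every frame to a single speaker.
reference = ["A"] * 90 + ["B"] * 10
hypothesis = ["x"] * 100
der = frame_der(reference, hypothesis)
print(f"DER = {der:.2f}")
```

In this toy case the short speaker is lost entirely, yet the score is only 10%, because the 90 correctly attributed frames of the long speaker dominate the denominator.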
On Out-of-Distribution Detection for Audio with Deep Nearest Neighbors
Out-of-distribution (OOD) detection is concerned with identifying data points that do not belong to the same distribution as the model's training data.
Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering
While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems.