Speaker Diarization

74 papers with code • 12 benchmarks • 11 datasets

Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. The way the task is commonly defined, the goal is not to identify known speakers, but to co-index segments that are attributed to the same speaker; in other words, diarization implies finding speaker boundaries and grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. In combination with speech recognition, diarization enables speaker-attributed speech-to-text transcription.

Source: Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm
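In most conventional systems this definition boils down to a two-stage recipe: extract a speaker embedding for each speech segment, then cluster the embeddings. The sketch below is a minimal illustration of the clustering stage using scikit-learn (1.2+ for the `metric` argument) with synthetic embeddings standing in for a real embedding extractor; it also shows how the number of speakers falls out as a by-product.

```python
# Minimal sketch of the clustering view of diarization: group per-segment
# speaker embeddings without knowing who the speakers are, and read off the
# number of speakers from the number of clusters. The embeddings below are
# synthetic stand-ins, not the output of any particular model.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# 12 speech segments from 3 speakers, each segment represented by a
# 128-dimensional speaker embedding (one centroid per speaker plus small noise).
segment_speakers = np.repeat(np.arange(3), 4)
centroids = rng.normal(size=(3, 128))
embeddings = centroids[segment_speakers] + 0.05 * rng.normal(size=(12, 128))

# Co-index segments by speaker; distance_threshold lets the clusterer also
# decide how many distinct speakers there are.
clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,
    metric="cosine",
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)

print("segment -> speaker label:", labels)
print("estimated number of speakers:", labels.max() + 1)
```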

Latest papers with no code

Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning

no code yet • 16 Apr 2024

In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates.
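The snippet gives no implementation details; purely as a loose illustration of the general idea (not the paper's method), the sketch below keeps speaker-change candidates only if they fall near low-energy "quasi-silence" frames, suppressing false change detections at the cost of possibly missing changes elsewhere.

```python
# Loose illustration (not the paper's method): restrict speaker-change
# hypotheses to regions near quasi-silences, i.e. low-energy frames.
import numpy as np

def quasi_silence_mask(frame_energy, quantile=0.2):
    """Mark frames whose energy falls below a low quantile as quasi-silence."""
    threshold = np.quantile(frame_energy, quantile)
    return frame_energy < threshold

def bias_change_points(candidates, silence_mask, max_gap=1):
    """Keep only candidate change frames within max_gap frames of a quasi-silence."""
    silent_frames = np.flatnonzero(silence_mask)
    if silent_frames.size == 0:
        return []
    return [c for c in candidates
            if np.min(np.abs(silent_frames - c)) <= max_gap]

# Toy per-frame energies and candidate change points from some upstream detector.
energy = np.array([0.9, 0.8, 0.1, 0.05, 0.7, 0.9, 0.85, 0.06, 0.1, 0.8])
candidates = [3, 5, 8]

print(bias_change_points(candidates, quasi_silence_mask(energy)))
# -> [3, 8]; the candidate at frame 5 is far from any quasi-silence and is dropped
```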

Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization

no code yet • 21 Mar 2024

Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components.
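For context, the sketch below shows the component under study in a generic form (spectral clustering over a cosine-affinity matrix, using scikit-learn and synthetic embeddings); it is not the paper's setup.

```python
# Generic sketch (not the paper's setup): spectral clustering of speaker
# embeddings using a cosine-similarity affinity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
speakers = np.array([0, 0, 1, 1, 0, 1])          # ground truth, for reference only
centroids = rng.normal(size=(2, 64))
embeddings = centroids[speakers] + 0.05 * rng.normal(size=(6, 64))

# Map cosine similarity from [-1, 1] to [0, 1] so the affinity is non-negative.
affinity = (cosine_similarity(embeddings) + 1.0) / 2.0

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)   # segments with the same label are attributed to the same speaker
```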

Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications

no code yet • 11 Mar 2024

Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data.

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

no code yet • 4 Mar 2024

A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization.

Listening to Multi-talker Conversations: Modular and End-to-end Perspectives

no code yet • 14 Feb 2024

For this, we describe the Streaming Unmixing and Recognition Transducer (SURT).

Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection

no code yet • 13 Feb 2024

A channel-number invariant loss is proposed to learn a unique feature representation regardless of the number of available microphones.
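The loss itself is not spelled out in the snippet; as a hypothetical illustration of what channel-count invariance can mean in practice (not the paper's loss), the sketch below penalizes the difference between a representation pooled over all microphones and one pooled over a random subset.

```python
# Hypothetical illustration (not the paper's loss): penalize the gap between a
# representation pooled over all channels and one pooled over a random subset,
# pushing the feature to be stable w.r.t. the number of microphones.
import numpy as np

def pooled_representation(features):
    """features: (channels, frames, dim) -> channel-count-invariant (frames, dim)."""
    return features.mean(axis=0)

def channel_invariance_penalty(features, n_subset, rng):
    """Mean squared difference between all-channel and subset-channel pooling."""
    full = pooled_representation(features)
    subset_idx = rng.choice(features.shape[0], size=n_subset, replace=False)
    subset = pooled_representation(features[subset_idx])
    return float(np.mean((subset - full) ** 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 100, 32))   # 6 microphones, 100 frames, 32-dim features
print(channel_invariance_penalty(features, n_subset=3, rng=rng))
```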

The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

no code yet • 12 Feb 2024

In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount.

Spatial-Temporal Activity-Informed Diarization and Separation

no code yet • 30 Jan 2024

The global spatial activity functions are computed from the global spatial coherence functions based on frequency-averaged local spatial activity functions.

Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization

no code yet • 23 Jan 2024

Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications.

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

no code yet • 16 Jan 2024

The proposed method can take audio-visual input and leverage the speaker's acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework.