Speaker Diarization

74 papers with code • 12 benchmarks • 11 datasets

Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. The way the task is commonly defined, the goal is not to identify known speakers, but to co-index segments that are attributed to the same speaker; in other words, diarization implies finding speaker boundaries and grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. In combination with speech recognition, diarization enables speaker-attributed speech-to-text transcription.

Source: Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm
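In most conventional systems this definition boils down to a two-stage recipe: extract a speaker embedding for each speech segment, then cluster the embeddings. The sketch below is a minimal illustration of the clustering stage using scikit-learn (1.2+ for the `metric` argument) with synthetic embeddings standing in for a real embedding extractor; it also shows how the number of speakers falls out as a by-product.

```python
# Minimal sketch of the clustering view of diarization: group per-segment
# speaker embeddings without knowing who the speakers are, and read off the
# number of speakers from the number of clusters. The embeddings below are
# synthetic stand-ins, not the output of any particular model.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# 12 speech segments from 3 speakers, each segment represented by a
# 128-dimensional speaker embedding (one centroid per speaker plus small noise).
segment_speakers = np.repeat(np.arange(3), 4)
centroids = rng.normal(size=(3, 128))
embeddings = centroids[segment_speakers] + 0.05 * rng.normal(size=(12, 128))

# Co-index segments by speaker; distance_threshold lets the clusterer also
# decide how many distinct speakers there are.
clusterer = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,
    metric="cosine",
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)

print("segment -> speaker label:", labels)
print("estimated number of speakers:", labels.max() + 1)
```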

Latest papers with no code

Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning

no code yet • 16 Apr 2024

In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates.
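The snippet gives no implementation details; purely as a loose illustration of the general idea (not the paper's method), the sketch below keeps speaker-change candidates only if they fall near low-energy "quasi-silence" frames, suppressing false change detections at the cost of possibly missing changes elsewhere.

```python
# Loose illustration (not the paper's method): restrict speaker-change
# hypotheses to regions near quasi-silences, i.e. low-energy frames.
import numpy as np

def quasi_silence_mask(frame_energy, quantile=0.2):
    """Mark frames whose energy falls below a low quantile as quasi-silence."""
    threshold = np.quantile(frame_energy, quantile)
    return frame_energy < threshold

def bias_change_points(candidates, silence_mask, max_gap=1):
    """Keep only candidate change frames within max_gap frames of a quasi-silence."""
    silent_frames = np.flatnonzero(silence_mask)
    if silent_frames.size == 0:
        return []
    return [c for c in candidates
            if np.min(np.abs(silent_frames - c)) <= max_gap]

# Toy per-frame energies and candidate change points from some upstream detector.
energy = np.array([0.9, 0.8, 0.1, 0.05, 0.7, 0.9, 0.85, 0.06, 0.1, 0.8])
candidates = [3, 5, 8]

print(bias_change_points(candidates, quasi_silence_mask(energy)))
# -> [3, 8]; the candidate at frame 5 is far from any quasi-silence and is dropped
```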

Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization

no code yet • 21 Mar 2024

Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components.
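For context, the sketch below shows the component under study in a generic form (spectral clustering over a cosine-affinity matrix, using scikit-learn and synthetic embeddings); it is not the paper's setup.

```python
# Generic sketch (not the paper's setup): spectral clustering of speaker
# embeddings using a cosine-similarity affinity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
speakers = np.array([0, 0, 1, 1, 0, 1])          # ground truth, for reference only
centroids = rng.normal(size=(2, 64))
embeddings = centroids[speakers] + 0.05 * rng.normal(size=(6, 64))

# Map cosine similarity from [-1, 1] to [0, 1] so the affinity is non-negative.
affinity = (cosine_similarity(embeddings) + 1.0) / 2.0

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)   # segments with the same label are attributed to the same speaker
```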

Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications

no code yet • 11 Mar 2024

Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data.

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

no code yet • 4 Mar 2024

A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization.

Listening to Multi-talker Conversations: Modular and End-to-end Perspectives

no code yet • 14 Feb 2024

For this, we describe the Streaming Unmixing and Recognition Transducer (SURT).

Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection

no code yet • 13 Feb 2024

A channel-number invariant loss is proposed to learn a unique feature representation regardless of the number of available microphones.
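The loss itself is not spelled out in the snippet; as a hypothetical illustration of what channel-count invariance can mean in practice (not the paper's loss), the sketch below penalizes the difference between a representation pooled over all microphones and one pooled over a random subset.

```python
# Hypothetical illustration (not the paper's loss): penalize the gap between a
# representation pooled over all channels and one pooled over a random subset,
# pushing the feature to be stable w.r.t. the number of microphones.
import numpy as np

def pooled_representation(features):
    """features: (channels, frames, dim) -> channel-count-invariant (frames, dim)."""
    return features.mean(axis=0)

def channel_invariance_penalty(features, n_subset, rng):
    """Mean squared difference between all-channel and subset-channel pooling."""
    full = pooled_representation(features)
    subset_idx = rng.choice(features.shape[0], size=n_subset, replace=False)
    subset = pooled_representation(features[subset_idx])
    return float(np.mean((subset - full) ** 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 100, 32))   # 6 microphones, 100 frames, 32-dim features
print(channel_invariance_penalty(features, n_subset=3, rng=rng))
```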

The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

no code yet • 12 Feb 2024

In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount.

Spatial-Temporal Activity-Informed Diarization and Separation

no code yet • 30 Jan 2024

The global spatial activity functions are computed from the global spatial coherence functions based on frequency-averaged local spatial activity functions.

Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization

no code yet • 23 Jan 2024

Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications.

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

no code yet • 16 Jan 2024

The proposed method can take audio-visual input and leverage the speaker's acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework.