Speaker Diarization is the task of segmenting and co-indexing audio recordings by speaker. The way the task is commonly defined, the goal is not to identify known speakers, but to co-index segments that are attributed to the same speaker; in other words, diarization implies finding speaker boundaries and grouping segments that belong to the same speaker, and, as a by-product, determining the number of distinct speakers. In combination with speech recognition, diarization enables speaker-attributed speech-to-text transcription.
The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.
Our method was even better than that of the state-of-the-art x-vector clustering-based method.
To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduces a permutation-free objective function to directly minimize diarization errors without being suffered from the speaker-label permutation problem.
Recently, a fully supervised speaker diarization approach was proposed (UIS-RNN) which models speakers using multiple instances of a parameter-sharing recurrent neural network.
Phoneme boundary detection plays an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc.
In this study, we propose a new spectral clustering framework that can auto-tune the parameters of the clustering algorithm in the context of speaker diarization.
More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction.
However, the clustering-based approach has a number of problems; i. e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting their speaker embedding models to real audio recordings with speaker overlaps.