Speech Separation
96 papers with code • 18 benchmarks • 16 datasets
Speech Separation is the task of extracting all overlapping speech sources from a given mixed speech signal. It is a special case of the source separation problem in which the focus is solely on the overlapping speech sources; other interference, such as music or noise, is not the main concern.
Source: A Unified Framework for Speech Separation
Image credit: Speech Separation of A Target Speaker Based on Deep Neural Networks
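To make the evaluation side of the task concrete, below is a minimal, dependency-free sketch of the scale-invariant signal-to-noise ratio (SI-SNR), a metric commonly used to score separated speech against a reference source. The function name and toy signals here are illustrative, not taken from any specific paper on this page:

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference source."""
    # Remove the mean from both signals (zero-mean is assumed by SI-SNR).
    est = [e - sum(estimate) / len(estimate) for e in estimate]
    tgt = [t - sum(target) / len(target) for t in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # The residual is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise) + eps
    return 10 * math.log10(num / den + eps)
```

A perfect (or merely rescaled) estimate yields a very high SI-SNR, while a noisy estimate scores lower; separation systems are typically trained to maximize this quantity.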
Libraries
Use these libraries to find Speech Separation models and implementations

Latest papers with no code
Robust Active Speaker Detection in Noisy Environments
Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments.
PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization.
Probing Self-supervised Learning Models with Target Speech Extraction
TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation.
Mixture to Mixture: Leveraging Close-talk Mixtures as Weak-supervision for Speech Separation
We propose mixture to mixture (M2M) training, a weakly-supervised neural speech separation algorithm that leverages close-talk mixtures as a weak supervision for training discriminative models to separate far-field mixtures.
Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor
We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers.
Resource-constrained stereo singing voice cancellation
We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix.
Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization
The proposed method can take audio-visual input and leverage the speaker's acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework.
Hyperbolic Distance-Based Speech Separation
In this work, we explore the task of hierarchical distance-based speech separation defined on a hyperbolic manifold.
Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments
Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal.
Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation
Despite its success, previous studies showed that PIT is plagued by excessive label-assignment switching in adjacent epochs, impeding the model from learning better label assignments.
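The permutation invariant training (PIT) objective referenced above scores every assignment of estimated sources to reference sources and takes the minimum, which is why the assignment can switch between epochs. A minimal pure-Python sketch of that idea (function names and the MSE pairwise loss are illustrative choices, not the paper's implementation):

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, targets, pairwise_loss=mse):
    """Minimum loss over all estimate-to-target assignments, plus the winning permutation."""
    best, best_perm = None, None
    for perm in permutations(range(len(targets))):
        total = sum(pairwise_loss(estimates[i], targets[p])
                    for i, p in enumerate(perm))
        if best is None or total < best:
            best, best_perm = total, perm
    return best / len(targets), best_perm
```

Because the winning permutation is recomputed on every batch, two similar epochs can settle on different assignments, which is the switching behavior the paper above aims to reduce.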