Speech Separation
97 papers with code • 18 benchmarks • 16 datasets
Speech Separation is the task of extracting all overlapping speech sources from a given mixed speech signal. It is a special case of the source separation problem in which the focus is only on the overlapping speech sources; other interference, such as music or noise signals, is not the main concern.
Source: A Unified Framework for Speech Separation
Image credit: Speech Separation of A Target Speaker Based on Deep Neural Networks
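To make the setup concrete, here is a minimal sketch (not tied to any paper above) of the separation problem: two synthetic "speakers" are summed into one mixture, and estimates are scored with scale-invariant SNR (SI-SNR), a metric widely used to evaluate separation quality. The sine-tone signals simply stand in for speech.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB, a standard separation metric."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to isolate the matched component.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

# Two synthetic "speakers" (sine tones stand in for speech).
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)
s2 = np.sin(2 * np.pi * 330 * t)
mixture = s1 + s2  # the observed overlapped signal

# A separator must recover s1 and s2 from the mixture. Compare the
# trivial "estimate = mixture" baseline against the oracle estimate.
print(si_snr(mixture, s1))  # low: the mixture still contains s2
print(si_snr(s1, s1))       # very high: a perfect estimate
```

A trained separation model would replace the trivial baseline here, mapping `mixture` to per-speaker estimates that maximize SI-SNR against each reference.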
Libraries
Use these libraries to find Speech Separation models and implementations

Latest papers with no code
Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
For an effective fusion of the two modalities for diffusion, we also propose a cross-attention-based feature fusion mechanism.
Real-time Speech Enhancement and Separation with a Unified Deep Neural Network for Single/Dual Talker Scenarios
Unlike existing solutions, which focus on modifying the loss function to accommodate zero-energy target signals, the proposed approach circumvents this problem by training the model to extract speech on both of its output channels regardless of whether the input is a single- or dual-talker mixture.
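For context, multi-talker models are commonly trained with permutation-invariant training (PIT), which scores every speaker-to-channel assignment and keeps the best one; SI-SNR-style PIT losses become ill-defined when a target channel is silent (zero energy), which is the situation the paper above sidesteps. This is a generic PIT sketch with an MSE objective, not the paper's method:

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Permutation-invariant MSE over output channels:
    evaluate every speaker-to-channel assignment, keep the best."""
    best = np.inf
    for perm in itertools.permutations(range(len(targets))):
        mse = np.mean([np.mean((estimates[i] - targets[p]) ** 2)
                       for i, p in enumerate(perm)])
        best = min(best, mse)
    return best

t = np.linspace(0, 1, 1000)
a, b = np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 9 * t)
# Even with the output channels swapped, PIT finds the matching assignment.
print(pit_mse([b, a], [a, b]))  # 0.0: perfect under the best permutation
```

An MSE objective stays finite for a zero-energy target, whereas an SI-SNR loss would divide by the silent target's energy, which motivates the loss modifications the paper avoids.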
A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction
We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting.
GASS: Generalizing Audio Source Separation with Large-scale Data
Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset.
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset.
Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition
This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation.
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs.
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
Recent research has made significant progress in designing fusion modules for audio-visual speech separation.
Improving Deep Attractor Network by BGRU and GMM for Speech Separation
The Deep Attractor Network (DANet), which uses Bidirectional Long Short-Term Memory (BLSTM), is a state-of-the-art technique in the speech separation field, but the complexity of the DANet model is very high.
Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model
The cocktail party problem is the scenario in which it is difficult to separate or distinguish individual speakers in mixed speech from several speakers.