SepFormer is a Transformer-based neural network for speech separation. It learns short- and long-term dependencies with a multi-scale approach that employs Transformers, and it is mainly composed of multi-head attention and feed-forward layers. It adopts the dual-path framework introduced by DPRNN, replacing the RNNs with a multi-scale pipeline of Transformers. Because the Transformers in this framework process smaller chunks rather than the full sequence, the dual-path design mitigates the quadratic complexity of self-attention.
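The complexity argument above can be illustrated with a rough count of attention pairs. The sketch below is not taken from the SepFormer code: the function names `chunk_indices` and `attention_pairs`, the sequence length, and the chunk and hop sizes are all assumptions chosen for illustration. It splits a sequence into overlapping chunks, then compares the cost of attending over the whole sequence with the cost of attending within chunks plus across chunks.

```python
import math


def chunk_indices(seq_len, chunk_size, hop):
    """Return (start, end) pairs for overlapping chunks that cover seq_len frames.

    The last chunk may run past seq_len; a real implementation would zero-pad.
    """
    if seq_len <= chunk_size:
        return [(0, chunk_size)]
    n_chunks = math.ceil((seq_len - chunk_size) / hop) + 1
    return [(i * hop, i * hop + chunk_size) for i in range(n_chunks)]


def attention_pairs(seq_len, chunk_size, hop):
    """Count query-key pairs for full attention versus dual-path attention."""
    n_chunks = len(chunk_indices(seq_len, chunk_size, hop))
    full = seq_len ** 2                    # every frame attends to every frame
    intra = n_chunks * chunk_size ** 2     # attention inside each chunk
    inter = chunk_size * n_chunks ** 2     # attention across chunks, per position
    return full, intra + inter


# Hypothetical sizes: 16000 encoded frames, chunks of 250 frames, 50% overlap.
full_cost, dual_cost = attention_pairs(seq_len=16000, chunk_size=250, hop=125)
print(f"full attention pairs:      {full_cost}")
print(f"dual-path attention pairs: {dual_cost}")
print(f"reduction factor:          {full_cost / dual_cost:.1f}x")
```

The count ignores constant factors such as the number of heads and layers, so it shows only how the two costs scale with sequence length, not actual runtime.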
The model is based on the learned-domain masking approach and employs an encoder, a decoder, and a masking network, as shown in the figure. The encoder is fully convolutional, while the masking network employs two Transformers embedded inside the dual-path processing block. The decoder finally reconstructs the separated signals in the time domain by using the masks predicted by the masking network.
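The data flow of learned-domain masking can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the SepFormer implementation: the filters are random rather than learned, the sizes are hypothetical, and `placeholder_masks` is a stand-in that gives every speaker an equal share where the real model would run the dual-path Transformer masking network.

```python
import random


def encode(signal, filters, stride):
    """Strided 1-D convolution followed by ReLU: waveform -> representation."""
    kernel = len(filters[0])
    frames = []
    for start in range(0, len(signal) - kernel + 1, stride):
        window = signal[start:start + kernel]
        frames.append(
            [max(0.0, sum(w * x for w, x in zip(f, window))) for f in filters]
        )
    return frames  # shape: [n_frames][n_filters]


def placeholder_masks(frames, n_speakers):
    """Stand-in for the masking network (assumption): equal share per speaker."""
    share = 1.0 / n_speakers
    return [[[share] * len(frame) for frame in frames] for _ in range(n_speakers)]


def decode(frames, filters, stride):
    """Transposed 1-D convolution (overlap-add): representation -> waveform."""
    kernel = len(filters[0])
    out = [0.0] * ((len(frames) - 1) * stride + kernel)
    for t, frame in enumerate(frames):
        for value, f in zip(frame, filters):
            for k in range(kernel):
                out[t * stride + k] += value * f[k]
    return out


def separate(mixture, enc_filters, dec_filters, stride, n_speakers):
    """Encode the mixture, mask it once per speaker, decode each masked copy."""
    h = encode(mixture, enc_filters, stride)
    sources = []
    for mask in placeholder_masks(h, n_speakers):
        masked = [[m * v for m, v in zip(m_row, h_row)]
                  for m_row, h_row in zip(mask, h)]
        sources.append(decode(masked, dec_filters, stride))
    return sources


# Hypothetical sizes: 4 filters, kernel 16, stride 8, 88-sample mixture.
random.seed(0)
KERNEL, STRIDE, N_FILTERS = 16, 8, 4
enc_filters = [[random.uniform(-1, 1) for _ in range(KERNEL)] for _ in range(N_FILTERS)]
dec_filters = [[random.uniform(-1, 1) for _ in range(KERNEL)] for _ in range(N_FILTERS)]
mixture = [random.uniform(-1, 1) for _ in range(88)]

sources = separate(mixture, enc_filters, dec_filters, STRIDE, n_speakers=2)
print(len(sources), "estimated sources, each", len(sources[0]), "samples long")
```

Each estimated source comes from the same encoded mixture multiplied by its own mask, so all of the separation ability lies in how the masks are predicted; the encoder and decoder only move between the waveform and the learned representation.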
Source: Attention is All You Need in Speech Separation

Task | Papers | Share |
---|---|---|
Speech Separation | 7 | 43.75% |
Speech Enhancement | 2 | 12.50% |
Speech Extraction | 1 | 6.25% |
Audio Source Separation | 1 | 6.25% |
Generalization Bounds | 1 | 6.25% |
Multi-Speaker Source Separation | 1 | 6.25% |
Speaker Verification | 1 | 6.25% |
Target Speaker Extraction | 1 | 6.25% |
Denoising | 1 | 6.25% |