La Furca: Iterative Context-Aware End-to-End Monaural Speech Separation Based on Dual-Path Deep Parallel Inter-Intra Bi-LSTM with Attention

23 Jan 2020 · Ziqiang Shi, Rujie Liu, Jiqing Han

Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual}. In this paper, we propose several improvements to dual-path BiLSTM based networks for end-to-end monaural speech separation: 1) a dual-path network with intra-parallel BiLSTM and inter-parallel BiLSTM components, 2) a global context-aware inter-intra cross-parallel BiLSTM, 3) a local context-aware network with attention BiLSTM, and 4) a multiple spiral iterative refinement dual-path BiLSTM (this method is also called PitchFork). All of these networks take the mixed utterance of two speakers and map it to two separated utterances, each containing only one speaker's voice. For the objective, we propose to train the network by directly optimizing utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix corpus yield an SDR improvement of 19.86 dB, a PESQ of 3.63, and an ESTOI of 94.2\%, which shows that the proposed networks improve performance on the speaker separation task. We have open-sourced our re-implementation of DPRNN-TasNet at https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation, and our `La Furca' is built on this implementation, so we believe the results in this paper can be readily reproduced.
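The training objective described above, utterance-level SDR evaluated in a PIT style, can be illustrated with a minimal NumPy sketch. It is not taken from the paper or its repository. The function names `sdr_db` and `pit_sdr`, the `eps` stabilizer, and the plain energy-ratio form of SDR are assumptions made for illustration, and the paper's exact SDR definition may differ.

```python
import itertools

import numpy as np


def sdr_db(reference, estimate, eps=1e-8):
    """Utterance-level SDR in dB, assumed here to be
    10 * log10(||s||^2 / ||s - s_hat||^2); eps avoids division by zero."""
    distortion = reference - estimate
    ratio = (np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)


def pit_sdr(references, estimates):
    """Permutation invariant SDR (hypothetical helper).

    references, estimates: arrays of shape (num_speakers, num_samples).
    Tries every assignment of estimated outputs to reference speakers and
    returns (best mean SDR in dB, best permutation as a tuple).
    """
    num_speakers = references.shape[0]
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # perm[i] is the index of the estimate matched to reference i.
        score = np.mean(
            [sdr_db(references[i], estimates[p]) for i, p in enumerate(perm)]
        )
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.standard_normal((2, 16000))
    # Estimates arrive in swapped speaker order with a little added noise.
    ests = refs[::-1] + 0.01 * rng.standard_normal((2, 16000))
    score, perm = pit_sdr(refs, ests)
    print(f"best mean SDR: {score:.2f} dB, permutation: {perm}")
```

In training, the negative of the best-permutation SDR would typically serve as the loss, computed in a framework with automatic differentiation rather than NumPy. Exhaustive search over permutations is cheap for the two-speaker case considered here.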
