Contrastive Learning based Deep Latent Masking for Music Source Separation

Interspeech 2023  ·  Jihyun Kim, Hong-Goo Kang

Recent studies on music source separation have extended its applicability to generic audio signals. Real-time music source separation is necessary for services such as custom equalizers, and for enhancing live streams with diverse audio effects. However, most prior methods are unsuitable for real-time use due to their high computational complexity, large memory usage, or long latency. To overcome these problems, we propose a Wave-U-Net-style music source separation network that applies high-dimensional masking to deep latent-domain features. We also introduce a contrastive learning technique that estimates a salient latent-space embedding for each target source using a masking-based approach. The performance of our proposed model is evaluated on the MUSDB18HQ dataset against several baselines. The experiments confirm that our proposed model is capable of real-time processing and outperforms existing models.
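The two ideas in the abstract, masking applied directly to latent features and a contrastive objective that shapes per-source embeddings, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the sigmoid gate, the InfoNCE-style loss form, and all function names below are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def latent_mask_separate(latent, mask_logits):
    """Apply a high-dimensional (per-element) mask in the latent domain.

    A sigmoid gate is assumed here purely for illustration; the paper's
    actual masking function may differ.
    """
    mask = 1.0 / (1.0 + np.exp(-mask_logits))  # values in (0, 1)
    return mask * latent


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss on cosine similarity.

    Pulls `anchor` (e.g. a masked latent embedding) toward `positive`
    (the target source's embedding) and pushes it away from each row of
    `negatives` (embeddings of the other sources).
    """
    anchor = l2_normalize(anchor)
    positive = l2_normalize(positive)
    negatives = l2_normalize(negatives)
    logits = np.concatenate([[anchor @ positive], negatives @ anchor]) / temperature
    # Negative log-softmax probability of the positive (index 0), stabilized.
    m = logits.max()
    log_denom = m + np.log(np.sum(np.exp(logits - m)))
    return log_denom - logits[0]
```

In a separation network these functions would operate on encoder activations rather than raw vectors: a well-aligned anchor/positive pair drives the loss toward zero, while a mismatched pair leaves it near the chance level of log(1 + number of negatives).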


Results from the Paper


Task                     Dataset   Model   Metric Name    Metric Value   Global Rank
Music Source Separation  MUSDB18   DLMNet  SDR (vocals)   6.91           # 17
                                           SDR (drums)    7.05           # 14
                                           SDR (other)    4.62           # 17
                                           SDR (bass)     7.29           # 10
                                           SDR (avg)      6.47           # 13
