TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE (dB)	GLOBAL RANK
Speech Separation	LRS2	RTFS-Net-12	SI-SNRi	14.9	# 5
Speech Separation	LRS2	RTFS-Net-12	SDRi	15.1	# 3
Speech Separation	LRS2	RTFS-Net-4	SI-SNRi	14.1	# 2
Speech Separation	LRS2	RTFS-Net-4	SDRi	14.3	# 5
Speech Separation	LRS2	RTFS-Net-6	SI-SNRi	14.6	# 4
Speech Separation	LRS2	RTFS-Net-6	SDRi	14.8	# 4
Speech Separation	LRS3	RTFS-Net-4	SI-SNRi	15.5	# 1
Speech Separation	LRS3	RTFS-Net-4	SDRi	15.6	# 3
Speech Separation	LRS3	RTFS-Net-6	SI-SNRi	16.9	# 2
Speech Separation	LRS3	RTFS-Net-6	SDRi	17.1	# 2
Speech Separation	LRS3	RTFS-Net-12	SI-SNRi	17.5	# 4
Speech Separation	LRS3	RTFS-Net-12	SDRi	17.6	# 1
Speech Separation	VoxCeleb2	RTFS-Net-12	SI-SNRi	12.4	# 4
Speech Separation	VoxCeleb2	RTFS-Net-12	SDRi	13.6	# 1
Speech Separation	VoxCeleb2	RTFS-Net-6	SI-SNRi	11.8	# 2
Speech Separation	VoxCeleb2	RTFS-Net-6	SDRi	12.8	# 2
Speech Separation	VoxCeleb2	RTFS-Net-4	SI-SNRi	11.5	# 1
Speech Separation	VoxCeleb2	RTFS-Net-4	SDRi	12.4	# 3
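
The table reports SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture, in dB. The following is a minimal pure-Python sketch of the commonly used definition, not the evaluation code from the paper's repository; the function names `si_snr` and `si_snr_improvement`, the `eps` stabiliser, and the synthetic sine signals are all assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR (dB) between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean from both signals.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; the residual counts as noise.
    dot = sum(a * b for a, b in zip(e, t))
    scale = dot / (sum(b * b for b in t) + eps)
    projection = [scale * b for b in t]
    noise = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Hypothetical example: two sine "speakers" of equal energy.
target = [math.sin(2 * math.pi * 5 * k / 200) for k in range(200)]
interferer = [math.sin(2 * math.pi * 13 * k / 200) for k in range(200)]
mixture = [a + b for a, b in zip(target, interferer)]
# A made-up separator output that attenuates the interferer by a factor of 10.
estimate = [a + 0.1 * b for a, b in zip(target, interferer)]

improvement = si_snr_improvement(estimate, mixture, target)  # roughly 20 dB
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.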

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rtfs-net-recurrent-time-frequency-modelling/speech-separation-on-lrs3)](https://paperswithcode.com/sota/speech-separation-on-lrs3?p=rtfs-net-recurrent-time-frequency-modelling)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rtfs-net-recurrent-time-frequency-modelling/speech-separation-on-voxceleb2)](https://paperswithcode.com/sota/speech-separation-on-voxceleb2?p=rtfs-net-recurrent-time-frequency-modelling)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rtfs-net-recurrent-time-frequency-modelling/speech-separation-on-lrs2)](https://paperswithcode.com/sota/speech-separation-on-lrs2?p=rtfs-net-recurrent-time-frequency-modelling)`

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

29 Sep 2023 · Samuel Pegg, Kai Li, Xiaolin Hu

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
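
To make the idea of modelling the time and frequency dimensions independently more concrete, here is a toy pure-Python sketch. It is not the RTFS-Net architecture: an exponential moving average (`ema_scan`) stands in for a learned RNN, the input is a small real-valued grid rather than complex STFT features, and the frequency-then-time ordering in `dual_axis_block` is an assumption. All names are hypothetical.

```python
def ema_scan(seq, alpha=0.5):
    """Toy recurrence h[i] = alpha * x[i] + (1 - alpha) * h[i - 1], with h[-1] = 0.

    Stands in for a learned recurrent layer purely for illustration.
    """
    h = 0.0
    out = []
    for x in seq:
        h = alpha * x + (1.0 - alpha) * h
        out.append(h)
    return out


def scan_frequency_axis(spec):
    """Treat every time frame spec[t] as a sequence over frequency bins."""
    return [ema_scan(frame) for frame in spec]


def scan_time_axis(spec):
    """Treat every frequency bin as a sequence over time frames."""
    # Transpose to (F, T), scan each bin over time, then transpose back.
    columns = [ema_scan(list(col)) for col in zip(*spec)]
    return [list(row) for row in zip(*columns)]


def dual_axis_block(spec):
    """Apply the recurrence along frequency, then along time (assumed order)."""
    return scan_time_axis(scan_frequency_axis(spec))


# A 2 x 2 grid indexed as spec[time][frequency].
spec = [[1.0, 0.0],
        [0.0, 0.0]]
out = dual_axis_block(spec)
```

Each scan only ever sees one axis at a time, so the sequence length handled by the recurrence is the number of frequency bins or the number of frames, never their product.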

Code

spkgyk/RTFS-Net official

Tasks

Audio-Visual Speech Recognition

speech-recognition

Speech Recognition

Speech Separation

Target Speaker Extraction

Datasets

VoxCeleb2

LRS2

VVAD-LRS3

Results from the Paper

Ranked #1 on Speech Separation on VoxCeleb2

Methods

SPEED
