TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion
Audio-visual speech separation has gained significant traction in recent years due to its potential applications in fields such as speech recognition, diarization, scene analysis, and assistive technologies. Designing a lightweight audio-visual speech separation network is important for low-latency applications, but existing methods often trade higher computational cost and larger parameter counts for better separation performance. In this paper, we present Top-Down-Fusion Net (TDFNet), a state-of-the-art (SOTA) audio-visual speech separation model that builds upon the architecture of TDANet, an audio-only speech separation method. TDANet serves as the architectural foundation for both the auditory and visual networks within TDFNet, yielding an efficient model with fewer parameters. On the LRS2-2Mix dataset, TDFNet improves on the previous SOTA method, CTCNet, by up to 10% across all performance metrics. Remarkably, these results are achieved with fewer parameters and only 28% of the multiply-accumulate operations (MACs) of CTCNet. In essence, our method offers a highly effective and efficient solution to audio-visual speech separation, making significant strides in harnessing visual information optimally.
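To make the "top-down fusion" idea concrete, the sketch below shows one common pattern for it: a coarse, top-level feature summary is upsampled to the fine temporal resolution and used to gate the fine-grained features. This is a minimal illustrative example in NumPy under our own assumptions (the function name `topdown_fuse`, the shapes, and the sigmoid-gating choice are all hypothetical), not the actual TDFNet operator.

```python
import numpy as np

def topdown_fuse(fine, coarse):
    """Fuse a coarse (top-level) feature map into a finer one.

    Hypothetical illustration of top-down fusion: the coarse features
    are upsampled to the fine temporal resolution and applied as a
    multiplicative sigmoid gate. This is NOT the exact TDFNet operator.
    fine:   array of shape (channels, T)
    coarse: array of shape (channels, T // stride)
    """
    stride = fine.shape[1] // coarse.shape[1]
    up = np.repeat(coarse, stride, axis=1)    # nearest-neighbour upsample
    gate = 1.0 / (1.0 + np.exp(-up))          # sigmoid gating in [0, 1]
    return fine * gate

# Toy usage: 4-channel audio features over 8 frames, gated by a
# top-level summary computed at 1/4 temporal resolution.
audio = np.random.randn(4, 8)
top = np.random.randn(4, 2)
fused = topdown_fuse(audio, top)
print(fused.shape)  # (4, 8)
```

The gate keeps the fine features' resolution unchanged while letting the coarse, semantically richer representation modulate which fine-grained activations pass through.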
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Speech Separation | LRS2 | TDFNet-small | SI-SNRi | 13.6 | # 1 |
| Speech Separation | LRS2 | TDFNet-small | SDRi | 13.7 | # 6 |
| Speech Separation | LRS2 | TDFNet-small | PESQ | 3.10 | # 3 |
| Speech Separation | LRS2 | TDFNet-small | STOI | 0.931 | # 3 |
| Speech Separation | LRS2 | TDFNet-large | SI-SNRi | 15.8 | # 7 |
| Speech Separation | LRS2 | TDFNet-large | SDRi | 15.9 | # 1 |
| Speech Separation | LRS2 | TDFNet-large | PESQ | 3.21 | # 1 |
| Speech Separation | LRS2 | TDFNet-large | STOI | 0.949 | # 1 |
| Speech Separation | LRS2 | TDFNet (MHSA + Shared) | SI-SNRi | 15.0 | # 6 |
| Speech Separation | LRS2 | TDFNet (MHSA + Shared) | SDRi | 15.2 | # 2 |
| Speech Separation | LRS2 | TDFNet (MHSA + Shared) | PESQ | 3.16 | # 2 |
| Speech Separation | LRS2 | TDFNet (MHSA + Shared) | STOI | 0.938 | # 2 |