TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Node Classification	AVA	ASDNet [ASDNet_ICCV2021]	mAP	93.5	# 1
Node Classification	AVA	TalkNet [tao2021someone]	mAP	92.3	# 2
Node Classification	AVA	UniCon [zhang2021unicon]	mAP	92	# 3
Node Classification	AVA	MAAS-TAN [MAAS2021]	mAP	88.8	# 4
Audio-Visual Active Speaker Detection	AVA-ActiveSpeaker	SPELL+	validation mean average precision	94.9%	# 1
Audio-Visual Active Speaker Detection	AVA-ActiveSpeaker	SPELL	validation mean average precision	94.2%	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-long-term-spatial-temporal-graphs/node-classification-on-ava)](https://paperswithcode.com/sota/node-classification-on-ava?p=learning-long-term-spatial-temporal-graphs)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/learning-long-term-spatial-temporal-graphs/audio-visual-active-speaker-detection-on-ava)](https://paperswithcode.com/sota/audio-visual-active-speaker-detection-on-ava?p=learning-long-term-spatial-temporal-graphs)`

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

15 Jul 2022 · Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

Active speaker detection (ASD) in videos with multiple speakers is challenging because it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve active speaker detection performance owing to their explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL
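
The graph construction described in the abstract (one node per person per frame, temporal edges linking the same person across frames, spatial edges linking people within a frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` type, the `build_graph` function, and the `max_gap` parameter (how many frames apart two nodes of the same person may be and still be linked) are hypothetical names introduced here for illustration, and the audio-visual features and the graph neural network that classifies each node are left out entirely.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    frame: int      # frame index in the video
    person_id: str  # identity label (hypothetical, e.g. from a face tracker)


def _edge(i, j):
    """Store undirected edges with the smaller node index first."""
    return (i, j) if i < j else (j, i)


def build_graph(detections, max_gap=1):
    """Build an undirected spatial-temporal graph.

    Returns (nodes, edges), where nodes is a list with one Detection per
    person per frame and edges is a set of (i, j) node-index pairs.
    max_gap is an assumed parameter: two nodes of the same person are
    linked when their frames are at most max_gap apart.
    """
    nodes = list(detections)
    edges = set()

    by_frame = defaultdict(list)
    by_person = defaultdict(list)
    for idx, det in enumerate(nodes):
        by_frame[det.frame].append(idx)
        by_person[det.person_id].append(idx)

    # Spatial edges: connect every pair of people visible in the same frame.
    for members in by_frame.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                edges.add(_edge(members[a], members[b]))

    # Temporal edges: connect nodes of the same person across nearby frames.
    for members in by_person.values():
        members.sort(key=lambda i: nodes[i].frame)
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                gap = nodes[members[b]].frame - nodes[members[a]].frame
                if gap > max_gap:
                    break  # later nodes are even further away in time
                edges.add(_edge(members[a], members[b]))

    return nodes, edges


# Toy input: persons A and B in frames 0 and 1, person A alone in frame 2.
detections = [
    Detection(0, "A"), Detection(0, "B"),
    Detection(1, "A"), Detection(1, "B"),
    Detection(2, "A"),
]
nodes, edges = build_graph(detections, max_gap=1)
```

With this toy input the graph has five nodes. Because each node is only linked to its temporal and spatial neighbours, the edge count grows far more slowly than in a fully connected graph, and ASD then amounts to predicting a speaking or not-speaking label for each node.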

Code

sra2/spell official

kylemin/SPELL official

Tasks

Audio-Visual Active Speaker Detection

Graph Learning

Node Classification

Datasets

AVA

AVA-ActiveSpeaker

Aesthetic Visual Analysis

Results from the Paper

Ranked #1 on Node Classification on AVA

