Relational Self-Attention: What's Missing in Attention for Video Understanding

Convolution has arguably been the most important feature transform in modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to an era of dynamic feature transforms. Existing dynamic transforms, including self-attention, however, remain limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving state-of-the-art results on standard motion-centric benchmarks for video action recognition such as Something-Something V1 & V2, Diving-48, and FineGym.
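To make the contrast concrete, the sketch below compares a standard self-attention kernel (softmax over query-key dot products) with a schematic "relational" kernel generated from pairwise relational features. This is an illustrative NumPy sketch of the general idea only, not the paper's exact RSA formulation; the shapes, the projection `Wr`, and the elementwise-product relational features are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 6, 8  # T: flattened space-time positions, C: channels
x = rng.standard_normal((T, C))

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Standard self-attention: the dynamic kernel depends only on
# pairwise query-key similarity (a dot product).
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(C))  # (T, T) attention kernel
out_sa = attn @ v

# Schematic relational variant (illustrative, not the paper's exact RSA):
# the kernel is generated from richer pairwise *relational features*
# (here, an elementwise product of feature pairs) projected by a
# hypothetical learned weight Wr, instead of a scalar similarity alone.
Wr = rng.standard_normal((C, 1))
rel = x[:, None, :] * x[None, :, :]           # (T, T, C) relational features
rel_kernel = softmax((rel @ Wr).squeeze(-1))  # (T, T) relational kernel
out_rsa = rel_kernel @ v                      # aggregate relational context
```

Both transforms are dynamic (the kernel is computed from the input rather than stored as fixed weights); the relational variant differs in that its kernel is a function of structured pairwise relations rather than a single similarity score.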

PDF Abstract (NeurIPS 2021)

Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
---- | ------- | ----- | ------ | ----- | -----------
Action Recognition | Diving-48 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Accuracy | 84.2 | #11
Action Recognition | Something-Something V1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 51.9 | #45
 | | | Top-5 Accuracy | 79.6 | #26
Action Recognition | Something-Something V1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 54.0 | #33
 | | | Top-5 Accuracy | 81.1 | #23
Action Recognition | Something-Something V1 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 55.5 | #24
 | | | Top-5 Accuracy | 82.6 | #17
Action Recognition | Something-Something V1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy | 56.1 | #22
 | | | Top-5 Accuracy | 82.8 | #15
Action Recognition | Something-Something V2 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 64.8 | #91
 | | | Top-5 Accuracy | 89.1 | #73
Action Recognition | Something-Something V2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 66.0 | #82
 | | | Top-5 Accuracy | 89.8 | #63
Action Recognition | Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 67.3 | #65
 | | | Top-5 Accuracy | 90.8 | #49
Action Recognition | Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy | 67.7 | #60
 | | | Top-5 Accuracy | 91.1 | #43
