Relational Self-Attention: What's Missing in Attention for Video Understanding

Convolution has arguably been the most important feature transform in modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to an era of dynamic feature transforms. Existing dynamic transforms, including self-attention, however, remain limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving state-of-the-art results on standard motion-centric benchmarks for video action recognition such as Something-Something V1 & V2, Diving-48, and FineGym.
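To make the contrast concrete, the sketch below compares a standard self-attention kernel (softmax over query-key dot products) with a schematic "relational" kernel generated from pairwise relational features. This is an illustrative NumPy sketch of the general idea only, not the paper's exact RSA formulation; the shapes, the projection `Wr`, and the elementwise-product relational features are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 6, 8  # T: flattened space-time positions, C: channels
x = rng.standard_normal((T, C))

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Standard self-attention: the dynamic kernel depends only on
# pairwise query-key similarity (a dot product).
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(C))  # (T, T) attention kernel
out_sa = attn @ v

# Schematic relational variant (illustrative, not the paper's exact RSA):
# the kernel is generated from richer pairwise *relational features*
# (here, an elementwise product of feature pairs) projected by a
# hypothetical learned weight Wr, instead of a scalar similarity alone.
Wr = rng.standard_normal((C, 1))
rel = x[:, None, :] * x[None, :, :]           # (T, T, C) relational features
rel_kernel = softmax((rel @ Wr).squeeze(-1))  # (T, T) relational kernel
out_rsa = rel_kernel @ v                      # aggregate relational context
```

Both transforms are dynamic (the kernel is computed from the input rather than stored as fixed weights); the relational variant differs in that its kernel is a function of structured pairwise relations rather than a single similarity score.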

PDF Abstract (NeurIPS 2021)

Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
---- | ------- | ----- | ------ | ----- | -----------
Action Recognition | Diving-48 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Accuracy | 84.2 | #11
Action Recognition | Something-Something V1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 51.9 | #45
 | | | Top-5 Accuracy | 79.6 | #26
Action Recognition | Something-Something V1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 54.0 | #33
 | | | Top-5 Accuracy | 81.1 | #23
Action Recognition | Something-Something V1 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 55.5 | #24
 | | | Top-5 Accuracy | 82.6 | #17
Action Recognition | Something-Something V1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy | 56.1 | #22
 | | | Top-5 Accuracy | 82.8 | #15
Action Recognition | Something-Something V2 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 64.8 | #91
 | | | Top-5 Accuracy | 89.1 | #73
Action Recognition | Something-Something V2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 66.0 | #82
 | | | Top-5 Accuracy | 89.8 | #63
Action Recognition | Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 67.3 | #65
 | | | Top-5 Accuracy | 90.8 | #49
Action Recognition | Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy | 67.7 | #60
 | | | Top-5 Accuracy | 91.1 | #43
