CAST: Cross-Attention in Space and Time for Video Action Recognition

NeurIPS 2023 · DongHo Lee, Jongseo Lee, Jinwoo Choi

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
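To make the idea of a bottleneck cross-attention exchange between two experts more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `bottleneck_cross_attention`, the parameter names, the single attention head, the omission of normalization layers, and all dimensions are assumptions chosen for illustration. It shows only the general pattern the abstract describes: tokens from one expert are projected down to a narrow width, attend to tokens from the other expert, and are projected back up and added as a residual.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(rng, dim, bottleneck_dim, scale=0.02):
    """Random weights for one direction of exchange (hypothetical layout)."""
    def w(rows, cols):
        return rng.normal(0.0, scale, size=(rows, cols))

    return {
        "w_down_q": w(dim, bottleneck_dim),   # narrows the receiving stream
        "w_down_c": w(dim, bottleneck_dim),   # narrows the providing stream
        "w_q": w(bottleneck_dim, bottleneck_dim),
        "w_k": w(bottleneck_dim, bottleneck_dim),
        "w_v": w(bottleneck_dim, bottleneck_dim),
        "w_up": w(bottleneck_dim, dim),       # restores the original width
    }


def bottleneck_cross_attention(queries, context, params):
    """Update `queries` (Nq, D) with information from `context` (Nk, D)."""
    # Down-project both streams so attention runs at the bottleneck width.
    q_low = queries @ params["w_down_q"]
    c_low = context @ params["w_down_c"]

    # Queries come from the receiving expert, keys/values from the other one.
    q = q_low @ params["w_q"]
    k = c_low @ params["w_k"]
    v = c_low @ params["w_v"]

    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Nq, Nk)
    attn = softmax(scores, axis=-1)           # each row sums to 1
    mixed = attn @ v                          # (Nq, bottleneck_dim)

    # Up-project and add as a residual so the expert's own features are kept.
    return queries + mixed @ params["w_up"]


rng = np.random.default_rng(0)
dim, bottleneck_dim = 16, 4                   # illustrative sizes only
spatial_tokens = rng.normal(size=(8, dim))    # stand-in for spatial expert features
temporal_tokens = rng.normal(size=(12, dim))  # stand-in for temporal expert features

params_s = init_params(rng, dim, bottleneck_dim)  # temporal -> spatial direction
params_t = init_params(rng, dim, bottleneck_dim)  # spatial -> temporal direction

# Exchange in both directions; each stream keeps its own token count and width.
new_spatial = bottleneck_cross_attention(spatial_tokens, temporal_tokens, params_s)
new_temporal = bottleneck_cross_attention(temporal_tokens, spatial_tokens, params_t)
```

Running attention at the narrow width keeps the added parameters and compute small relative to the experts themselves, and the residual connection means each stream's own representation is preserved when the exchange contributes little.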

Task                   Dataset                 Model      Metric Name     Metric Value  Global Rank
Action Recognition     EPIC-KITCHENS-100       CAST-B/16  Action@1        49.3          #7
Action Recognition     EPIC-KITCHENS-100       CAST-B/16  Verb@1          72.5          #2
Action Recognition     EPIC-KITCHENS-100       CAST-B/16  Noun@1          60.9          #8
Action Classification  Kinetics-400            CAST-B/16  Acc@1           85.3          #49
Action Recognition     Something-Something V2  CAST-B/16  Top-1 Accuracy  71.6          #26
