In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANNs) still struggle to classify them. In the real world, atomic actions often connect temporally to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases and unordered set-based latent representations. In this paper we propose deep set conditioned I3D (SCI3D), a two-stream relational network that employs a latent representation of state and a visual representation for reasoning over events and actions. The network learns to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition over an I3D-NL baseline on the CATER dataset.
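
The abstract credits part of the gain to unordered set-based latent representations. As a rough illustration of what such a representation generally involves, the sketch below implements a Deep Sets style encoder, rho(sum of phi(x_i)), in plain Python. It is not the SCI3D architecture: the function names (`make_linear`, `linear_relu`, `deep_set_encode`), the layer sizes, the random weights and the toy `object_states` are all assumptions made for illustration only.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Build a random (n_out x n_in) weight matrix with zero bias (illustrative init)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear_relu(layer, x):
    """Apply one linear layer followed by ReLU to a single vector."""
    weights, bias = layer
    return [max(0.0, math.fsum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def deep_set_encode(elements, phi, rho):
    """Encode a set as rho(sum_i phi(x_i)); sum pooling ignores element order."""
    embedded = [linear_relu(phi, x) for x in elements]
    # math.fsum returns a correctly rounded sum, so the pooled vector does not
    # depend on the order in which the elements are listed.
    pooled = [math.fsum(column) for column in zip(*embedded)]
    return linear_relu(rho, pooled)


rng = random.Random(0)
phi = make_linear(3, 8, rng)   # per-element embedding (hypothetical sizes)
rho = make_linear(8, 4, rng)   # set-level readout (hypothetical sizes)

# Toy per-object state vectors; real inputs would come from the video.
object_states = [[0.1, 0.5, 0.0], [0.9, 0.2, 1.0], [0.4, 0.4, 0.3]]

code_a = deep_set_encode(object_states, phi, rho)
code_b = deep_set_encode(list(reversed(object_states)), phi, rho)
print(code_a)
print(code_b)
```

Both calls receive the same objects in a different order and produce the same code, which is the property that makes a set-based representation suitable when the objects in a scene have no natural ordering.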

International Joint 2022

Results from the Paper


 Ranked #1 on Atomic action recognition on CATER (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Atomic action recognition | CATER | SCI3D | Average-mAP | 96.77 | # 1 |
| Atomic action recognition | CATER | R3D-NL | Average-mAP | 95.28 | # 2 |
| Atomic action recognition | CATER | Single stream SCI3D | Average-mAP | 91.82 | # 3 |
| Atomic action recognition | CATER | FasterRCNN | Average-mAP | 63.85 | # 4 |
| Composite action recognition | CATER | Single stream SCI3D | Average-mAP | 69.76 | # 1 |
| Composite action recognition | CATER | SCI3D | Average-mAP | 66.71 | # 2 |
| Composite action recognition | CATER | R3D-NL | Average-mAP | 52.19 | # 3 |
| Composite action recognition | CATER | FasterRCNN | Average-mAP | 25.45 | # 4 |
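
The gains quoted in the abstract can be checked against the leaderboard figures above. The short sketch below does the arithmetic, assuming that the R3D-NL rows are the non-local baseline the abstract refers to as I3D-NL. The `results` dictionary and the helper `best_gain_over` are illustrative names, not part of any released code.

```python
# Average-mAP values copied from the CATER results listed above.
results = {
    "Atomic action recognition": {
        "SCI3D": 96.77,
        "R3D-NL": 95.28,
        "Single stream SCI3D": 91.82,
        "FasterRCNN": 63.85,
    },
    "Composite action recognition": {
        "Single stream SCI3D": 69.76,
        "SCI3D": 66.71,
        "R3D-NL": 52.19,
        "FasterRCNN": 25.45,
    },
}


def best_gain_over(results, baseline):
    """For each task, return the top-scoring model and its gain over the baseline."""
    gains = {}
    for task, scores in results.items():
        best_model = max(scores, key=scores.get)
        gains[task] = (best_model, round(scores[best_model] - scores[baseline], 2))
    return gains


gains = best_gain_over(results, "R3D-NL")
for task, (model, gain) in gains.items():
    print(f"{task}: {model} +{gain} mAP over R3D-NL")
```

Under that assumption, the 1.49 mAP atomic gain comes from the two-stream SCI3D row, while the 17.57 mAP composite gain comes from the single stream SCI3D row.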
