VPN: Learning Video-Pose Embedding for Activities of Daily Living

In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying over time. Therefore, ADL may look very similar and often necessitate looking at their fine-grained details to distinguish them. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone that exploits the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms state-of-the-art results for action classification on a large-scale human activity dataset (NTU-RGB+D 120), its subset (NTU-RGB+D 60), a challenging real-world human activity dataset (Toyota Smarthome), and a small-scale human-object interaction dataset (Northwestern-UCLA).
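
The sketch below illustrates, in simplified form, the two components named in the abstract: a spatial embedding that projects pose and RGB features into a common space, and a pose-driven coupler that produces attention weights used to modulate the RGB features. All layer names, dimensions, and the simple MLP pose encoder are assumptions for illustration only; they are not the authors' exact architecture (the paper's pose backbone exploits body topology).

```python
# Minimal, hypothetical sketch of the VPN idea (not the official implementation).
import torch
import torch.nn as nn


class VPNSketch(nn.Module):
    def __init__(self, rgb_channels=512, pose_dim=75, embed_dim=256, num_classes=60):
        super().__init__()
        # Stand-in pose backbone: MLP over flattened 3D joints per frame
        # (the paper uses a learnable backbone exploiting human-body topology).
        self.pose_encoder = nn.Sequential(
            nn.Linear(pose_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Spatial embedding: project RGB feature maps into the common semantic space.
        self.rgb_proj = nn.Conv3d(rgb_channels, embed_dim, kernel_size=1)
        # Coupler: pose embedding -> attention weights applied to the RGB stream.
        self.coupler = nn.Linear(embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, rgb_feat, pose):
        # rgb_feat: (B, C, T, H, W) features from a 3D ConvNet backbone
        # pose:     (B, T, pose_dim) flattened 3D joint coordinates per frame
        rgb_emb = self.rgb_proj(rgb_feat)                 # (B, D, T, H, W)
        pose_emb = self.pose_encoder(pose).mean(dim=1)    # (B, D), temporal pooling
        attn = torch.sigmoid(self.coupler(pose_emb))      # (B, D) attention weights
        attn = attn[:, :, None, None, None]               # broadcast over T, H, W
        attended = rgb_emb * attn                         # pose-guided modulation
        pooled = attended.mean(dim=[2, 3, 4])             # global spatio-temporal pool
        return self.classifier(pooled)


# Example usage with dummy tensors (8 frames, 7x7 feature grid, 25 joints x 3 coords).
if __name__ == "__main__":
    model = VPNSketch()
    rgb_feat = torch.randn(2, 512, 8, 7, 7)
    pose = torch.randn(2, 8, 75)
    print(model(rgb_feat, pose).shape)  # torch.Size([2, 60])
```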

ECCV 2020

Results from the Paper


Ranked #6 on Action Classification on Toyota Smarthome dataset (using extra training data)

Task | Dataset | Model | Metric | Value | Global Rank
Action Recognition | NTU RGB+D | VPN (RGB + Pose) | Accuracy (CS) | 95.5 | #6
Action Recognition | NTU RGB+D | VPN (RGB + Pose) | Accuracy (CV) | 98.0 | #7
Action Recognition | NTU RGB+D 120 | VPN (RGB + Pose) | Accuracy (Cross-Subject) | 87.8 | #11
Action Recognition | NTU RGB+D 120 | VPN (RGB + Pose) | Accuracy (Cross-Setup) | 86.3 | #12
Skeleton Based Action Recognition | NTU RGB+D 120 | VPN | Accuracy (Cross-Subject) | 86.3 | #30
Skeleton Based Action Recognition | NTU RGB+D 120 | VPN | Accuracy (Cross-Setup) | 87.8 | #32
Skeleton Based Action Recognition | N-UCLA | VPN (RGB + Pose) | Accuracy | 93.5 | #13
Action Classification | Toyota Smarthome dataset | VPN (RGB + Pose) | CS | 60.8 | #6
Action Classification | Toyota Smarthome dataset | VPN (RGB + Pose) | CV1 | 43.8 | #2
Action Classification | Toyota Smarthome dataset | VPN (RGB + Pose) | CV2 | 53.5 | #4

Methods


No methods listed for this paper.