Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. Coupling these two proposals, we develop a powerful feature extractor named MS-G3D, based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
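The two proposals above can be illustrated with a minimal NumPy sketch (not the authors' released code; function names are illustrative). Disentangled multi-scale aggregation replaces adjacency powers A^k, which are biased toward nearby joints via redundant shorter walks, with k-adjacency matrices that keep an edge only when two joints are exactly k hops apart. The G3D window adjacency then tiles the spatial adjacency into a block matrix over a window of tau frames, so every joint gets a direct cross-spacetime edge to its spatial neighbors in every frame of the window:

```python
import numpy as np

def k_adjacency(A, k):
    """Disentangled k-hop adjacency: [A_k]_{ij} = 1 iff the shortest-path
    distance between joints i and j is exactly k. Computed as the set
    difference between 'reachable within k hops' and 'reachable within
    k-1 hops' (self-loops included via A + I)."""
    n = A.shape[0]
    I = np.eye(n)
    if k == 0:
        return I
    reach_k = np.linalg.matrix_power(A + I, k) > 0      # distance <= k
    reach_km1 = np.linalg.matrix_power(A + I, k - 1) > 0  # distance <= k-1
    return (reach_k & ~reach_km1).astype(float)

def g3d_window_adjacency(A_k, tau):
    """Cross-spacetime adjacency for a window of tau frames: tiling the
    self-loop-augmented spatial adjacency into a tau x tau block matrix
    connects each joint directly to its spatial neighbors in every frame
    of the window (the dense cross-spacetime skip connections)."""
    n = A_k.shape[0]
    return np.tile(A_k + np.eye(n), (tau, tau))

# Toy skeleton: a chain of 4 joints, 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A2 = k_adjacency(A, 2)              # only the distance-2 pairs survive
W = g3d_window_adjacency(A2, 3)     # (3*4) x (3*4) spatial-temporal graph
```

In a full model, one such window adjacency per scale k would be normalized and used inside a graph convolution over the stacked features of the tau frames; this sketch only shows the graph construction.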

CVPR 2020
Task                               Dataset            Model       Metric                    Value   Global Rank
3D Action Recognition              Assembly101        MS-G3D      Actions Top-1             28.7    # 4
3D Action Recognition              Assembly101        MS-G3D      Verbs Top-1               65.7    # 2
3D Action Recognition              Assembly101        MS-G3D      Object Top-1              36.3    # 4
Skeleton Based Action Recognition  Kinetics-Skeleton  MS-G3D      Accuracy                  38.0    # 9
Skeleton Based Action Recognition  NTU RGB+D          MS-G3D Net  Accuracy (CV)             96.2    # 33
Skeleton Based Action Recognition  NTU RGB+D          MS-G3D Net  Accuracy (CS)             91.5    # 27
Skeleton Based Action Recognition  NTU RGB+D 120      MS-G3D Net  Accuracy (Cross-Subject)  86.9%   # 27
Skeleton Based Action Recognition  NTU RGB+D 120      MS-G3D Net  Accuracy (Cross-Setup)    88.4%   # 27