Continual Spatio-Temporal Graph Convolutional Networks

21 Mar 2022  ·  Lukas Hedegaard, Negar Heidari, Alexandros Iosifidis

Graph-based reasoning over skeleton data has emerged as a promising approach for human action recognition. However, the application of prior graph-based methods, which predominantly employ whole temporal sequences as their input, to the setting of online inference entails considerable computational redundancy. In this paper, we tackle this issue by reformulating the Spatio-Temporal Graph Convolutional Neural Network as a Continual Inference Network, which can perform step-by-step predictions in time without repeat frame processing. To evaluate our method, we create a continual version of ST-GCN, CoST-GCN, alongside two derived methods with different self-attention mechanisms, CoAGCN and CoS-TR. We investigate weight transfer strategies and architectural modifications for inference acceleration, and perform experiments on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets. Retaining similar predictive accuracy, we observe up to 109x reduction in time complexity, on-hardware accelerations of 26x, and reductions in maximum allocated memory of 52% during online inference.
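The redundancy described in the abstract can be illustrated with a minimal sketch of a single temporal convolution, assuming per-frame feature vectors and a kernel spanning the last k frames. This is not the authors' implementation: `full_sequence_conv`, `ContinualTemporalConv`, and all shapes are hypothetical names and values chosen for illustration. The whole-clip version recomputes every output each time it is called, while the step-wise version caches the last k frames and computes only the newest output.

```python
from collections import deque

import numpy as np


def full_sequence_conv(sequence, weights):
    """Temporal convolution over a whole clip.

    sequence: (T, c_in) array, weights: (k, c_in, c_out) array.
    Returns a (T - k + 1, c_out) array, one output per full window.
    """
    k = weights.shape[0]
    return np.stack([
        np.einsum("kc,kco->o", sequence[t:t + k], weights)
        for t in range(len(sequence) - k + 1)
    ])


class ContinualTemporalConv:
    """Step-wise variant (hypothetical helper): one frame in, at most one output out."""

    def __init__(self, weights):
        self.weights = weights
        # Keep only the k most recent frames; older frames drop out automatically.
        self.buffer = deque(maxlen=weights.shape[0])

    def step(self, frame):
        self.buffer.append(frame)
        if len(self.buffer) < self.buffer.maxlen:
            return None  # still warming up, no full window yet
        window = np.stack(list(self.buffer))  # (k, c_in)
        return np.einsum("kc,kco->o", window, self.weights)


rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4, 2))  # k=3, c_in=4, c_out=2 (arbitrary sizes)
clip = rng.normal(size=(10, 4))       # 10 frames of 4 features each

layer = ContinualTemporalConv(weights)
stepwise = []
for frame in clip:
    out = layer.step(frame)
    if out is not None:
        stepwise.append(out)
stepwise = np.stack(stepwise)

full = full_sequence_conv(clip, weights)
print(np.allclose(stepwise, full))
```

Both paths produce the same outputs for the same frames. In an online setting, the whole-clip function would be called again on every new frame, whereas `step` performs a single window computation per frame.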

Results from the Paper


Task: Skeleton Based Action Recognition

Dataset: Kinetics-Skeleton dataset

Model | Accuracy (Global Rank) | GFLOPS per prediction (Global Rank)
CoAGCN (2-stream) | - | 0.36 (# 7)
CoS-TR (2-stream) | 32.7 (# 30) | 0.31 (# 9)
CoS-TR (1-stream) | 29.7 (# 37) | -
CoAGCN (1-stream) | 33 (# 29) | 0.18 (# 13)
CoST-GCN (2-stream) | 33.1 (# 28) | 0.32 (# 8)
CoST-GCN (1-stream) | 31.8 (# 33) | 0.16 (# 14)
CoS-TR* (2-stream) | 29.9 (# 36) | 0.22 (# 11)
CoS-TR* (1-stream) | 27.4 (# 39) | 0.11 (# 16)
CoAGCN* (2-stream) | 27.5 (# 38) | 0.25 (# 10)
CoAGCN* (1-stream) | 23.3 (# 40) | 0.12 (# 15)
CoST-GCN* (2-stream) | 32.2 (# 31) | 0.22 (# 11)
CoST-GCN* (1-stream) | 30.2 (# 35) | 0.11 (# 16)
S-TR (2-stream) | 34.7 (# 22) | 23.24 (# 3)
S-TR (1-stream) | 32 (# 32) | 11.62 (# 6)
AGCN (2-stream) | 36.9 (# 16) | 26.91 (# 1)
AGCN (1-stream) | 35 (# 19) | 13.45 (# 4)
ST-GCN (2-stream) | 34.4 (# 23) | 24.09 (# 2)
ST-GCN (1-stream) | 33.4 (# 27) | 12.04 (# 5)

Dataset: NTU RGB+D

Model | Accuracy (CV) (Global Rank) | Accuracy (CS) (Global Rank) | GFLOPs per pred (Global Rank)
CoS-TR* (2-stream) | 94.8 (# 52) | 88.9 (# 53) | 0.3 (# 4)
CoST-GCN* (2-stream) | 95 (# 48) | 88.3 (# 56) | 0.32 (# 3)
CoST-GCN* | 93.8 (# 59) | 86.3 (# 69) | 0.16 (# 5)
CoS-TR* | 92.4 (# 76) | 86.3 (# 69) | 0.15 (# 6)
CoAGCN* (2-stream) | 93.1 (# 71) | 86.0 (# 72) | 0.44 (# 2)
ST-GCN | 93.4 (# 64) | 86 (# 72) | 16.73 (# 1)
CoAGCN* | 92.6 (# 75) | 84.1 (# 84) | -

Dataset: NTU RGB+D 120

Model | Accuracy (Cross-Subject) (Global Rank) | Accuracy (Cross-Setup) (Global Rank) | GFLOPS per prediction (Global Rank)
S-TR (2-stream) | 84.8 (# 36) | 86.2 (# 36) | 32.4 (# 3)
S-TR (1-stream) | 80.2 (# 46) | 81.8 (# 46) | 16.2 (# 6)
AGCN (2-stream) | 84 (# 39) | 85.4 (# 39) | 37.38 (# 1)
AGCN (1-stream) | 79.7 (# 47) | 80.7 (# 49) | 18.69 (# 4)
ST-GCN (2-stream) | 83.7 (# 41) | 85.1 (# 40) | 33.46 (# 2)
ST-GCN (1-stream) | 79 (# 50) | - | 16.73 (# 5)
CoST-GCN* (2-stream) | 84.0 (# 39) | 85.5 (# 38) | 0.32 (# 8)
CoS-TR* (2-stream) | 84.8 (# 36) | 86.1 (# 37) | 0.3 (# 9)
CoS-TR* (1-stream) | 79.7 (# 47) | 81.7 (# 47) | 0.15 (# 12)
CoAGCN* (2-stream) | 80.4 (# 45) | 82 (# 45) | 0.44 (# 7)
CoAGCN* (1-stream) | 77.3 (# 52) | 79.1 (# 51) | 0.22 (# 10)
CoST-GCN* (1-stream) | 79.4 (# 49) | 81.6 (# 48) | 0.16 (# 11)
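As a rough check of the complexity reduction reported in the abstract, the per-prediction GFLOPS of the 1-stream NTU RGB+D 120 entries above can be divided pairwise. The sketch below assumes that each baseline (ST-GCN, AGCN, S-TR) is to be compared with the starred continual variant listed for the same dataset; that pairing, and the `gflops` and `reductions` names, are assumptions made for illustration. The inputs are the rounded values from the results listing, so the ratios are approximate.

```python
# GFLOPS per prediction, copied from the 1-stream NTU RGB+D 120 rows.
# Each entry is (baseline, continual variant); the pairing is an assumption.
gflops = {
    "ST-GCN": (16.73, 0.16),  # vs. CoST-GCN* (1-stream)
    "AGCN": (18.69, 0.22),    # vs. CoAGCN* (1-stream)
    "S-TR": (16.2, 0.15),     # vs. CoS-TR* (1-stream)
}

# Reduction factor = baseline cost divided by continual cost.
reductions = {name: base / cont for name, (base, cont) in gflops.items()}

for name, factor in reductions.items():
    print(f"{name}: about {factor:.1f}x fewer GFLOPS per prediction")
```

With these rounded inputs the factors come out at roughly 105x, 85x, and 108x, which is in line with the "up to 109x" figure stated in the abstract.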
