Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition

Skeleton data convey significant information for action recognition since they are robust against cluttered backgrounds and illumination variation. In recent years, methods based on convolutional neural networks (CNNs) or recurrent neural networks have lagged in recognition accuracy because of their limited ability to extract spatial–temporal features from skeleton data. A series of methods based on graph convolutional networks (GCNs) have achieved remarkable performance and gradually become dominant. However, the computational cost of GCN-based methods is quite heavy; several works even exceed 100 GFLOPs, which is at odds with the highly condensed nature of skeleton data. In this paper, a novel multi-scale spatial–temporal convolutional (MSST) module is proposed to exploit the implicit complementary advantages across spatial–temporal representations at different scales. Instead of converting skeleton data into pseudo-images as in some previous CNN-based methods, or resorting to complex graph convolutions, we make full use of multi-scale convolutions along the temporal and spatial dimensions to capture comprehensive dependencies among skeleton joints. Building on the MSST module, a multi-scale spatial–temporal convolutional neural network (MSSTNet) is proposed to capture high-level spatial–temporal semantic features for action recognition. Unlike previous methods that boost performance at the cost of computation, MSSTNet can be easily implemented with a light model size and fast inference. Moreover, MSSTNet is used in a four-stream framework to fuse data of different modalities, yielding a notable improvement in recognition accuracy. On the NTU RGB+D 60, NTU RGB+D 120, UAV-Human and Northwestern-UCLA datasets, the proposed MSSTNet achieves competitive performance with much less computational cost than state-of-the-art methods.
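As a rough illustration of the core idea, the sketch below shows a multi-scale spatial–temporal convolutional block in PyTorch that operates directly on skeleton tensors of shape (N, C, T, V), i.e. batch, channels, frames, joints, without converting them to pseudo-images or applying graph convolutions. The class name, branch count, kernel sizes and residual design are assumptions made for illustration; they are not the paper's exact MSST configuration.

```python
# Hypothetical sketch of a multi-scale spatial-temporal convolution block.
# Branch count, kernel sizes and residual design are assumptions, not the
# authors' exact MSST module.
import torch
import torch.nn as nn


class MSSTBlock(nn.Module):
    """Parallel multi-scale convolutions over (time, joint) skeleton maps."""

    def __init__(self, in_channels, out_channels,
                 temporal_kernels=(3, 5, 7), spatial_kernel=3):
        super().__init__()
        assert out_channels % len(temporal_kernels) == 0
        branch_ch = out_channels // len(temporal_kernels)
        self.branches = nn.ModuleList(
            nn.Sequential(
                # temporal convolution: kernel spans k frames, one joint
                nn.Conv2d(in_channels, branch_ch,
                          kernel_size=(k, 1), padding=(k // 2, 0)),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
                # spatial convolution: kernel spans neighbouring joints
                nn.Conv2d(branch_ch, branch_ch,
                          kernel_size=(1, spatial_kernel),
                          padding=(0, spatial_kernel // 2)),
                nn.BatchNorm2d(branch_ch),
            )
            for k in temporal_kernels
        )
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.relu(out + self.residual(x))


if __name__ == "__main__":
    x = torch.randn(2, 3, 64, 25)      # e.g. 25 NTU RGB+D joints, 64 frames
    block = MSSTBlock(3, 96)
    print(block(x).shape)              # torch.Size([2, 96, 64, 25])
```

In this sketch, each branch pairs a temporal convolution of a different kernel size with a small spatial convolution, and the branch outputs are concatenated so that representations at different scales complement each other, which is the behaviour the abstract attributes to the MSST module.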

Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Skeleton Based Action Recognition | NTU RGB+D | MSSTNet | Accuracy (CV) | 97.8 | #2
Skeleton Based Action Recognition | NTU RGB+D | MSSTNet | Accuracy (CS) | 92.6 | #19
Skeleton Based Action Recognition | NTU RGB+D 120 | MSSTNet | Accuracy (Cross-Subject) | 87.4 | #24
Skeleton Based Action Recognition | NTU RGB+D 120 | MSSTNet | Accuracy (Cross-Setup) | 88.3 | #29
Skeleton Based Action Recognition | N-UCLA | MSSTNet | Accuracy | 95.3 | #10
Skeleton Based Action Recognition | UAV-Human | MSSTNet | CSv1 (%) | 43.0 | #2
Skeleton Based Action Recognition | UAV-Human | MSSTNet | CSv2 (%) | 70.1 | #1
