TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Skeleton Based Action Recognition	NTU RGB+D	STEP-CATFormer	Accuracy (CV)	97.3	# 7
Skeleton Based Action Recognition	NTU RGB+D	STEP-CATFormer	Accuracy (CS)	93.2	# 7
Skeleton Based Action Recognition	NTU RGB+D	STEP-CATFormer	Ensembled Modalities	4	# 2
Skeleton Based Action Recognition	NTU RGB+D 120	STEP-CATFormer	Accuracy (Cross-Subject)	90.0	# 5
Skeleton Based Action Recognition	NTU RGB+D 120	STEP-CATFormer	Accuracy (Cross-Setup)	91.2	# 9
Skeleton Based Action Recognition	NTU RGB+D 120	STEP-CATFormer	Ensembled Modalities	4	# 1
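
The "Ensembled Modalities" value of 4 indicates that the reported accuracies come from combining four input streams. This page does not say which streams are used or how they are combined, so the following is only a minimal sketch of one common option, score-level fusion, in which per-class scores from each stream are summed and the top class is taken. The function name `fuse_scores`, the equal weights, and the score values are all illustrative assumptions, not the authors' procedure.

```python
def fuse_scores(stream_scores, weights=None):
    """Weighted sum of per-class scores from several streams.

    stream_scores: list of per-class score lists, one per stream.
    weights: optional per-stream weights (assumed equal if omitted).
    Returns (fused_scores, predicted_class_index).
    """
    if weights is None:
        weights = [1.0] * len(stream_scores)
    num_classes = len(stream_scores[0])
    fused = [0.0] * num_classes
    for w, scores in zip(weights, stream_scores):
        for c in range(num_classes):
            fused[c] += w * scores[c]
    predicted = max(range(num_classes), key=lambda c: fused[c])
    return fused, predicted


# Made-up softmax scores for one clip over 3 classes,
# from four hypothetical input streams.
stream_scores = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.50, 0.40, 0.10],
    [0.40, 0.35, 0.25],
]
fused, predicted = fuse_scores(stream_scores)
```

With these made-up scores, one stream on its own would pick a different class, while the fused scores favour the class that most streams agree on.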

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/step-catformer-spatial-temporal-effective/skeleton-based-action-recognition-on-ntu-rgbd-1)](https://paperswithcode.com/sota/skeleton-based-action-recognition-on-ntu-rgbd-1?p=step-catformer-spatial-temporal-effective)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/step-catformer-spatial-temporal-effective/skeleton-based-action-recognition-on-ntu-rgbd)](https://paperswithcode.com/sota/skeleton-based-action-recognition-on-ntu-rgbd?p=step-catformer-spatial-temporal-effective)`

STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition

6 Dec 2023 · Nguyen Huu Bao Long

Graph convolutional networks (GCNs) have been widely used and have achieved remarkable results in skeleton-based action recognition. We think the key to skeleton-based action recognition is how the skeleton changes across frames, so we focus on how graph convolutional networks learn different topologies and effectively aggregate joint features over both global and local temporal ranges. In this work, we propose three Channel-wise Topology Graph Convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules captures skeleton features of the upper-lower body-part and hand-foot relationships. Then, to capture how human skeletons change across frames, we design Temporal Attention Transformers, which learn the temporal features of human skeleton sequences. Finally, we fuse the temporal feature outputs with an MLP and perform classification. The resulting graph convolutional network, named Spatial-Temporal Effective Body-part Cross Attention Transformer (STEP CATFormer), achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer
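
To make the body-part cross-attention idea in the abstract concrete, here is a minimal sketch of scaled dot-product cross-attention in which upper-body joints query lower-body joints. It is not the authors' implementation: there are no learned projections, the 6-joint skeleton, the `UPPER`/`LOWER` index partition, and the feature values are hypothetical, and the official repository should be consulted for the actual modules.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections.

    queries: feature vectors of the joints that attend.
    keys_values: feature vectors of the joints attended to
                 (used as both keys and values in this sketch).
    Returns (outputs, weights), one row per query joint.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        w = softmax(scores)
        out = [
            sum(w[j] * keys_values[j][d] for j in range(len(keys_values)))
            for d in range(dim)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical single-frame features: 6 joints, 4 channels each.
features = [
    [1.0, 0.0, 0.5, 0.0],  # joint 0 (upper body, assumed)
    [0.0, 1.0, 0.0, 0.5],  # joint 1 (upper body, assumed)
    [0.5, 0.5, 0.0, 0.0],  # joint 2 (upper body, assumed)
    [1.0, 1.0, 0.0, 0.0],  # joint 3 (lower body, assumed)
    [0.0, 0.0, 1.0, 1.0],  # joint 4 (lower body, assumed)
    [0.5, 0.0, 0.5, 0.0],  # joint 5 (lower body, assumed)
]
UPPER = [0, 1, 2]  # assumed partition, not taken from the paper
LOWER = [3, 4, 5]

upper = [features[i] for i in UPPER]
lower = [features[i] for i in LOWER]

# Upper-body joints gather information from lower-body joints.
upper_attended, upper_weights = cross_attention(upper, lower)
```

Each row of `upper_weights` is a probability distribution over the lower-body joints, so every upper-body joint receives a weighted mix of lower-body features. Swapping the two arguments gives the reverse direction, and the same function could be reused for a hand-foot partition.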

Code

maclong01/STEP-CATFormer official

Tasks

Action Recognition

Skeleton Based Action Recognition

Datasets

NTU RGB+D

NTU RGB+D 120

Results from the Paper

Ranked #5 on Skeleton Based Action Recognition on NTU RGB+D 120 (using extra training data)

Methods

Convolution • Focus • Temporal attention • Transformer
