TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
3D Human Pose Estimation	Human3.6M	STCFormer	Average MPJPE (mm)	21.3	# 13
3D Human Pose Estimation	Human3.6M	STCFormer	Using 2D ground-truth joints	Yes	# 2
3D Human Pose Estimation	Human3.6M	STCFormer	Multi-View or Monocular	Monocular	# 1
3D Human Pose Estimation	MPI-INF-3DHP	STCFormer (T=81)	AUC	83.9	# 5
3D Human Pose Estimation	MPI-INF-3DHP	STCFormer (T=81)	MPJPE	23.1	# 6
3D Human Pose Estimation	MPI-INF-3DHP	STCFormer (T=81)	PCK	98.7	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/3d-human-pose-estimation-with-spatio-temporal/3d-human-pose-estimation-on-mpi-inf-3dhp)](https://paperswithcode.com/sota/3d-human-pose-estimation-on-mpi-inf-3dhp?p=3d-human-pose-estimation-with-spatio-temporal)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/3d-human-pose-estimation-with-spatio-temporal/3d-human-pose-estimation-on-human36m)](https://paperswithcode.com/sota/3d-human-pose-estimation-on-human36m?p=3d-human-pose-estimation-with-spatio-temporal)`

3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention

CVPR 2023 · Zhenhua Tang, Zhaofan Qiu, Yanbin Hao, Richang Hong, Ting Yao

Recent transformer-based solutions have shown great success in 3D human pose estimation. Nevertheless, the computational cost of calculating the joint-to-joint affinity matrix grows quadratically with the number of joints. This drawback becomes even worse for pose estimation in a video sequence, which requires spatio-temporal correlation spanning the entire video. In this paper, we alleviate the issue by decomposing correlation learning into space and time, and present a novel Spatio-Temporal Criss-cross attention (STC) block. Technically, STC first slices its input feature evenly into two partitions along the channel dimension, then performs spatial attention on one partition and temporal attention on the other. By concatenating the outputs of the two attention layers, STC simultaneously models the interactions between joints in the same frame and joints in the same trajectory. On this basis, we devise STCFormer by stacking multiple STC blocks, and further integrate a new Structure-enhanced Positional Embedding (SPE) into STCFormer to take the structure of the human body into consideration. The embedding function consists of two components: spatio-temporal convolution around neighboring joints to capture local structure, and part-aware embedding to indicate which part each joint belongs to. Extensive experiments on the Human3.6M and MPI-INF-3DHP benchmarks show superior results compared with state-of-the-art approaches. More remarkably, STCFormer achieves the best published performance to date: a 40.5 mm P1 error on the challenging Human3.6M dataset.
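
The channel split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it is single-head, omits residual connections, normalization, the feed-forward layer and the SPE, and the names `stc_block`, `attention`, `affinity_entries` and the weight dictionaries `w_spatial` / `w_temporal` are hypothetical. It only shows the idea of running spatial attention on one half of the channels, temporal attention on the other half, and concatenating the results.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention; the second-to-last axis is the sequence axis.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ np.swapaxes(k, -1, -2)) * scale, axis=-1)
    return weights @ v


def stc_block(x, w_spatial, w_temporal):
    """Simplified criss-cross attention over x of shape (T, J, C).

    T = frames, J = joints, C = channels (assumed even).
    w_spatial / w_temporal are hypothetical dicts holding 'q', 'k', 'v'
    projection matrices of shape (C // 2, C // 2).
    """
    half = x.shape[-1] // 2
    x_s, x_t = x[..., :half], x[..., half:]  # even split along channels

    # Spatial branch: joints in the same frame attend to each other.
    # Shapes: batch axis T, sequence axis J, so the affinity is (T, J, J).
    out_s = attention(x_s @ w_spatial["q"], x_s @ w_spatial["k"], x_s @ w_spatial["v"])

    # Temporal branch: each joint attends along its own trajectory.
    # Swap to (J, T, half) so the affinity is (J, T, T), then swap back.
    x_t = np.swapaxes(x_t, 0, 1)
    out_t = attention(x_t @ w_temporal["q"], x_t @ w_temporal["k"], x_t @ w_temporal["v"])
    out_t = np.swapaxes(out_t, 0, 1)

    # Concatenate both branches back to C channels.
    return np.concatenate([out_s, out_t], axis=-1)


def affinity_entries(T, J):
    # Number of affinity-matrix entries: full spatio-temporal attention
    # over all T*J tokens versus the decomposed spatial + temporal form.
    full = (T * J) ** 2
    criss_cross = T * J * J + J * T * T
    return full, criss_cross
```

As a rough illustration of the saving, with T=81 frames (the setting listed for MPI-INF-3DHP above) and an assumed skeleton of J=17 joints, `affinity_entries(81, 17)` gives 1,896,129 entries for full spatio-temporal attention versus 134,946 for the decomposed form. The joint count is an assumption for the arithmetic, not a figure taken from this page.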


Code


zhenhuat/STCFormer official

Tasks


3D Human Pose Estimation

Pose Estimation

Datasets

Human3.6M

MPI-INF-3DHP

Results from the Paper


Ranked #6 on 3D Human Pose Estimation on MPI-INF-3DHP



Methods


Convolution • Temporal attention
