TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Monocular 3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Average MPJPE (mm)	39.8	# 6
Monocular 3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Use Video Sequence	Yes	# 1
Monocular 3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Frames Needed	243	# 33
Monocular 3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Need Ground Truth 2D Pose	No	# 1
Monocular 3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	2D detector	HRNet	# 1
3D Human Pose Estimation	Human3.6M	MixSTE (T=243 GT)	Average MPJPE (mm)	21.6	# 14
3D Human Pose Estimation	Human3.6M	MixSTE (T=243 GT)	Using 2D ground-truth joints	Yes	# 2
3D Human Pose Estimation	Human3.6M	MixSTE (T=243 GT)	Multi-View or Monocular	Monocular	# 1
3D Human Pose Estimation	Human3.6M	MixSTE (T=81 GT)	Average MPJPE (mm)	25.9	# 21
3D Human Pose Estimation	Human3.6M	MixSTE (T=81 GT)	Using 2D ground-truth joints	Yes	# 2
3D Human Pose Estimation	Human3.6M	MixSTE (T=81 GT)	Multi-View or Monocular	Monocular	# 1
3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Average MPJPE (mm)	39.8	# 70
3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Using 2D ground-truth joints	No	# 2
3D Human Pose Estimation	Human3.6M	MixSTE (HRNet, T=243)	Multi-View or Monocular	Monocular	# 1
3D Human Pose Estimation	Human3.6M	MixSTE (CPN, T=243)	Average MPJPE (mm)	40.9	# 75
3D Human Pose Estimation	Human3.6M	MixSTE (CPN, T=243)	Using 2D ground-truth joints	No	# 2
3D Human Pose Estimation	Human3.6M	MixSTE (CPN, T=243)	Multi-View or Monocular	Monocular	# 1
3D Human Pose Estimation	Human3.6M	MixSTE (CPN, T=81)	Average MPJPE (mm)	42.4	# 80
3D Human Pose Estimation	Human3.6M	MixSTE (CPN, T=81)	Using 2D ground-truth joints	No	# 2
3D Human Pose Estimation	Human3.6M	MixSTE (CPN, T=81)	Multi-View or Monocular	Monocular	# 1
3D Human Pose Estimation	HumanEva-I	MixSTE (T=43, FT)	Mean Reconstruction Error (mm)	16.1	# 7
3D Human Pose Estimation	MPI-INF-3DHP	MixSTE (T=27)	AUC	66.5	# 18
3D Human Pose Estimation	MPI-INF-3DHP	MixSTE (T=27)	MPJPE	54.9	# 19
3D Human Pose Estimation	MPI-INF-3DHP	MixSTE (T=27)	PCK	94.4	# 18
3D Human Pose Estimation	MPI-INF-3DHP	MixSTE (T=1)	AUC	63.8	# 20
3D Human Pose Estimation	MPI-INF-3DHP	MixSTE (T=1)	MPJPE	57.9	# 21
3D Human Pose Estimation	MPI-INF-3DHP	MixSTE (T=1)	PCK	94.2	# 19
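
Most rows above report MPJPE (Mean Per Joint Position Error) in millimetres. The following is a minimal sketch of how that metric is commonly computed, assuming the predicted and ground-truth poses are already expressed in the same root-relative coordinate frame. The function name `mpjpe` and the toy data are illustrative and are not taken from the MixSTE code.

```python
import math

def mpjpe(predicted, target):
    """Mean Euclidean distance between matching joints, averaged over all
    joints of all frames. The result is in the same unit as the inputs."""
    total = 0.0
    count = 0
    for pred_frame, gt_frame in zip(predicted, target):
        for pred_joint, gt_joint in zip(pred_frame, gt_frame):
            total += math.dist(pred_joint, gt_joint)
            count += 1
    return total / count

# Toy example: one frame, two joints, coordinates in millimetres.
predicted = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
print(mpjpe(predicted, target))  # 2.5
```

The first joint matches exactly and the second is 5 mm off, so the mean over the two joints is 2.5 mm.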

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixste-seq2seq-mixed-spatio-temporal-encoder/monocular-3d-human-pose-estimation-on-human3)](https://paperswithcode.com/sota/monocular-3d-human-pose-estimation-on-human3?p=mixste-seq2seq-mixed-spatio-temporal-encoder)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixste-seq2seq-mixed-spatio-temporal-encoder/3d-human-pose-estimation-on-humaneva-i)](https://paperswithcode.com/sota/3d-human-pose-estimation-on-humaneva-i?p=mixste-seq2seq-mixed-spatio-temporal-encoder)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixste-seq2seq-mixed-spatio-temporal-encoder/3d-human-pose-estimation-on-human36m)](https://paperswithcode.com/sota/3d-human-pose-estimation-on-human36m?p=mixste-seq2seq-mixed-spatio-temporal-encoder)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mixste-seq2seq-mixed-spatio-temporal-encoder/3d-human-pose-estimation-on-mpi-inf-3dhp)](https://paperswithcode.com/sota/3d-human-pose-estimation-on-mpi-inf-3dhp?p=mixste-seq2seq-mixed-spatio-temporal-encoder)`

MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

CVPR 2022 · Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, Junsong Yuan

Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering the body joints of all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the strong inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are applied alternately to obtain a better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
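
The alternating design described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the attention here has no learned projections, no multiple heads, and no MLP or normalization layers, and the spatial-then-temporal order, the function names, and the tensor sizes are all assumptions chosen for illustration. The point is only to show which axis each block attends over and that the output keeps one feature vector per joint per frame (seq2seq).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Projection-free single-head self-attention (illustrative only).
    tokens has shape (batch, n_tokens, dim)."""
    dim = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores, axis=-1) @ tokens

def spatial_block(x):
    # x: (frames, joints, channels). Each frame is a batch item and the
    # joints are the tokens, so attention mixes joints within one frame.
    return x + self_attention(x)

def temporal_block(x):
    # Each joint is a batch item and the frames are the tokens, so the
    # motion of every joint is modelled separately over time.
    per_joint = x.transpose(1, 0, 2)          # (joints, frames, channels)
    per_joint = per_joint + self_attention(per_joint)
    return per_joint.transpose(1, 0, 2)       # back to (frames, joints, channels)

def mixed_encoder(x, depth=2):
    # Alternate the two blocks; the order within a layer is an assumption.
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x

rng = np.random.default_rng(0)
frames, joints, channels = 9, 17, 8            # hypothetical sizes
features = rng.normal(size=(frames, joints, channels))
encoded = mixed_encoder(features)
print(encoded.shape)  # (9, 17, 8): one output per joint for every input frame
```

Because the temporal block treats joints as independent batch items, changing the features of one joint does not affect the temporal encoding of any other joint; information between joints flows only through the spatial block.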

Code

JinluZhang1126/MixSTE (official), 181 stars

Tasks

3D Human Pose Estimation

Monocular 3D Human Pose Estimation

Pose Estimation

Datasets

Human3.6M

MPI-INF-3DHP

Results from the Paper

Ranked #6 on Monocular 3D Human Pose Estimation on Human3.6M


Methods

Spatial Transformer
