MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are applied alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% in P-MPJPE and 7.6% in MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
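The alternating design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that): the attention here is single-head with identity projections and no learned weights, and the names `self_attention`, `spatial_block`, `temporal_block`, `mixed_encoder`, and `head`, the block order (spatial first), and the depth are all assumptions made for illustration. The point is only which axis acts as the token axis: the spatial block attends over joints within each frame, while the temporal block attends over frames separately for each joint.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Hypothetical stand-in for a transformer block: single-head attention
    # with identity Q/K/V projections. tokens: (batch, n_tokens, channels).
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def spatial_block(x):
    # x: (T, J, C). Frames are the batch axis, joints are the tokens,
    # so attention mixes information between joints of the same frame.
    return x + self_attention(x)


def temporal_block(x):
    # Joints become the batch axis, frames are the tokens, so each
    # joint's trajectory is modeled separately from the other joints.
    xt = x.transpose(1, 0, 2)            # (J, T, C)
    xt = xt + self_attention(xt)
    return xt.transpose(1, 0, 2)         # back to (T, J, C)


def mixed_encoder(x, depth=2):
    # Alternate the two blocks; the order and depth are assumptions.
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x


rng = np.random.default_rng(0)
T, J, C = 9, 17, 8                       # frames, joints, feature channels
features = rng.standard_normal((T, J, C))
encoded = mixed_encoder(features)

# Seq2seq output: a (hypothetical, random) linear head maps every frame's
# joint features to 3D coordinates, so all T frames get a pose.
head = rng.standard_normal((C, 3))
poses_3d = encoded @ head
print(poses_3d.shape)                    # (9, 17, 3)
```

Because the output keeps the frame axis, one forward pass yields a 3D pose for every input frame rather than only the central one, which is the seq2seq property the abstract refers to.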

CVPR 2022
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Monocular 3D Human Pose Estimation | Human3.6M | MixSTE (HRNet, T=243) | Average MPJPE (mm) | 39.8 | # 6 |
| | | | Use Video Sequence | Yes | # 1 |
| | | | Frames Needed | 243 | # 33 |
| | | | Need Ground Truth 2D Pose | No | # 1 |
| | | | 2D detector | HRNet | # 1 |
| 3D Human Pose Estimation | Human3.6M | MixSTE (T=243 GT) | Average MPJPE (mm) | 21.6 | # 14 |
| | | | Using 2D ground-truth joints | Yes | # 2 |
| | | | Multi-View or Monocular | Monocular | # 1 |
| 3D Human Pose Estimation | Human3.6M | MixSTE (T=81 GT) | Average MPJPE (mm) | 25.9 | # 21 |
| | | | Using 2D ground-truth joints | Yes | # 2 |
| | | | Multi-View or Monocular | Monocular | # 1 |
| 3D Human Pose Estimation | Human3.6M | MixSTE (HRNet, T=243) | Average MPJPE (mm) | 39.8 | # 70 |
| | | | Using 2D ground-truth joints | No | # 2 |
| | | | Multi-View or Monocular | Monocular | # 1 |
| 3D Human Pose Estimation | Human3.6M | MixSTE (CPN, T=243) | Average MPJPE (mm) | 40.9 | # 75 |
| | | | Using 2D ground-truth joints | No | # 2 |
| | | | Multi-View or Monocular | Monocular | # 1 |
| 3D Human Pose Estimation | Human3.6M | MixSTE (CPN, T=81) | Average MPJPE (mm) | 42.4 | # 80 |
| | | | Using 2D ground-truth joints | No | # 2 |
| | | | Multi-View or Monocular | Monocular | # 1 |
| 3D Human Pose Estimation | HumanEva-I | MixSTE (T=43, FT) | Mean Reconstruction Error (mm) | 16.1 | # 7 |
| 3D Human Pose Estimation | MPI-INF-3DHP | MixSTE (T=27) | AUC | 66.5 | # 18 |
| | | | MPJPE | 54.9 | # 19 |
| | | | PCK | 94.4 | # 18 |
| 3D Human Pose Estimation | MPI-INF-3DHP | MixSTE (T=1) | AUC | 63.8 | # 20 |
| | | | MPJPE | 57.9 | # 21 |
| | | | PCK | 94.2 | # 19 |
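For readers unfamiliar with the metrics reported above, the following is a minimal NumPy sketch of MPJPE (mean per-joint position error) and P-MPJPE (the same error after a rigid Procrustes alignment of the prediction to the ground truth). It follows the standard definitions rather than the evaluation code used for this leaderboard; the function names `mpjpe` and `p_mpjpe` and the single-pose `(J, 3)` input shape are assumptions for illustration. The result is in the same unit as the input coordinates (millimetres if the poses are given in millimetres).

```python
import numpy as np


def mpjpe(pred, target):
    # Mean Euclidean distance between predicted and true joints.
    # pred, target: (J, 3) arrays of 3D joint positions.
    return np.linalg.norm(pred - target, axis=-1).mean()


def p_mpjpe(pred, target):
    # MPJPE after aligning pred to target with the best similarity
    # transform (rotation, uniform scale, translation).
    mu_t, mu_p = target.mean(axis=0), pred.mean(axis=0)
    t0, p0 = target - mu_t, pred - mu_p
    norm_t, norm_p = np.linalg.norm(t0), np.linalg.norm(p0)
    t0, p0 = t0 / norm_t, p0 / norm_p

    # Orthogonal Procrustes: rotation from the SVD of the covariance.
    u, s, vt = np.linalg.svd(t0.T @ p0)
    v = vt.T
    r = v @ u.T
    if np.linalg.det(r) < 0:
        # Avoid reflections: flip the axis with the smallest singular value.
        v[:, -1] *= -1
        s[-1] *= -1
        r = v @ u.T

    scale = s.sum() * norm_t / norm_p
    aligned = scale * (pred - mu_p) @ r + mu_t
    return mpjpe(aligned, target)


# Usage: a prediction offset by (3, 4, 0) from the truth has MPJPE 5,
# while the rigid alignment removes the offset entirely.
rng = np.random.default_rng(0)
gt = rng.standard_normal((17, 3)) * 100.0
shifted = gt + np.array([3.0, 4.0, 0.0])
print(round(float(mpjpe(shifted, gt)), 6))    # 5.0
print(float(p_mpjpe(shifted, gt)) < 1e-6)     # True
```

P-MPJPE is therefore never larger than MPJPE for the same prediction up to numerical error, since it discounts global rotation, scale, and translation before measuring the per-joint error.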

Methods