TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Animal Pose Estimation	AP-10K	SimpleBaseline-ResNet50	AP	68.1	# 9
Animal Pose Estimation	AP-10K	ViTPose+-S ViT-S	AP	71.4	# 8
Animal Pose Estimation	AP-10K	HRNet-w32	AP	72.2	# 7
Animal Pose Estimation	AP-10K	HRNet-w48	AP	73.1	# 6
Animal Pose Estimation	AP-10K	ViTPose+-H	AP	82.4	# 1
Animal Pose Estimation	AP-10K	ViTPose+-B	AP	74.5	# 5
Animal Pose Estimation	AP-10K	ViTPose+-L	AP	80.4	# 2
2D Human Pose Estimation	COCO-WholeBody	ViTPose+-H	WB	61.2	# 6
2D Human Pose Estimation	COCO-WholeBody	ViTPose+-H	body	75.9	# 1
2D Human Pose Estimation	COCO-WholeBody	ViTPose+-H	foot	77.9	# 2
2D Human Pose Estimation	COCO-WholeBody	ViTPose+-H	face	63.3	# 8
2D Human Pose Estimation	COCO-WholeBody	ViTPose+-H	hand	54.7	# 6
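
As a reading aid for the AP-10K rows above, the following minimal Python sketch orders the listed models by AP and computes the gap between the top entry and the strongest HRNet baseline. The AP values are copied from the table; the variable names are illustrative only. The ordering it prints is local to these seven rows, whereas the Global Rank column refers to the full leaderboard.

```python
# AP-10K results copied from the table above (model name -> AP).
ap10k_ap = {
    "SimpleBaseline-ResNet50": 68.1,
    "ViTPose+-S ViT-S": 71.4,
    "HRNet-w32": 72.2,
    "HRNet-w48": 73.1,
    "ViTPose+-H": 82.4,
    "ViTPose+-B": 74.5,
    "ViTPose+-L": 80.4,
}

# Sort models from highest to lowest AP.
ranked = sorted(ap10k_ap.items(), key=lambda item: item[1], reverse=True)
best_model, best_ap = ranked[0]

# Strongest HRNet baseline among the listed rows.
best_hrnet_ap = max(ap for name, ap in ap10k_ap.items() if name.startswith("HRNet"))

# Round to one decimal to avoid floating-point noise in the difference.
gap = round(best_ap - best_hrnet_ap, 1)

for name, ap in ranked:
    print(f"{name:<26}{ap:5.1f}")
print(f"{best_model} leads the best listed HRNet entry by {gap} AP")
```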

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitpose-vision-transformer-foundation-model/animal-pose-estimation-on-ap-10k)](https://paperswithcode.com/sota/animal-pose-estimation-on-ap-10k?p=vitpose-vision-transformer-foundation-model)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitpose-vision-transformer-foundation-model/2d-human-pose-estimation-on-coco-wholebody-1)](https://paperswithcode.com/sota/2d-human-pose-estimation-on-coco-wholebody-1?p=vitpose-vision-transformer-foundation-model)`
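
Both badge snippets above follow the same URL pattern, differing only in the benchmark slug. A small sketch of that pattern is shown below; the function name `pwc_badge_markdown` and its parameters are hypothetical helpers introduced for illustration, and the pattern itself is inferred only from the two snippets on this page.

```python
def pwc_badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Build badge Markdown following the pattern of the two snippets above.

    Hypothetical helper: the URL layout is inferred from this page only.
    """
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    target_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({target_url})"


# Slugs copied from the snippets above.
paper = "vitpose-vision-transformer-foundation-model"
ap10k_badge = pwc_badge_markdown(paper, "animal-pose-estimation-on-ap-10k")
wholebody_badge = pwc_badge_markdown(paper, "2d-human-pose-estimation-on-coco-wholebody-1")
print(ap10k_badge)
print(wholebody_badge)
```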

ViTPose++: Vision Transformer for Generic Body Pose Estimation

7 Dec 2022 · Yufei Xu, Jing Zhang, Qiming Zhang, DaCheng Tao

In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Building on this flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, and MPII for human keypoint detection, COCO-WholeBody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
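
The knowledge factorization idea in the abstract (task-agnostic and task-specific feed-forward networks) can be illustrated with a small, dependency-free Python sketch. This is not the authors' implementation: the class name `FactorizedFFN`, the choice to concatenate a shared projection with a per-task projection, the ReLU activation, and all dimensions are assumptions made for illustration.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single token vector x (lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, out_dim, in_dim):
    """Small random weights and zero bias; illustrative initialisation only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class FactorizedFFN:
    """Hypothetical FFN whose output is split into shared and per-task channels."""

    def __init__(self, dim, hidden_dim, shared_dim, tasks, seed=0):
        rng = random.Random(seed)
        # Expansion layer, shared by every task.
        self.expand = init_linear(rng, hidden_dim, dim)
        # Task-agnostic projection producing the first `shared_dim` channels.
        self.shared = init_linear(rng, shared_dim, hidden_dim)
        # One task-specific projection per task for the remaining channels.
        self.per_task = {t: init_linear(rng, dim - shared_dim, hidden_dim) for t in tasks}

    def __call__(self, token, task):
        hidden = relu(linear(token, *self.expand))
        shared_out = linear(hidden, *self.shared)
        task_out = linear(hidden, *self.per_task[task])
        # Concatenate task-agnostic and task-specific channels.
        return shared_out + task_out


ffn = FactorizedFFN(dim=8, hidden_dim=16, shared_dim=6, tasks=["human", "animal"])
token = [0.5] * 8
human_out = ffn(token, "human")
animal_out = ffn(token, "animal")
print(len(human_out), human_out[:6] == animal_out[:6])
```

In this sketch the first six output channels are identical for every task, while the last two depend on which task-specific projection is selected, so heterogeneous keypoint sets can share most parameters while keeping a small task-specific part.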

Code

vitae-transformer/vitpose official

1,170

Tasks

2D Human Pose Estimation

Animal Pose Estimation

Keypoint Detection

Pose Estimation

Datasets

ImageNet

MS COCO

MPII

OCHuman

AP-10K

COCO-WholeBody

AIC

Results from the Paper

Ranked #1 on Animal Pose Estimation on AP-10K (using extra training data)

Methods

Dense Connections • Layer Normalization • Linear Layer • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Vision Transformer
