Hulk: A Universal Knowledge Translator for Human-Centric Tasks

4 Dec 2023 · Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore human-centric 3D and vision-language tasks, and they required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., language, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance on 11 benchmarks. The code is available at https://github.com/OpenGVLab/Hulk.
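
The two-head idea in the abstract can be illustrated with a toy sketch. This is not the Hulk implementation: the vocabulary, the task-to-head mapping, the image size, and every function name below are hypothetical, chosen only to show how one discrete head and one continuous head could serve different tasks through a single entry point.

```python
from typing import Sequence, Tuple, Union

# Hypothetical vocabulary for the discrete head (illustration only).
VOCAB = ["a", "person", "walking", "standing"]

# Assumed mapping from task name to output head; not taken from the paper.
TASK_TO_HEAD = {
    "pedestrian image caption": "discrete",
    "pose estimation": "continuous",
}


def discrete_head(logits: Sequence[float]) -> str:
    """Return the vocabulary token with the highest score (argmax)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return VOCAB[best]


def continuous_head(
    features: Sequence[float], width: int, height: int
) -> Tuple[float, float]:
    """Scale two normalized values in [0, 1] to pixel coordinates."""
    return (features[0] * width, features[1] * height)


def translate(
    task: str, features: Sequence[float], width: int = 256, height: int = 192
) -> Union[str, Tuple[float, float]]:
    """Route decoder features to the head assumed for the given task."""
    if TASK_TO_HEAD[task] == "discrete":
        return discrete_head(features)
    return continuous_head(features, width, height)


caption_token = translate("pedestrian image caption", [0.1, 0.9, 0.2, 0.0])
keypoint = translate("pose estimation", [0.5, 0.25])
```

In this sketch a captioning task yields a token while a pose task yields a coordinate pair, and adding a task only requires a new entry in the mapping rather than a new head.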

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Human Pose Estimation | 3DPW | Hulk (ViT-L) | PA-MPJPE | 38.5 | #3 |
| 3D Human Pose Estimation | 3DPW | Hulk (ViT-L) | MPJPE | 66.3 | #13 |
| 3D Human Pose Estimation | 3DPW | Hulk (ViT-L) | MPVPE | 77.4 | #9 |
| 3D Human Pose Estimation | 3DPW | Hulk (ViT-B) | PA-MPJPE | 39.9 | #8 |
| 3D Human Pose Estimation | 3DPW | Hulk (ViT-B) | MPJPE | 67 | #14 |
| 3D Human Pose Estimation | 3DPW | Hulk (ViT-B) | MPVPE | 79.8 | #15 |
| Pose Estimation | AIC | Hulk (Finetune, ViT-B) | AP | 35.6 | #2 |
| Pose Estimation | AIC | Hulk (Finetune, ViT-L) | AP | 37.1 | #1 |
| Human Part Segmentation | CIHP | Hulk (Finetune, ViT-L) | Mean IoU | 72.68 | #1 |
| Human Part Segmentation | CIHP | Hulk (Finetune, ViT-B) | Mean IoU | 71.26 | #2 |
| Object Detection | CrowdHuman (full body) | Hulk (Finetune, ViT-B) | AP | 92.4 | #8 |
| Object Detection | CrowdHuman (full body) | Hulk (Finetune, ViT-B) | mMR | 40.7 | #6 |
| Object Detection | CrowdHuman (full body) | Hulk (Finetune, ViT-L) | AP | 93 | #5 |
| Object Detection | CrowdHuman (full body) | Hulk (Finetune, ViT-L) | mMR | 36.5 | #1 |
| Pedestrian Image Caption | CUHK-PEDES | Hulk (ViT-B) | BLEU4 | 31.1 | #2 |
| Pedestrian Image Caption | CUHK-PEDES | Hulk (ViT-B) | CIDEr | 91.4 | #2 |
| Pedestrian Image Caption | CUHK-PEDES | Hulk (ViT-L) | BLEU4 | 31.6 | #1 |
| Pedestrian Image Caption | CUHK-PEDES | Hulk (ViT-L) | CIDEr | 94.5 | #1 |
| Human Part Segmentation | Human3.6M | Hulk (Finetune, ViT-L) | mIoU | 69.89 | #1 |
| Human Part Segmentation | Human3.6M | Hulk (Finetune, ViT-B) | mIoU | 68.56 | #2 |
| Semantic Segmentation | LIP val | Hulk (Finetune, ViT-B) | mIoU | 63.98% | #2 |
| Semantic Segmentation | LIP val | Hulk (Finetune, ViT-L) | mIoU | 66.02% | #1 |
| Pose Estimation | MS COCO | Hulk (Finetune, ViT-L) | AP | 78.7 | #3 |
| Pose Estimation | MS COCO | Hulk (Finetune, ViT-B) | AP | 77.5 | #5 |
| Skeleton Based Action Recognition | NTU RGB+D | Hulk (Finetune, ViT-B) | Accuracy (CS) | 94 | #3 |
| Skeleton Based Action Recognition | NTU RGB+D | Hulk (Finetune, ViT-L) | Accuracy (CS) | 94.3 | #1 |
| Pedestrian Attribute Recognition | PA-100K | Hulk (Finetune, ViT-L) | Accuracy | 88.97 | #2 |
| Pedestrian Attribute Recognition | PA-100K | Hulk (Finetune, ViT-B) | Accuracy | 87.85 | #3 |
| Pedestrian Attribute Recognition | RAPv2 | Hulk (Finetune, ViT-B) | Accuracy | 85.26 | #2 |
| Pedestrian Attribute Recognition | RAPv2 | Hulk (Finetune, ViT-L) | Accuracy | 85.86 | #1 |
