FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks

11 Mar 2024 · Muhammad Saif Ullah Khan, Muhammad Ferjad Naeem, Federico Tombari, Luc van Gool, Didier Stricker, Muhammad Zeshan Afzal

We propose FocusCLIP, which integrates subject-level guidance (a specialized mechanism for target-specific supervision) into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our novel contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps emulating human visual attention mechanisms to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions to provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpassed CLIP by an average of 8.61% across five previously unseen datasets covering three human-centric tasks: FocusCLIP achieved an average accuracy of 33.65%, compared to 25.04% for CLIP. We observed a 3.98% improvement in activity recognition, a 14.78% improvement in age classification, and a 7.06% improvement in emotion recognition. Moreover, using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. We also demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks: FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification using the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
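The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated on this page; the variable names are illustrative. It suggests that the reported gains are absolute differences in accuracy (percentage points) rather than relative changes, and that the 8.61 average is consistent with the mean of the three per-task improvements. That reading is an inference from the numbers, not a statement from the authors.

```python
# Average zero-shot accuracies quoted in the abstract (percent).
focusclip_avg = 33.65
clip_avg = 25.04

# Per-task improvements quoted in the abstract.
task_gains = {
    "activity recognition": 3.98,
    "age classification": 14.78,
    "emotion recognition": 7.06,
}

# Absolute gap between the two averages, in percentage points.
absolute_gain = round(focusclip_avg - clip_avg, 2)

# Mean of the three per-task improvements.
mean_task_gain = round(sum(task_gains.values()) / len(task_gains), 2)

# The same gap expressed relative to CLIP's average accuracy (percent).
relative_gain = round(100 * (focusclip_avg - clip_avg) / clip_avg, 1)

# Stanford40 top-3 accuracies from the results table: FocusCLIP minus CLIP.
stanford40_gain = round(10.47 - 6.49, 2)

print(absolute_gain, mean_task_gain, relative_gain, stanford40_gain)
```

Both the gap between the averages and the mean of the per-task gains come out to 8.61, and the Stanford40 difference matches the 3.98 quoted for activity recognition. Expressed relative to CLIP's average accuracy, the same gap would be roughly a third, which is a noticeably different number from the one in the abstract.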

Code

No code implementations yet.

Tasks

Activity Recognition

Age Classification

Emotion Recognition

Datasets

Introduced in the Paper:

MPII Human Pose Descriptions

Used in the Paper:

CUB-200-2011

MPII

UTKFace

MPII Human Pose

FER+

EMOTIC

LAGENDA

Results from the Paper

Ranked #1 on Emotion Recognition on EMOTIC

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Emotion Recognition	EMOTIC	FocusCLIP	Top-3 Accuracy (%)	13.73	#1
Age Classification	EMOTIC	CLIP	Top-1 Accuracy (%)	37.56	#2
Age Classification	EMOTIC	FocusCLIP	Top-1 Accuracy (%)	41.80	#1
Activity Recognition	Stanford40	CLIP	Top-3 Accuracy (%)	6.49	#2
Activity Recognition	Stanford40	FocusCLIP	Top-3 Accuracy (%)	10.47	#1
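
The table reports Top-1 and Top-3 accuracy. As a reminder of what these metrics measure, here is a minimal sketch of top-k accuracy in plain Python. The function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the paper; in a CLIP-style zero-shot setup the scores would typically be image-text similarities between each image and one text prompt per class.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 5 classes (made-up scores, not real model outputs).
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.10],  # true class 1 is ranked first
    [0.30, 0.10, 0.40, 0.15, 0.05],  # true class 0 is ranked second
    [0.05, 0.10, 0.15, 0.30, 0.40],  # true class 0 is ranked last
    [0.25, 0.20, 0.30, 0.15, 0.10],  # true class 2 is ranked first
]
labels = [1, 0, 0, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
print(top1, top3)
```

Top-3 accuracy is never lower than Top-1 on the same predictions, so the two metric types in the table are not directly comparable across rows.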

Methods

CLIP • Visual Attention
