HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

CVPR 2023  ·  Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin in various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.
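As a concrete illustration of the text-encoder-derived classifier described in the abstract, the sketch below builds HOI classification weights from CLIP embeddings of HOI descriptions and scores region features against them. This is a minimal sketch, not the authors' implementation: the prompt template, the function names (`build_hoi_classifier`, `classify_interactions`), and the ViT-B/32 backbone are illustrative assumptions; in HOICLIP the region features would come from the interaction decoder's cross-attention over the CLIP visual feature map rather than from raw image embeddings.

```python
# Minimal sketch (not the authors' code) of a CLIP-text-embedding HOI classifier.
# Requires the OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)


def build_hoi_classifier(verbs, objects):
    """Embed one description per (verb, object) pair; returns (num_classes, embed_dim)."""
    # Hypothetical prompt template; the paper's actual HOI descriptions may differ.
    prompts = [f"a photo of a person {v} a {o}" for v, o in zip(verbs, objects)]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(tokens)
    return text_feats / text_feats.norm(dim=-1, keepdim=True)


def classify_interactions(region_feats, classifier):
    """Score interaction region features against the text-derived classifier."""
    region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
    return 100.0 * region_feats @ classifier.t()  # logits over HOI classes
```

In this reading, the training-free enhancement mentioned in the abstract would correspond to combining such CLIP-derived global HOI scores with the detector's own predictions at inference time, without any additional training.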


Task                                 Dataset    Model     Metric    Value    Global Rank
Human-Object Interaction Detection   HICO-DET   HOICLIP   mAP       34.69    #13
Human-Object Interaction Detection   V-COCO     HOICLIP   AP(S1)    63.50    #8
Human-Object Interaction Detection   V-COCO     HOICLIP   AP(S2)    64.81    #11
