TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Open Vocabulary Panoptic Segmentation	ADE20K	CLIPSelf	PQ	23.7	# 3
Open Vocabulary Semantic Segmentation	ADE20K-150	CLIPSelf	mIoU	34.5	# 4
Open Vocabulary Semantic Segmentation	ADE20K-847	CLIPSelf	mIoU	12.4	# 8
Open Vocabulary Object Detection	LVIS v1.0	CLIPSelf	AP novel-LVIS base training	34.9	# 3
Open Vocabulary Object Detection	MSCOCO	CLIPSelf	AP 0.5	44.3	# 5
Open Vocabulary Semantic Segmentation	PASCAL Context-59	CLIPSelf	mIoU	62.3	# 3

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

2 Oct 2023 · Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy

Open-vocabulary dense prediction tasks, including object detection and image segmentation, have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we present an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. We then propose an approach named CLIPSelf, which adapts the image-level recognition ability of a CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf enables a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
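
The alignment step described in the abstract can be illustrated with a small NumPy sketch. This is not the released implementation: the function names `pool_region` and `self_distill_loss` are hypothetical, mean pooling over grid cells stands in for whatever region pooling the authors use, and a cosine-distance loss is assumed as the alignment objective. The dense feature map and the crop embeddings are taken as given arrays rather than computed by a real CLIP ViT.

```python
import numpy as np


def pool_region(feature_map, box):
    """Mean-pool dense features inside a normalized (x0, y0, x1, y1) box.

    feature_map has shape (H, W, D). Mean pooling is an assumption made
    for this sketch; it stands in for a proper region pooling operator.
    """
    h, w, d = feature_map.shape
    x0, y0, x1, y1 = box
    # Map normalized coordinates to grid indices, keeping at least one cell.
    c0 = min(int(np.floor(x0 * w)), w - 1)
    r0 = min(int(np.floor(y0 * h)), h - 1)
    c1 = min(max(int(np.ceil(x1 * w)), c0 + 1), w)
    r1 = min(max(int(np.ceil(y1 * h)), r0 + 1), h)
    return feature_map[r0:r1, c0:c1].reshape(-1, d).mean(axis=0)


def self_distill_loss(feature_map, boxes, crop_embeddings, eps=1e-8):
    """Average cosine distance between region features and crop embeddings.

    feature_map: (H, W, D) dense features of the full image (student side).
    boxes: list of normalized boxes, one per image crop.
    crop_embeddings: (N, D) image-level embeddings of the same crops
        (teacher side), assumed to be precomputed.
    """
    regions = np.stack([pool_region(feature_map, b) for b in boxes])
    regions = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + eps)
    targets = crop_embeddings / (
        np.linalg.norm(crop_embeddings, axis=1, keepdims=True) + eps
    )
    cosine = (regions * targets).sum(axis=1)
    return float(np.mean(1.0 - cosine))
```

In this sketch the loss approaches zero when each pooled region feature points in the same direction as the embedding of its crop, which is the sense in which the dense features are pulled toward the image-level representation. No region-text pairs appear anywhere in the computation.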

Code

wusize/clipself (official), 133 stars

Tasks

Image Classification

Image Segmentation

object-detection

Object Detection

Open Vocabulary Object Detection

Open Vocabulary Panoptic Segmentation

Open Vocabulary Semantic Segmentation

Panoptic Segmentation

Segmentation

Semantic Segmentation

Zero-Shot Image Classification

Datasets

MS COCO

ADE20K

LVIS

PASCAL Context

Results from the Paper

Ranked #3 on Open Vocabulary Semantic Segmentation on PASCAL Context-59

Methods

CLIP
