TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Open Vocabulary Semantic Segmentation	ADE20K-150	MAFT-ViTL	mIoU	32.0	# 8
Open Vocabulary Semantic Segmentation	ADE20K-847	MAFT-ViTL	mIoU	12.1	# 9
Open Vocabulary Semantic Segmentation	PASCAL Context-459	MAFT-ViTL	mIoU	15.7	# 7
Open Vocabulary Semantic Segmentation	PASCAL Context-59	MAFT-ViTL	mIoU	58.5	# 7
Open Vocabulary Semantic Segmentation	PascalVOC-20	MAFT-ViTL	mIoU	92.1	# 7

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

NeurIPS 2023 · Siyu Jiao, Yunchao Wei, YaoWei Wang, Yao Zhao, Humphrey Shi

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. The issue mainly stems from the fact that CLIP is trained with image-level supervision. To alleviate it, we propose a simple yet effective method named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring that CLIP is responsive to different mask proposals without sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can be seamlessly plugged into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on popular zero-shot benchmarks. With MAFT, the performance of state-of-the-art methods is improved by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The code is available at https://github.com/jiaosiyu1999/MAFT.git.
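
The abstract reports results as mIoU over unseen classes. As a reading aid, here is a minimal sketch of how mean intersection-over-union restricted to a class subset could be computed. It is not taken from the MAFT code base: the function name `mean_iou`, the flattened label-map layout, and the toy data are all illustrative assumptions. Benchmark toolkits typically accumulate intersections and unions over a whole dataset, not over a single image as done here.

```python
from typing import Iterable, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], class_ids: Iterable[int]) -> float:
    """Mean intersection-over-union over the given class ids.

    `pred` and `target` are flattened per-pixel label maps of equal length
    (hypothetical layout chosen for this sketch). Classes that appear in
    neither map are skipped so they do not count as an IoU of 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in class_ids:
        # Pixels where both maps say class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where at least one map says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x3 label maps, flattened row by row; class 2 plays the "unseen" class.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

miou_all = mean_iou(pred, target, class_ids=[0, 1, 2])
miou_unseen = mean_iou(pred, target, class_ids=[2])
```

Passing only the unseen class ids, as in the last line, gives the kind of "mIoU for unseen classes" figure quoted above, while passing every class id gives the overall mIoU.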

Code

jiaosiyu1999/maft official

Tasks

Open Vocabulary Semantic Segmentation

Zero Shot Segmentation

Datasets

ADE20K

PASCAL Context

COCO-Stuff

PASCAL VOC

Results from the Paper

Ranked #7 on Open Vocabulary Semantic Segmentation on PascalVOC-20

Methods

CLIP
