TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Object	CLS-SEG	mIoU	35.3	# 2
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Stuff-27	CLS-SEG	mIoU	31.0	# 2
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL VOC	CLS-SEG	mIoU	68.7	# 1

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

20 Dec 2023 · Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained with a contrastive loss to capture global features that distinguish different text descriptions, making it highly effective for single-label classification. However, it performs poorly on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification results rely heavily on discriminative local features, which CLIP overlooks. We therefore dissect how patch-wise spatial information is preserved in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is based solely on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. In addition, to comprehensively assess the quality and practicality of the generated tags, we extend their application to a downstream task, namely weakly supervised semantic segmentation (WSSS) with the generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
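
The contrast the abstract draws between a single global feature and patch-level classification can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the max-over-patches aggregation, the temperature of 0.01, the 0.5 threshold, and the toy one-hot "embeddings" are all assumptions made for illustration, and the DMAR and CWR steps are not modeled at all.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length for cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_probs_from_patches(patch_feats, text_feats, temperature=0.01):
    # Hypothetical stand-in for a global feature: the mean of patch features.
    g = l2_normalize(patch_feats.mean(axis=0))
    t = l2_normalize(text_feats)
    return softmax(t @ g / temperature)

def patch_level_tags(patch_feats, text_feats, temperature=0.01, threshold=0.5):
    # Coarse step only: classify every patch, then keep each class's best patch score.
    p = l2_normalize(patch_feats)           # (N, D)
    t = l2_normalize(text_feats)            # (C, D)
    patch_scores = softmax(p @ t.T / temperature, axis=1)  # (N, C)
    class_scores = patch_scores.max(axis=0)                # (C,)
    tags = np.flatnonzero(class_scores >= threshold)
    return class_scores, tags

# Toy data: 3 classes with orthogonal text features; 6 patches show class 0,
# 2 patches show class 1, and class 2 is absent from the image.
text_feats = np.eye(3)
patch_feats = np.array([[1.0, 0.0, 0.0]] * 6 + [[0.0, 1.0, 0.0]] * 2)

global_probs = global_probs_from_patches(patch_feats, text_feats)
class_scores, tags = patch_level_tags(patch_feats, text_feats)
print("global softmax:", np.round(global_probs, 3))
print("patch-level tags:", tags.tolist())
```

In this toy setup the global softmax puts almost all of its mass on class 0, the dominant class, while the patch-level route recovers both class 0 and class 1. Real CLIP patch features are far noisier than these one-hot vectors, which is presumably why the paper follows the coarse scores with refinement steps.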

Code

linyq2117/tagclip official

Tasks

Classification

Multi-Label Classification

Semantic Segmentation

Unsupervised Semantic Segmentation with Language-image Pre-training

Weakly supervised Semantic Segmentation

Weakly-Supervised Semantic Segmentation

Datasets

MS COCO

COCO-Stuff

PASCAL VOC

Results from the Paper

Ranked #1 on Unsupervised Semantic Segmentation with Language-image Pre-training on PASCAL VOC

Methods

CLIP • Softmax
