TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Unsupervised Semantic Segmentation with Language-image Pre-training	ADE20K	TTD (MaskCLIP)	Mean IoU (val)	12.7	# 5
Unsupervised Semantic Segmentation with Language-image Pre-training	ADE20K	TTD (TCL)	Mean IoU (val)	17.0	# 3
Open Vocabulary Semantic Segmentation	ADE20K-150	TTD (TCL)	mIoU	17.0	# 15
Open Vocabulary Semantic Segmentation	ADE20K-150	TTD (MaskCLIP)	mIoU	12.7	# 16
Semantic Segmentation	CC3M-TagMask	TTD (TCL)	mIoU	65.5	# 1
Semantic Segmentation	CC3M-TagMask	TTD (MaskCLIP)	mIoU	50.2	# 3
Multi-Label Text Classification	CC3M-TagMask	TTD (w/o fine-tuning)	F1	78.5	# 2
Multi-Label Text Classification	CC3M-TagMask	TTD (w/o fine-tuning)	Precision	82.9	# 2
Multi-Label Text Classification	CC3M-TagMask	TTD (w/o fine-tuning)	Recall	74.5	# 3
Multi-Label Text Classification	CC3M-TagMask	TTD (w/o fine-tuning)	mAP	90.3	# 2
Multi-Label Text Classification	CC3M-TagMask	TTD (w/o fine-tuning)	Accuracy	91.0	# 1
Multi-Label Text Classification	CC3M-TagMask	TTD (w/ fine-tuning)	F1	82.8	# 1
Multi-Label Text Classification	CC3M-TagMask	TTD (w/ fine-tuning)	Precision	88.3	# 1
Multi-Label Text Classification	CC3M-TagMask	TTD (w/ fine-tuning)	Recall	78.0	# 2
Multi-Label Text Classification	CC3M-TagMask	TTD (w/ fine-tuning)	mAP	93.7	# 1
Multi-Label Text Classification	CC3M-TagMask	TTD (w/ fine-tuning)	Accuracy	88.6	# 2
Open Vocabulary Semantic Segmentation	Cityscapes	TTD (MaskCLIP)	mIoU	27.0	# 5
Open Vocabulary Semantic Segmentation	Cityscapes	TTD (TCL)	mIoU	32.0	# 3
Unsupervised Semantic Segmentation with Language-image Pre-training	Cityscapes val	TTD (MaskCLIP)	mIoU	32.0	# 1
Unsupervised Semantic Segmentation with Language-image Pre-training	Cityscapes val	TTD (TCL)	mIoU	27.0	# 3
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Object	TTD (MaskCLIP)	mIoU	26.5	# 6
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Object	TTD (TCL)	mIoU	37.4	# 1
Open Vocabulary Semantic Segmentation	COCO-Stuff-171	TTD (TCL)	mIoU	23.7	# 1
Open Vocabulary Semantic Segmentation	COCO-Stuff-171	TTD (MaskCLIP)	mIoU	19.4	# 3
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Stuff-171	TTD (MaskCLIP)	mIoU	19.4	# 4
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Stuff-171	TTD (TCL)	mIoU	23.7	# 2
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL Context-59	TTD (MaskCLIP)	mIoU	31.0	# 4
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL Context-59	TTD (TCL)	mIoU	37.4	# 2
Open Vocabulary Semantic Segmentation	PASCAL Context-59	TTD (TCL)	mIoU	37.4	# 15
Open Vocabulary Semantic Segmentation	PASCAL Context-59	TTD (MaskCLIP)	mIoU	31.0	# 17
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL VOC	TTD (MaskCLIP)	mIoU	43.1	# 5
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL VOC	TTD (TCL)	mIoU	61.1	# 2
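
As a quick sanity check on the Multi-Label Text Classification rows above, the sketch below recomputes F1 from the listed precision and recall. It assumes the reported F1 is the plain harmonic mean of the reported precision and recall, which this page does not state; the function name `f1_from_precision_recall` is made up for illustration. Under that assumption, both rows reproduce the reported F1 to one decimal place.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the CC3M-TagMask rows, in percent.
reported = {
    "TTD (w/o fine-tuning)": (82.9, 74.5, 78.5),
    "TTD (w/ fine-tuning)": (88.3, 78.0, 82.8),
}

for model, (precision, recall, f1) in reported.items():
    recomputed = round(f1_from_precision_recall(precision, recall), 1)
    print(f"{model}: recomputed F1 = {recomputed}, reported F1 = {f1}")
```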

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/semantic-segmentation-on-cc3m-tagmask)](https://paperswithcode.com/sota/semantic-segmentation-on-cc3m-tagmask?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/multi-label-text-classification-on-cc3m)](https://paperswithcode.com/sota/multi-label-text-classification-on-cc3m?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/unsupervised-semantic-segmentation-with-3)](https://paperswithcode.com/sota/unsupervised-semantic-segmentation-with-3?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/unsupervised-semantic-segmentation-with-10)](https://paperswithcode.com/sota/unsupervised-semantic-segmentation-with-10?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/open-vocabulary-semantic-segmentation-on-coco)](https://paperswithcode.com/sota/open-vocabulary-semantic-segmentation-on-coco?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/unsupervised-semantic-segmentation-with-9)](https://paperswithcode.com/sota/unsupervised-semantic-segmentation-with-9?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/unsupervised-semantic-segmentation-with-8)](https://paperswithcode.com/sota/unsupervised-semantic-segmentation-with-8?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/unsupervised-semantic-segmentation-with-11)](https://paperswithcode.com/sota/unsupervised-semantic-segmentation-with-11?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/unsupervised-semantic-segmentation-with-4)](https://paperswithcode.com/sota/unsupervised-semantic-segmentation-with-4?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/open-vocabulary-semantic-segmentation-on)](https://paperswithcode.com/sota/open-vocabulary-semantic-segmentation-on?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/open-vocabulary-semantic-segmentation-on-2)](https://paperswithcode.com/sota/open-vocabulary-semantic-segmentation-on-2?p=ttd-text-tag-self-distillation-enhancing)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ttd-text-tag-self-distillation-enhancing/open-vocabulary-semantic-segmentation-on-1)](https://paperswithcode.com/sota/open-vocabulary-semantic-segmentation-on-1?p=ttd-text-tag-self-distillation-enhancing)`

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

30 Mar 2024 · Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, KyungSu Kim

We identify a critical bias in contemporary CLIP-based models, which we denote as "single tag bias". This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to an imbalanced tag relevancy. This results in an uneven alignment among multiple tags present in the text. To tackle this challenge, we introduce a novel two-step fine-tuning approach. First, our method leverages the similarity between tags and their nearest pixels for scoring, enabling the extraction of image-relevant tags from the text. Second, we present a self-distillation strategy aimed at aligning the combined masks from extracted tags with the text-derived mask. This approach mitigates the single tag bias, thereby significantly improving CLIP's image-text alignment without necessitating additional data or supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. Code is available at https://github.com/shjo-april/TTD.
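
The two steps in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the official repository for that); it is a toy under stated assumptions. Specifically, the "nearest pixel" score is taken to be the maximum cosine similarity over pixels, tags are kept with a hypothetical fixed `threshold`, the combined tag mask is a pixel-wise maximum, and the distillation objective is a mean absolute difference. All function names are invented for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def tag_pixel_similarity(pixel_emb, tag_emb):
    """Cosine similarity between every tag and every pixel.

    pixel_emb: (P, D) pixel embeddings, tag_emb: (T, D) tag embeddings.
    Returns an array of shape (T, P).
    """
    return l2_normalize(tag_emb) @ l2_normalize(pixel_emb).T


def select_relevant_tags(sim, threshold=0.5):
    """Step 1 (assumed form): score each tag by its most similar pixel
    and keep tags whose score reaches the hypothetical `threshold`."""
    scores = sim.max(axis=1)
    return np.flatnonzero(scores >= threshold), scores


def self_distillation_loss(text_map, tag_maps):
    """Step 2 (assumed form): mean absolute gap between the text-derived
    map (P,) and the pixel-wise maximum of the selected tag maps (K, P)."""
    combined = tag_maps.max(axis=0)
    return float(np.abs(text_map - combined).mean())


# Toy data: three pixels and two tags in a 2-D embedding space.
pixels = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
tags = np.array([[1.0, 0.0], [-1.0, -1.0]])

sim = tag_pixel_similarity(pixels, tags)
kept, scores = select_relevant_tags(sim)
print("kept tag indices:", kept.tolist())
print("per-tag scores:", scores.round(3).tolist())
```

In this toy setup the second tag points away from every pixel, so only the first tag survives the threshold. A real pipeline would take the pixel and tag embeddings from CLIP's image and text encoders rather than hand-written arrays.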

Code

shjo-april/TTD official

Tasks

Multi-Label Text Classification

Open Vocabulary Semantic Segmentation

Semantic Segmentation

TAG

Unsupervised Semantic Segmentation with Language-image Pre-training

Datasets

Introduced in the Paper:

CC3M-TagMask

Used in the Paper:

Cityscapes

ADE20K

PASCAL Context

COCO-Stuff

PASCAL VOC

Results from the Paper

Ranked #1 on Open Vocabulary Semantic Segmentation on COCO-Stuff-171 (mIoU metric)

Methods

Focus
