TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

30 Mar 2024 · Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, KyungSu Kim

We identify a critical bias in contemporary CLIP-based models, which we denote as "single tag bias". This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to an imbalanced tag relevancy. This results in an uneven alignment among multiple tags present in the text. To tackle this challenge, we introduce a novel two-step fine-tuning approach. First, our method leverages the similarity between tags and their nearest pixels for scoring, enabling the extraction of image-relevant tags from the text. Second, we present a self-distillation strategy aimed at aligning the combined masks from extracted tags with the text-derived mask. This approach mitigates the single tag bias, thereby significantly improving the alignment of CLIP's model without necessitating additional data or supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. Code is available at https://github.com/shjo-april/TTD.
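The two steps described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that). The function names, the cosine-similarity scoring, the 0.5 threshold, the element-wise max used to combine tag masks, and the L1 distance are all assumptions chosen for illustration, and the embeddings are hand-written 2-D vectors rather than CLIP outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_tags(tag_embeddings, pixel_embeddings):
    """Step 1 (assumed form): score each tag by its most similar pixel."""
    return {
        tag: max(cosine(emb, px) for px in pixel_embeddings)
        for tag, emb in tag_embeddings.items()
    }


def select_tags(scores, threshold=0.5):
    """Keep tags whose nearest-pixel score reaches a hypothetical threshold."""
    return [tag for tag, score in scores.items() if score >= threshold]


def combine_masks(masks):
    """Step 2a (assumed form): merge per-tag masks with an element-wise max."""
    return [max(values) for values in zip(*masks)]


def l1_distance(mask_a, mask_b):
    """Step 2b (assumed form): mean absolute difference between two masks."""
    return sum(abs(a - b) for a, b in zip(mask_a, mask_b)) / len(mask_a)


# Toy data: two "pixels" and three candidate tags in a 2-D embedding space.
pixels = [[1.0, 0.0], [0.0, 1.0]]
tags = {"dog": [2.0, 0.0], "grass": [0.0, 3.0], "car": [-1.0, -1.0]}

scores = score_tags(tags, pixels)
selected = select_tags(scores)  # "car" matches no pixel and is dropped

# Toy flattened masks (one value per pixel location) for the selected tags.
tag_masks = {"dog": [0.9, 0.1, 0.0], "grass": [0.0, 0.2, 0.8]}
text_mask = [0.9, 0.0, 0.1]  # mask from the full text, dominated by one tag

combined = combine_masks([tag_masks[t] for t in selected])
loss = l1_distance(combined, text_mask)
```

In this toy setup the text-derived mask covers only the "dog" region, so the distance to the combined tag mask is large. Reducing such a gap is the intuition behind aligning the text-derived mask with the combined tag masks.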


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TTD (MaskCLIP) | Mean IoU (val) | 12.7 | #5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TTD (TCL) | Mean IoU (val) | 17.0 | #3 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | TTD (TCL) | mIoU | 17.0 | #15 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | TTD (MaskCLIP) | mIoU | 12.7 | #16 |
| Semantic Segmentation | CC3M-TagMask | TTD (TCL) | mIoU | 65.5 | #1 |
| Semantic Segmentation | CC3M-TagMask | TTD (MaskCLIP) | mIoU | 50.2 | #3 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | F1 | 78.5 | #2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Precision | 82.9 | #2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Recall | 74.5 | #3 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | mAP | 90.3 | #2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Accuracy | 91.0 | #1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | F1 | 82.8 | #1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Precision | 88.3 | #1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Recall | 78.0 | #2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | mAP | 93.7 | #1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Accuracy | 88.6 | #2 |
| Open Vocabulary Semantic Segmentation | Cityscapes | TTD (MaskCLIP) | mIoU | 27.0 | #5 |
| Open Vocabulary Semantic Segmentation | Cityscapes | TTD (TCL) | mIoU | 32.0 | #3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TTD (MaskCLIP) | mIoU | 32.0 | #1 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TTD (TCL) | mIoU | 27.0 | #3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TTD (MaskCLIP) | mIoU | 26.5 | #6 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TTD (TCL) | mIoU | 37.4 | #1 |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | TTD (TCL) | mIoU | 23.7 | #1 |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | TTD (MaskCLIP) | mIoU | 19.4 | #3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TTD (MaskCLIP) | mIoU | 19.4 | #4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TTD (TCL) | mIoU | 23.7 | #2 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TTD (MaskCLIP) | mIoU | 31.0 | #4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TTD (TCL) | mIoU | 37.4 | #2 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TTD (TCL) | mIoU | 37.4 | #15 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TTD (MaskCLIP) | mIoU | 31.0 | #17 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TTD (MaskCLIP) | mIoU | 43.1 | #5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TTD (TCL) | mIoU | 61.1 | #2 |
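
Most segmentation results above are reported as mIoU (mean intersection over union). For readers unfamiliar with the metric, the following minimal sketch computes it on a toy flattened label map. The function name and the toy labels are illustrative assumptions. Benchmark evaluations typically accumulate intersections and unions over a whole dataset and may handle ignored labels differently, so this does not reproduce the exact protocol behind the reported numbers.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps with three classes (0, 1, 2).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

miou = mean_iou(pred, target, num_classes=3)
miou_percent = 100.0 * miou  # leaderboards usually report percentages
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5, which would be reported as 50.0.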

Methods