TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

21 Dec 2023 · Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align.
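The idea of complementing the contrastive loss with a multi-tag classification loss can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy tag vocabulary, the word-matching `parse_tags` helper (standing in for the automatic parsing pipeline), the use of binary cross-entropy per tag, the symmetric InfoNCE form of the contrastive term, and the `tag_weight` parameter.

```python
import math

# Hypothetical toy vocabulary; a real pipeline would derive tags automatically from captions.
OBJECT_TAGS = ["cat", "dog", "sofa"]
ATTRIBUTE_TAGS = ["black", "white", "small"]
TAGS = OBJECT_TAGS + ATTRIBUTE_TAGS


def parse_tags(caption):
    """Return a multi-hot target: 1.0 for every vocabulary tag that appears in the caption."""
    words = set(caption.lower().replace(".", " ").replace(",", " ").split())
    return [1.0 if tag in words else 0.0 for tag in TAGS]


def multi_tag_loss(logits, targets):
    """Mean binary cross-entropy with logits: one independent yes/no decision per tag."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def contrastive_loss(sim):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs lie on the diagonal.
    """
    n = len(sim)

    def cross_entropy(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
            loss += log_sum_exp - row[i]
        return loss / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))


def total_loss(sim, tag_logits, tag_targets, tag_weight=1.0):
    """Contrastive term plus a weighted multi-tag term averaged over the batch.

    tag_weight is an assumed hyperparameter, not a value from the paper.
    """
    tag_term = sum(
        multi_tag_loss(logits, targets)
        for logits, targets in zip(tag_logits, tag_targets)
    ) / len(tag_logits)
    return contrastive_loss(sim) + tag_weight * tag_term


if __name__ == "__main__":
    captions = ["A black cat on the sofa.", "A small white dog."]
    targets = [parse_tags(c) for c in captions]
    sim = [[2.0, 0.1], [0.3, 1.5]]              # made-up image-text similarities
    tag_logits = [[1.0, -1.0, 0.5, 2.0, -2.0, -1.0],
                  [-1.0, 2.0, -1.5, -2.0, 1.0, 0.5]]  # made-up image-tag scores
    print(total_loss(sim, tag_logits, targets))
```

In this sketch the tag targets come for free from the caption, so the extra supervision needs no annotation beyond the image-text pair itself, which mirrors the scalability argument in the abstract.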

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TagAlign | Mean IoU (val) | 17.3 | #1 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TagAlign | mIoU | 27.5 | #2 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TagAlign | mIoU | 33.3 | #3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TagAlign | mIoU | 25.3 | #1 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TagAlign (trained with image-text pairs) | mIoU | 37.6 | #14 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TagAlign | mIoU | 37.6 | #1 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TagAlign | mIoU | 53.9 | #4 |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | TagAlign (trained with image-text pairs) | mIoU | 87.9 | #10 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PascalVOC-20 | TagAlign | mIoU | 87.9 | #1 |
