TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Unsupervised Semantic Segmentation with Language-image Pre-training	ADE20K	TagAlign	Mean IoU (val)	17.3	# 1
Unsupervised Semantic Segmentation with Language-image Pre-training	Cityscapes val	TagAlign	mIoU	27.5	# 2
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Object	TagAlign	mIoU	33.3	# 3
Unsupervised Semantic Segmentation with Language-image Pre-training	COCO-Stuff-171	TagAlign	mIoU	25.3	# 1
Open Vocabulary Semantic Segmentation	PASCAL Context-59	TagAlign(trained with image-text pairs)	mIoU	37.6	# 14
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL Context-59	TagAlign	mIoU	37.6	# 1
Unsupervised Semantic Segmentation with Language-image Pre-training	PASCAL VOC	TagAlign	mIoU	53.9	# 4
Open Vocabulary Semantic Segmentation	PascalVOC-20	TagAlign(trained with image-text pairs)	mIoU	87.9	# 10
Unsupervised Semantic Segmentation with Language-image Pre-training	PascalVOC-20	TagAlign	mIoU	87.9	# 1

TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

21 Dec 2023 · Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align.
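
The idea in the abstract, parsing tags from the caption and adding a multi-tag classification term to the contrastive objective, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tiny `TAG_VOCAB`, the naive word-matching `parse_tags`, the binary cross-entropy form of `multi_tag_loss`, the temperature of 0.07, and the weighting factor `tag_weight`. The abstract only states that objects and attributes are parsed automatically and used as supervision alongside the image-text contrastive loss.

```python
import math

# Hypothetical tag vocabulary (objects followed by attributes); illustrative only.
OBJECT_TAGS = ["cat", "dog", "sofa"]
ATTRIBUTE_TAGS = ["black", "white", "small"]
TAG_VOCAB = OBJECT_TAGS + ATTRIBUTE_TAGS


def parse_tags(caption):
    """Return a multi-hot target over TAG_VOCAB using naive word matching.

    This stands in for the automatic parsing pipeline, whose details are
    not given in the abstract.
    """
    words = set(caption.lower().replace(",", " ").replace(".", " ").split())
    return [1.0 if tag in words else 0.0 for tag in TAG_VOCAB]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]

    def mean_cross_entropy(rows):
        # The matching pair for row i sits on the diagonal (column i).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


def multi_tag_loss(image_emb, tag_embs, targets, temperature=0.07):
    """Binary cross-entropy between image-tag similarities and multi-hot targets.

    The BCE form is an assumption; the abstract only names a
    "multi-tag classification loss".
    """
    img = normalize(image_emb)
    loss = 0.0
    for tag_emb, y in zip(tag_embs, targets):
        logit = dot(img, normalize(tag_emb)) / temperature
        # Numerically stable BCE with logits.
        loss += max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))
    return loss / len(targets)


def total_loss(contrastive, tag, tag_weight=1.0):
    """Combine both terms; tag_weight is a hypothetical balancing factor."""
    return contrastive + tag_weight * tag
```

With the vocabulary above, `parse_tags("A black cat sleeping on the sofa.")` marks `cat`, `sofa`, and `black` as positives and leaves the other tags as negatives. In a real setup the tag embeddings would presumably come from the text encoder, so the same image features are supervised by both terms.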

Code

Qinying-Liu/TagAlign official

Tasks

Attribute

Open Vocabulary Semantic Segmentation

Semantic Segmentation

TAG

Unsupervised Semantic Segmentation with Language-image Pre-training

Datasets

Cityscapes

ADE20K

PASCAL Context

COCO-Stuff

PASCAL VOC

ImageNet-S

Results from the Paper

Ranked #1 on Unsupervised Semantic Segmentation with Language-image Pre-training on COCO-Stuff-171
