VLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic Segmentation

Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNNs), where domain shifts occur due to changes in lighting, weather, or geolocation. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is trained solely on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP in a transfer learning setting, we find that vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training for DG. We thus propose a new vision-language approach for domain-generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities, indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.
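
The core recipe described above, replacing a vision-only backbone with a CLIP vision encoder and attaching a segmentation decode head, can be illustrated with a short PyTorch sketch. This is not the authors' released VLTSeg implementation: the checkpoint name (openai/clip-vit-base-patch32 via Hugging Face transformers), the 1x1-convolution decode head, the 224x224 input size, and the 19-class (Cityscapes-style) output are assumptions chosen for brevity; a real setup would pair the vision-language encoder with a stronger decoder and higher-resolution inputs.

```python
# Minimal sketch (assumptions noted above, not the VLTSeg code):
# use a CLIP vision encoder as the backbone of a semantic segmentation model.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPVisionModel  # Hugging Face transformers


class ClipSegBaseline(nn.Module):
    """CLIP vision encoder + lightweight decode head (illustrative only)."""

    def __init__(self, num_classes: int, clip_name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)  # vision tower only
        hidden = self.encoder.config.hidden_size     # 768 for ViT-B/32
        self.patch = self.encoder.config.patch_size  # 32 for ViT-B/32
        self.head = nn.Conv2d(hidden, num_classes, kernel_size=1)  # toy decode head

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        b, _, h, w = pixel_values.shape
        tokens = self.encoder(pixel_values=pixel_values).last_hidden_state
        patch_tokens = tokens[:, 1:, :]               # drop the CLS token
        gh, gw = h // self.patch, w // self.patch     # patch-grid resolution
        feat = patch_tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        logits = self.head(feat)                      # per-patch class logits
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = ClipSegBaseline(num_classes=19)   # 19 classes as in Cityscapes (assumed)
    x = torch.randn(1, 3, 224, 224)           # CLIP ViT-B/32 expects 224x224 by default
    print(model(x).shape)                     # torch.Size([1, 19, 224, 224])
```

Fine-tuning such a model on source-domain labels only (e.g., GTA5 or Cityscapes) and evaluating on unseen target domains corresponds to the transfer-learning setting studied in the paper.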

Results from the Paper


Ranked #1 on Semantic Segmentation on Cityscapes test (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|------|---------|-------|-------------|--------------|-------------|--------------------------|
| Semantic Segmentation | Cityscapes test | VLTSeg | Mean IoU (class) | 86.4 | #1 | Yes |
| Domain Generalization | GTA5-to-Cityscapes | VLTSeg | mIoU | 65.6 | #1 | |
| Domain Generalization | GTA-to-Avg(Cityscapes,BDD,Mapillary) | VLTSeg | mIoU | 63.5 | #2 | |
