SegViT: Semantic Segmentation with Plain Vision Transformers

12 Oct 2022 · BoWen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, Yifan Liu

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. In contrast, we make use of a fundamental component of ViTs, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ of the computation while maintaining competitive performance.
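The core idea of the ATM module is that the cross-attention similarity between learnable class tokens and patch features can serve double duty: softmax-normalized it updates the class tokens, and sigmoid-activated it directly yields per-class masks. Below is a minimal PyTorch sketch of that idea under several simplifying assumptions (single-head attention, a single decoder layer, an illustrative `ATMHead` class with hypothetical projection names); it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ATMHead(nn.Module):
    """Illustrative Attention-to-Mask style head (assumption-laden sketch).

    Learnable class tokens attend to the spatial ViT patch features; the
    token-to-patch similarity map is reused as a per-class mask, while the
    updated tokens produce per-class confidences.
    """

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.cls_proj = nn.Linear(dim, 1)  # per-token class confidence
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor):
        # feats: (B, N, dim) patch tokens from a plain ViT backbone
        B = feats.shape[0]
        q = self.to_q(self.class_tokens).unsqueeze(0).expand(B, -1, -1)  # (B, C, dim)
        k = self.to_k(feats)                                             # (B, N, dim)
        v = self.to_v(feats)

        # Token-to-patch similarity, used twice:
        sim = torch.einsum("bcd,bnd->bcn", q, k) * self.scale
        masks = torch.sigmoid(sim)                       # (B, C, N): similarity -> masks
        attn = sim.softmax(dim=-1)                       # attention over patches
        updated = torch.einsum("bcn,bnd->bcd", attn, v)  # updated class tokens
        cls_logits = self.cls_proj(updated).squeeze(-1)  # (B, C) class confidences
        return masks, cls_logits
```

In a full model, the `(B, C, N)` masks would be reshaped to the patch grid and upsampled to the input resolution, and the final segmentation would combine the masks with the class confidences; the paper additionally applies the head at multiple ViT layers, which this sketch omits.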

| Task                  | Dataset         | Model               | Metric   | Value | Global Rank |
|-----------------------|-----------------|---------------------|----------|-------|-------------|
| Semantic Segmentation | ADE20K val      | SegViT (ViT-Large)  | mIoU (%) | 55.2  | #29         |
| Semantic Segmentation | COCO-Stuff test | SegViT (ours)       | mIoU (%) | 50.3  | #4          |
| Semantic Segmentation | PASCAL Context  | SegViT (ours)       | mIoU (%) | 65.3  | #7          |
