TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	HRViT-b1 (SegFormer, SS)	Validation mIoU	45.88	# 178
Semantic Segmentation	ADE20K	HRViT-b1 (SegFormer, SS)	Params (M)	8.2	# 60
Semantic Segmentation	ADE20K	HRViT-b1 (SegFormer, SS)	GFLOPs (512 x 512)	14.6	# 1
Semantic Segmentation	ADE20K	HRViT-b2 (SegFormer, SS)	Validation mIoU	48.76	# 136
Semantic Segmentation	ADE20K	HRViT-b2 (SegFormer, SS)	Params (M)	20.8	# 56
Semantic Segmentation	ADE20K	HRViT-b2 (SegFormer, SS)	GFLOPs (512 x 512)	28.0	# 3
Semantic Segmentation	ADE20K	HRViT-b3 (SegFormer, SS)	Validation mIoU	50.2	# 110
Semantic Segmentation	ADE20K	HRViT-b3 (SegFormer, SS)	Params (M)	28.7	# 53
Semantic Segmentation	ADE20K	HRViT-b3 (SegFormer, SS)	GFLOPs (512 x 512)	67.9	# 6
Semantic Segmentation	Cityscapes val	HRViT-b3 (SegFormer, SS)	mIoU	83.16%	# 24
Semantic Segmentation	Cityscapes val	HRViT-b1 (SegFormer, SS)	mIoU	81.63%	# 35
Semantic Segmentation	Cityscapes val	HRViT-b2 (SegFormer, SS)	mIoU	82.81%	# 26
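
To make the accuracy/compute trade-off in the ADE20K rows above easier to read, here is a minimal sketch that computes the mIoU gained per additional GFLOP between consecutive HRViT variants. The numbers are copied from the table; the helper name `marginal_gains` and the choice of this particular ratio are illustrative assumptions, not an analysis taken from the paper.

```python
# ADE20K rows copied from the table above:
# (model, validation mIoU, params in millions, GFLOPs at 512 x 512)
ade20k = [
    ("HRViT-b1", 45.88, 8.2, 14.6),
    ("HRViT-b2", 48.76, 20.8, 28.0),
    ("HRViT-b3", 50.2, 28.7, 67.9),
]


def marginal_gains(rows):
    """Hypothetical helper: compare each variant with the next larger one.

    Returns tuples of (smaller model, larger model, mIoU gain,
    extra GFLOPs, mIoU gain per extra GFLOP).
    """
    out = []
    for (m0, iou0, _, f0), (m1, iou1, _, f1) in zip(rows, rows[1:]):
        d_iou = iou1 - iou0
        d_flops = f1 - f0
        out.append(
            (m0, m1, round(d_iou, 2), round(d_flops, 1), round(d_iou / d_flops, 3))
        )
    return out


gains = marginal_gains(ade20k)
for row in gains:
    print(row)
```

On these reported values, the step from b1 to b2 buys noticeably more mIoU per extra GFLOP than the step from b2 to b3. This is plain arithmetic on the listed numbers and says nothing about how the variants compare with other backbones.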

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hrvit-multi-scale-high-resolution-vision/semantic-segmentation-on-cityscapes-val)](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=hrvit-multi-scale-high-resolution-vision)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hrvit-multi-scale-high-resolution-vision/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=hrvit-multi-scale-high-resolution-vision)`

Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

CVPR 2022 · Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, David Z. Pan

Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification and generate single-scale, low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for them. We therefore propose HRViT, which enhances ViTs to learn semantically rich and spatially precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the model performance and efficiency of HRViT through several branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness. These approaches enable HRViT to push the Pareto frontier of performance and efficiency on semantic segmentation to a new level, as our evaluation results on ADE20K and Cityscapes show. HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone for semantic segmentation.
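
The abstract reports its results as mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal pure-Python sketch that computes it from flattened label maps through a confusion matrix. The function names and the toy labels are illustrative assumptions; the actual ADE20K and Cityscapes evaluation protocols also handle details such as ignored labels, which this sketch leaves out.

```python
from typing import List, Sequence


def confusion_matrix(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither the labels nor the predictions
    are skipped so they do not distort the mean.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (label maps flattened to 1-D).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
miou = mean_iou(cm)
print(f"mIoU = {miou:.2%}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Benchmark numbers such as those quoted above are accumulated over the whole validation set before averaging over classes.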

Code

facebookresearch/HRViT official

Tasks

Image Classification

Representation Learning

Segmentation

Semantic Segmentation

Vocal Bursts Intensity Prediction

Datasets

ImageNet

Cityscapes

ADE20K

Results from the Paper

Ranked #24 on Semantic Segmentation on Cityscapes val
