TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	DPT-Hybrid	Validation mIoU	49.02	# 131
Semantic Segmentation	ADE20K val	DPT-Hybrid	mIoU	49.02	# 56
Semantic Segmentation	ADE20K val	DPT-Hybrid	Pixel Accuracy	83.11	# 4
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	absolute relative error	0.062	# 32
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	RMSE	2.573	# 31
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	RMSE log	0.092	# 30
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	Delta < 1.25	0.959	# 31
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	Delta < 1.25^2	0.995	# 27
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	Delta < 1.25^3	0.999	# 10
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	RMSE	0.357	# 39
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	absolute relative error	0.110	# 44
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	Delta < 1.25	0.904	# 40
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	Delta < 1.25^2	0.988	# 27
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	Delta < 1.25^3	0.994	# 44
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	log 10	0.045	# 42
Semantic Segmentation	PASCAL Context	DPT-Hybrid	mIoU	60.46	# 12
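
The segmentation rows above report mIoU and pixel accuracy as percentages. As a rough illustration of how these two metrics are commonly defined, the following sketch computes both from a confusion matrix. It assumes flat lists of integer class ids and skips classes that appear in neither prediction nor ground truth; the benchmark's exact protocol (ignored labels, test-time scaling) is not stated here, and the function names are hypothetical.

```python
def confusion_matrix(pred, target, num_classes):
    """Count (target, prediction) pairs over flat lists of class ids."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1  # rows: ground truth, columns: prediction
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_iou(cm):
    """Average of per-class intersection-over-union values."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # assumption: classes absent everywhere are skipped
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example with three classes and six pixels.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(pred, target, num_classes=3)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
```

Both functions return fractions in [0, 1]; multiply by 100 to compare with the percentage values in the table.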

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformers-for-dense-prediction/semantic-segmentation-on-pascal-context)](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=vision-transformers-for-dense-prediction)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformers-for-dense-prediction/monocular-depth-estimation-on-kitti-eigen)](https://paperswithcode.com/sota/monocular-depth-estimation-on-kitti-eigen?p=vision-transformers-for-dense-prediction)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformers-for-dense-prediction/monocular-depth-estimation-on-nyu-depth-v2)](https://paperswithcode.com/sota/monocular-depth-estimation-on-nyu-depth-v2?p=vision-transformers-for-dense-prediction)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformers-for-dense-prediction/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=vision-transformers-for-dense-prediction)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-transformers-for-dense-prediction/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=vision-transformers-for-dense-prediction)`
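
The badge snippets above all follow one URL pattern that combines a paper slug with a benchmark slug. A minimal sketch of that pattern, inferred only from the five entries listed here (the helper name `pwc_badge` is hypothetical and not part of any official tooling):

```python
def pwc_badge(paper_slug, benchmark_slug):
    """Build badge Markdown following the pattern of the entries above."""
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    link_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


badge = pwc_badge(
    "vision-transformers-for-dense-prediction",
    "semantic-segmentation-on-pascal-context",
)
```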

Vision Transformers for Dense Prediction

ICCV 2021 · René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
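
To make the idea of assembling tokens into image-like representations more concrete, here is a minimal sketch of one possible step: reshaping a flat token sequence into a 2D grid. The abstract does not spell out the details, so everything specific here is an assumption: a single leading readout (class) token that is simply dropped, row-major patch ordering, a patch size of 16, and the hypothetical function name `tokens_to_grid`. Any projection, resampling, or fusion that the actual model performs is left out.

```python
def tokens_to_grid(tokens, image_h, image_w, patch=16, has_readout=True):
    """Arrange patch tokens into a (rows x cols) grid of feature vectors.

    Assumptions (not taken from the abstract): tokens are in row-major
    patch order, optionally preceded by one readout token that is dropped.
    """
    rows, cols = image_h // patch, image_w // patch
    patch_tokens = tokens[1:] if has_readout else tokens
    if len(patch_tokens) != rows * cols:
        raise ValueError(
            f"expected {rows * cols} patch tokens, got {len(patch_tokens)}"
        )
    # Slice the flat sequence into consecutive rows of the grid.
    return [patch_tokens[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy example: a 32x48 image with 16x16 patches gives a 2x3 grid.
# Each "token" is a small feature vector; the first one is the readout token.
tokens = [[0.0, 0.0]] + [[float(i), float(i) * 2] for i in range(6)]
grid = tokens_to_grid(tokens, image_h=32, image_w=48)
```

The resulting grid has one feature vector per patch, so its spatial resolution is the image resolution divided by the patch size; a decoder would then upsample and combine several such grids.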

Code

Repository	Stars	Quickstart
isl-org/DPT (official)	1,852
huggingface/transformers	124,593
intel-isl/MiDaS	4,072	Colab, Spaces, PyTorch Hub
isl-org/MiDaS	4,070	Spaces
kritiksoman/GIMP-ML	1,360

15 implementations are listed in total.

Tasks

Depth Estimation

Monocular Depth Estimation

Semantic Segmentation

Datasets

KITTI

ADE20K

NYUv2

PASCAL Context

WSVD

Results from the Paper

Ranked #12 on Semantic Segmentation on PASCAL Context

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Semantic Segmentation	ADE20K	DPT-Hybrid	Validation mIoU	49.02	# 131
Semantic Segmentation	ADE20K val	DPT-Hybrid	mIoU	49.02	# 56
Semantic Segmentation	ADE20K val	DPT-Hybrid	Pixel Accuracy	83.11	# 4
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	absolute relative error	0.062	# 32
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	RMSE	2.573	# 31
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	RMSE log	0.092	# 30
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	Delta < 1.25	0.959	# 31
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	Delta < 1.25^2	0.995	# 27
Monocular Depth Estimation	KITTI Eigen split	DPT-Hybrid	Delta < 1.25^3	0.999	# 10
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	RMSE	0.357	# 39
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	absolute relative error	0.110	# 44
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	Delta < 1.25	0.904	# 40
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	Delta < 1.25^2	0.988	# 27
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	Delta < 1.25^3	0.994	# 44
Monocular Depth Estimation	NYU-Depth V2	DPT-Hybrid	log 10	0.045	# 42
Semantic Segmentation	PASCAL Context	DPT-Hybrid	mIoU	60.46	# 12
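
For readers unfamiliar with the depth metrics in the table, the sketch below computes them using their commonly used definitions. It assumes flat lists of strictly positive predicted and ground-truth depths at valid pixels. The evaluation protocol behind the numbers above (cropping, depth caps, any alignment of predictions) is not given on this page, so this illustrates the formulas only; the function name `depth_metrics` is hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Common monocular depth metrics over flat lists of positive depths."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and the same length")
    n = len(gt)
    pairs = list(zip(pred, gt))

    metrics = {
        # Mean of |pred - gt| / gt.
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        # Root mean squared error in depth units.
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        # Root mean squared error of natural-log depths.
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
        ),
        # Mean absolute error of base-10 log depths.
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
    }

    # Threshold accuracy: share of pixels with max(pred/gt, gt/pred) < 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    for k in (1, 2, 3):
        metrics[f"delta_{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics


# Toy example: one exact prediction and one that is 50% too large.
example = depth_metrics(pred=[1.0, 1.5], gt=[1.0, 1.0])
```

Lower is better for the error metrics (absolute relative error, RMSE, RMSE log, log 10), while higher is better for the three Delta threshold accuracies.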

Methods

Convolution • Dense Connections • Dot-Product Attention • DPT • Layer Normalization • Linear Layer • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Vision Transformer
