DaViT: Dual Attention Vision Transformers

7 Apr 2022  ·  Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that captures global context while maintaining computational efficiency. We approach the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain linear complexity for the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show that DaViT achieves state-of-the-art performance on four different tasks with efficient computation. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
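To make the channel-token idea concrete, below is a minimal PyTorch sketch of channel group attention as the abstract describes it: channels act as tokens, spatial positions act as each token's features, and the channels are split into groups so the cost stays linear in the number of spatial positions. This is an illustration of the idea, not the reference implementation from the linked repository; the class name, the `groups` parameter, and the scaling choice are assumptions.

```python
# Sketch of DaViT-style channel group attention (assumption: written from the
# abstract's description, not copied from https://github.com/dingmyu/davit).
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Self-attention along the channel axis.

    Channels are the tokens; the spatial dimension is each token's feature
    dimension. Splitting channels into `groups` keeps the attention map at
    (C/groups x C/groups) per group, linear in the number of spatial positions.
    """

    def __init__(self, dim: int, groups: int = 8):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by groups"
        self.groups = groups
        self.scale = (dim // groups) ** -0.5  # assumed scaling choice
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N spatial positions and C channels.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups)
        # -> (3, B, groups, C//groups, N): channels become the token axis,
        # spatial positions the feature axis.
        qkv = qkv.permute(2, 0, 3, 4, 1)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Channel-to-channel attention map; every spatial position contributes
        # to each score, which is what makes this attention global.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = attn @ v                                  # (B, groups, C//groups, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)  # back to (B, N, C)
        return self.proj(out)


# Smoke test: a 14x14 feature map with 96 channels.
x = torch.randn(2, 196, 96)
print(ChannelGroupAttention(dim=96)(x).shape)  # torch.Size([2, 196, 96])
```

The spatial branch is complementary: self-attention over spatial tokens grouped into local windows, so that it also stays linear in image size while the channel branch supplies the global interactions.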

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | DaViT-T | Validation mIoU | 46.3 | #169 |
| Semantic Segmentation | ADE20K | DaViT-B | Validation mIoU | 49.4 | #125 |
| Semantic Segmentation | ADE20K val | DaViT-B (UperNet) | mIoU | 46.3 | #66 |
| Semantic Segmentation | ADE20K val | DaViT-S (UperNet) | mIoU | 48.8 | #57 |
| Image Classification | ImageNet | DaViT-T | Top 1 Accuracy | 82.8% | #453 |
| Image Classification | ImageNet | DaViT-T | Number of params | 28.3M | #639 |
| Image Classification | ImageNet | DaViT-S | Number of params | 49.7M | #722 |
| Image Classification | ImageNet | DaViT-H | Top 1 Accuracy | 90.2% | #13 |
| Image Classification | ImageNet | DaViT-H | Number of params | 362M | #925 |
| Image Classification | ImageNet | DaViT-H | GFLOPs | 334 | #477 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Top 1 Accuracy | 87.5% | #86 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | Number of params | 196.8M | #896 |
| Image Classification | ImageNet | DaViT-L (ImageNet-22k) | GFLOPs | 103 | #451 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Top 1 Accuracy | 86.9% | #115 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | Number of params | 87.9M | #830 |
| Image Classification | ImageNet | DaViT-B (ImageNet-22k) | GFLOPs | 46.4 | #417 |
| Image Classification | ImageNet | DaViT-B | Top 1 Accuracy | 84.6% | #288 |
| Image Classification | ImageNet | DaViT-B | Number of params | 87.9M | #830 |
| Image Classification | ImageNet | DaViT-B | GFLOPs | 15.5 | #341 |
| Image Classification | ImageNet | DaViT-S | GFLOPs | 8.8 | – |
| Image Classification | ImageNet | DaViT-S | Top 1 Accuracy | 84.2% | – |
| Image Classification | ImageNet | DaViT-T | GFLOPs | 4.5 | – |
| Image Classification | ImageNet | DaViT-G | Top 1 Accuracy | 90.4% | #12 |
| Image Classification | ImageNet | DaViT-G | Number of params | 1437M | #958 |
| Image Classification | ImageNet | DaViT-G | GFLOPs | 1038 | #489 |
| Object Detection | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | box AP | 49.9 | #1 |
| Instance Segmentation | COCO minival | DaViT-T (Mask R-CNN, 36 epochs) | mask AP | 44.3 | #1 |
