TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	LocalViT-TNT	Top 1 Accuracy	75.9%	# 857
Image Classification	ImageNet	LocalViT-TNT	Number of params	6.3M	# 441
Image Classification	ImageNet	LocalViT-TNT	GFLOPs	1.4	# 128
Image Classification	ImageNet	LocalViT-T2T	Top 1 Accuracy	72.5%	# 923
Image Classification	ImageNet	LocalViT-T2T	Number of params	4.3M	# 387
Image Classification	ImageNet	LocalViT-T2T	GFLOPs	1.2	# 114
Image Classification	ImageNet	LocalViT-T	Top 1 Accuracy	74.8%	# 896
Image Classification	ImageNet	LocalViT-T	Number of params	5.9M	# 434
Image Classification	ImageNet	LocalViT-T	GFLOPs	1.3	# 118
Image Classification	ImageNet	LocalViT-PVT	Top 1 Accuracy	78.2%	# 778
Image Classification	ImageNet	LocalViT-PVT	Number of params	13.5M	# 508
Image Classification	ImageNet	LocalViT-PVT	GFLOPs	4.8	# 226
Image Classification	ImageNet	LocalViT-S	Top 1 Accuracy	80.8%	# 623
Image Classification	ImageNet	LocalViT-S	Number of params	22.4M	# 568
Image Classification	ImageNet	LocalViT-S	GFLOPs	4.6	# 215

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/localvit-bringing-locality-to-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=localvit-bringing-locality-to-vision)`

LocalViT: Bringing Locality to Vision Transformers

12 Apr 2021 · Yawei Li, Kai Zhang, JieZhang Cao, Radu Timofte, Luc van Gool

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be modelled well by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet locality is essential for images, since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and all proper choices lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to 4 vision transformers, which shows that the locality concept generalizes. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by 2.6% and 3.1%, with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.
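
The idea of placing a depth-wise convolution inside the feed-forward network can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the official repository for that). It assumes a square grid of image tokens with no class token, a 3x3 kernel with zero padding, ReLU as a stand-in activation, and no residual connection; the names `depthwise_conv3x3`, `linear`, and `local_feed_forward` are hypothetical and chosen for illustration only.

```python
import math


def depthwise_conv3x3(feature_map, kernels):
    """3x3 depth-wise convolution with zero padding.

    feature_map: H x W x C nested lists.
    kernels: C kernels, each 3 x 3; channel c is filtered only by kernels[c].
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(kernels)
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Positions outside the grid count as zero.
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += kernels[c][dy + 1][dx + 1] * feature_map[yy][xx][c]
                out[y][x][c] = acc
    return out


def linear(tokens, weight):
    """Token-wise linear map. tokens: N x C_in, weight: C_in x C_out."""
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok)))
         for j in range(len(weight[0]))]
        for tok in tokens
    ]


def local_feed_forward(tokens, w_expand, dw_kernels, w_project):
    """Feed-forward network with a depth-wise convolution in the middle.

    Steps: expand channels per token, reshape the sequence to a 2D grid,
    mix each channel with its spatial neighbours, flatten, project back.
    """
    n = len(tokens)
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes a square grid of image tokens")

    def relu(v):  # stand-in activation (assumption)
        return max(v, 0.0)

    hidden = [[relu(v) for v in tok] for tok in linear(tokens, w_expand)]
    # Sequence of N tokens -> side x side grid (row-major order).
    grid = [hidden[r * side:(r + 1) * side] for r in range(side)]
    grid = depthwise_conv3x3(grid, dw_kernels)
    # Grid -> sequence again, followed by the activation.
    hidden = [[relu(v) for v in pixel] for row in grid for pixel in row]
    return linear(hidden, w_project)
```

With a kernel that is 1 at the centre and 0 elsewhere, the convolution passes each token through unchanged and the block reduces to an ordinary token-wise feed-forward network. Any non-zero off-centre weight lets a token receive information from its spatial neighbours, which is the locality the abstract refers to.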

Code

ofsoundof/LocalViT official

rishikksh20/LocalViT-pytorch

Tasks

Image Classification

Datasets

ImageNet

Results from the Paper

Ranked #623 on Image Classification on ImageNet

Methods

Convolution • LocalViT
