LocalViT: Bringing Locality to Vision Transformers

12 Apr 2021  ·  Yawei Li, Kai Zhang, JieZhang Cao, Radu Timofte, Luc van Gool

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be modelled well by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet locality is essential for images, since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and all proper choices can lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to 4 vision transformers, which shows that the locality concept generalizes. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.
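
The core idea in the abstract, a feed-forward network with a depth-wise convolution inserted between its two point-wise layers, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the h-swish activation, the 3x3 kernel size, the residual connection, and the choice to let the class token bypass the block are all assumptions made here, and normalization layers and biases are omitted.

```python
import numpy as np


def hardswish(x):
    # h-swish: x * ReLU6(x + 3) / 6. The activation is a design choice;
    # using h-swish here is an assumption of this sketch.
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0


def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding.

    x:       (H, W, C) feature map
    kernels: (3, 3, C), one 3x3 filter per channel, no cross-channel mixing
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the input, weighted per channel.
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out


def local_feed_forward(tokens, w_expand, dw_kernels, w_project, grid):
    """Feed-forward block with a depth-wise convolution in the middle.

    tokens:     (1 + H*W, D), class token first, then patch tokens
    w_expand:   (D, r*D) point-wise expansion (equivalent to a 1x1 conv)
    dw_kernels: (3, 3, r*D) depth-wise filters
    w_project:  (r*D, D) point-wise projection back to width D
    grid:       (H, W) spatial layout of the patch tokens
    """
    h, w = grid
    cls_token, patches = tokens[:1], tokens[1:]
    fmap = patches.reshape(h, w, -1)                 # sequence -> 2D map
    hidden = hardswish(fmap @ w_expand)              # per-token expansion
    hidden = hardswish(depthwise_conv3x3(hidden, dw_kernels))  # local mixing
    out = (hidden @ w_project).reshape(h * w, -1)    # 2D map -> sequence
    # Residual on the patch tokens; the class token passes through unchanged.
    return np.concatenate([cls_token, patches + out], axis=0)


# Toy usage with hypothetical sizes: width 8, expansion ratio 4, 4x4 patches.
rng = np.random.default_rng(0)
dim, ratio, grid = 8, 4, (4, 4)
tokens = rng.standard_normal((1 + grid[0] * grid[1], dim))
w_expand = 0.1 * rng.standard_normal((dim, ratio * dim))
dw_kernels = 0.1 * rng.standard_normal((3, 3, ratio * dim))
w_project = 0.1 * rng.standard_normal((ratio * dim, dim))
output = local_feed_forward(tokens, w_expand, dw_kernels, w_project, grid)
```

Ignoring biases, the depth-wise filters add 9·rD weights on top of the 2·r·D² weights in the two point-wise layers. For an embedding width of D = 192 and expansion ratio r = 4 (values assumed here for illustration), that is 6,912 extra weights against 294,912 per block, roughly 2.3%, which is consistent with the abstract's claim of a negligible parameter increase.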

Results from the Paper


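Task                  Dataset   Model         Metric Name       Metric Value  Global Rank
Image Classification  ImageNet  LocalViT-TNT  Top 1 Accuracy    75.9%         #857
Image Classification  ImageNet  LocalViT-TNT  Number of params  6.3M          #441
Image Classification  ImageNet  LocalViT-TNT  GFLOPs            1.4           #128
Image Classification  ImageNet  LocalViT-T2T  Top 1 Accuracy    72.5%         #923
Image Classification  ImageNet  LocalViT-T2T  Number of params  4.3M          #387
Image Classification  ImageNet  LocalViT-T2T  GFLOPs            1.2           #114
Image Classification  ImageNet  LocalViT-T    Top 1 Accuracy    74.8%         #896
Image Classification  ImageNet  LocalViT-T    Number of params  5.9M          #434
Image Classification  ImageNet  LocalViT-T    GFLOPs            1.3           #118
Image Classification  ImageNet  LocalViT-PVT  Top 1 Accuracy    78.2%         #778
Image Classification  ImageNet  LocalViT-PVT  Number of params  13.5M         #508
Image Classification  ImageNet  LocalViT-PVT  GFLOPs            4.8           #226
Image Classification  ImageNet  LocalViT-S    Top 1 Accuracy    80.8%         #623
Image Classification  ImageNet  LocalViT-S    Number of params  22.4M         #568
Image Classification  ImageNet  LocalViT-S    GFLOPs            4.6           #215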

Methods