Global Context Vision Transformers

20 Jun 2022 · Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to model both long- and short-range spatial interactions effectively and efficiently, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M and 201M parameters achieve 84.3%, 85.0% and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
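The abstract describes pairing local self-attention with a global context self-attention module, without attention masks or shifted windows. The toy sketch below illustrates one way such a pairing can work, under loud assumptions that are not taken from the text above: tokens form a 1-D sequence rather than a 2-D feature map, there are no learned projections, and the hypothetical `global_queries` helper simply averages tokens across windows as a stand-in for whatever query generator the architecture actually uses. It is meant only to show how queries derived from the whole input let each local window receive long-range information.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of d-dimensional vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def partition_windows(tokens, window):
    """Split the sequence into non-overlapping windows (length must divide evenly)."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def local_attention(tokens, window):
    """Each window attends only to itself: short-range interactions."""
    out = []
    for win in partition_windows(tokens, window):
        out.extend(attend(win, win, win))
    return out


def global_queries(tokens, window):
    """Hypothetical stand-in: average the tokens at each window offset,
    giving `window` query vectors that summarise the whole sequence."""
    wins = partition_windows(tokens, window)
    d = len(tokens[0])
    return [[sum(w[p][j] for w in wins) / len(wins) for j in range(d)]
            for p in range(window)]


def global_context_attention(tokens, window):
    """Shared global queries attend to each window's local keys and values."""
    gq = global_queries(tokens, window)
    out = []
    for win in partition_windows(tokens, window):
        out.extend(attend(gq, win, win))
    return out


if __name__ == "__main__":
    random.seed(0)
    tokens = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(8)]
    perturbed = [t[:] for t in tokens]
    perturbed[7] = [5.0, -5.0]  # change a token in the second window only

    # Does the first window's output react to a change in the second window?
    print("local:", local_attention(tokens, 4)[:4] != local_attention(perturbed, 4)[:4])
    print("global:", global_context_attention(tokens, 4)[:4]
          != global_context_attention(perturbed, 4)[:4])
```

In this sketch no masking or window shifting is needed: the only cross-window path is the shared set of global queries, while the keys and values stay local to each window.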

Results from the Paper


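| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Semantic Segmentation | ADE20K | GC ViT-B | Validation mIoU | 49 | # 132 |
| Semantic Segmentation | ADE20K | GC ViT-B | Params (M) | 125 | # 23 |
| Semantic Segmentation | ADE20K | GC ViT-B | GFLOPs (512 x 512) | 1348 | # 22 |
| Semantic Segmentation | ADE20K | GC ViT-S | Validation mIoU | 48.3 | # 141 |
| Semantic Segmentation | ADE20K | GC ViT-S | Params (M) | 84 | # 32 |
| Semantic Segmentation | ADE20K | GC ViT-S | GFLOPs (512 x 512) | 1163 | # 19 |
| Semantic Segmentation | ADE20K | GC ViT-T | Validation mIoU | 46.5 | # 167 |
| Semantic Segmentation | ADE20K | GC ViT-T | Params (M) | 58 | # 44 |
| Semantic Segmentation | ADE20K | GC ViT-T | GFLOPs (512 x 512) | 947 | # 14 |
| Image Classification | ImageNet | GC ViT-T | Top 1 Accuracy | 83.4% | # 394 |
| Image Classification | ImageNet | GC ViT-T | Number of params | 28M | # 629 |
| Image Classification | ImageNet | GC ViT-T | GFLOPs | 4.7 | # 220 |
| Image Classification | ImageNet | GC ViT-XT | Top 1 Accuracy | 82.0% | # 530 |
| Image Classification | ImageNet | GC ViT-XT | Number of params | 20M | # 536 |
| Image Classification | ImageNet | GC ViT-XT | GFLOPs | 2.6 | # 164 |
| Image Classification | ImageNet | GC ViT-XXT | Top 1 Accuracy | 79.8% | # 676 |
| Image Classification | ImageNet | GC ViT-XXT | Number of params | 12M | # 496 |
| Image Classification | ImageNet | GC ViT-XXT | GFLOPs | 2.1 | # 151 |
| Image Classification | ImageNet | GC ViT-S | Top 1 Accuracy | 84.0% | # 336 |
| Image Classification | ImageNet | GC ViT-S | Number of params | 51M | # 729 |
| Image Classification | ImageNet | GC ViT-S | GFLOPs | 8.5 | # 278 |
| Image Classification | ImageNet | GC ViT-B | Top 1 Accuracy | 84.5% | # 293 |
| Image Classification | ImageNet | GC ViT-B | Number of params | 90M | # 847 |
| Image Classification | ImageNet | GC ViT-B | GFLOPs | 14.8 | # 336 |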

Methods