TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	GC ViT-B	Validation mIoU	49	# 132
Semantic Segmentation	ADE20K	GC ViT-B	Params (M)	125	# 23
Semantic Segmentation	ADE20K	GC ViT-B	GFLOPs (512 x 512)	1348	# 22
Semantic Segmentation	ADE20K	GC ViT-S	Validation mIoU	48.3	# 141
Semantic Segmentation	ADE20K	GC ViT-S	Params (M)	84	# 32
Semantic Segmentation	ADE20K	GC ViT-S	GFLOPs (512 x 512)	1163	# 19
Semantic Segmentation	ADE20K	GC ViT-T	Validation mIoU	46.5	# 167
Semantic Segmentation	ADE20K	GC ViT-T	Params (M)	58	# 44
Semantic Segmentation	ADE20K	GC ViT-T	GFLOPs (512 x 512)	947	# 14
Image Classification	ImageNet	GC ViT-T	Top 1 Accuracy	83.4%	# 394
Image Classification	ImageNet	GC ViT-T	Number of params	28M	# 629
Image Classification	ImageNet	GC ViT-T	GFLOPs	4.7	# 220
Image Classification	ImageNet	GC ViT-XT	Top 1 Accuracy	82.0%	# 530
Image Classification	ImageNet	GC ViT-XT	Number of params	20M	# 536
Image Classification	ImageNet	GC ViT-XT	GFLOPs	2.6	# 164
Image Classification	ImageNet	GC ViT-XXT	Top 1 Accuracy	79.8%	# 676
Image Classification	ImageNet	GC ViT-XXT	Number of params	12M	# 496
Image Classification	ImageNet	GC ViT-XXT	GFLOPs	2.1	# 151
Image Classification	ImageNet	GC ViT-S	Top 1 Accuracy	84.0%	# 336
Image Classification	ImageNet	GC ViT-S	Number of params	51M	# 729
Image Classification	ImageNet	GC ViT-S	GFLOPs	8.5	# 278
Image Classification	ImageNet	GC ViT-B	Top 1 Accuracy	84.5%	# 293
Image Classification	ImageNet	GC ViT-B	Number of params	90M	# 847
Image Classification	ImageNet	GC ViT-B	GFLOPs	14.8	# 336

Global Context Vision Transformers

20 Jun 2022 · Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method combines global context self-attention modules with standard local self-attention to model both long- and short-range spatial interactions effectively and efficiently, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by using modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M, and 201M parameters achieve 84.3%, 85.0%, and 85.7% Top-1 accuracy, respectively, at 224 × 224 image resolution and without any pre-training, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
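
The pairing of local and global context self-attention described in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single head, random projection matrices (`Wq`, `Wk`, `Wv` are hypothetical names), and plain average pooling as a stand-in for whatever learned module produces the global query. The point it shows is that sharing one pooled query set across all windows lets every window see whole-image context, with no attention masks or window shifting.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    """Split an (H, W, C) feature map into (num_windows, w*w, C) token groups."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def attention(q, k, v):
    """Scaled dot-product attention, batched over the leading (window) axis."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def local_window_attention(x, w, Wq, Wk, Wv):
    """Short-range: queries, keys and values all come from the same window."""
    t = window_partition(x, w)
    return attention(t @ Wq, t @ Wk, t @ Wv)

def global_context_attention(x, w, Wq, Wk, Wv):
    """Long-range: one query set pooled from the whole map is shared by every window."""
    H, W, C = x.shape
    # Assumption: average pooling to a w x w grid stands in for a learned query generator.
    pooled = x.reshape(w, H // w, w, W // w, C).mean(axis=(1, 3))
    q_global = pooled.reshape(1, w * w, C) @ Wq   # (1, w*w, C)
    t = window_partition(x, w)                    # (num_windows, w*w, C)
    q = np.broadcast_to(q_global, t.shape)        # same queries for each window
    return attention(q, t @ Wk, t @ Wv)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))                   # toy 8 x 8 feature map, 16 channels
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
print(local_window_attention(x, 4, Wq, Wk, Wv).shape)    # (4, 16, 16)
print(global_context_attention(x, 4, Wq, Wk, Wv).shape)  # (4, 16, 16)
```

In this sketch the feature-map height and width must be divisible by the window size `w`. How the two attention types are interleaved across layers is not specified in the abstract and is left out here.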

Code

nvlabs/gcvit (official): 414 stars

rwightman/pytorch-image-models: 29,826 stars

open-mmlab/mmclassification: 3,169 stars

leondgarse/keras_cv_attention_models: 559 stars

awsaf49/gcvit-tf

8 implementations in total (5 shown above)

Tasks

Image Classification

Inductive Bias

Instance Segmentation

Object Detection

Segmentation

Semantic Segmentation

Datasets

ImageNet

MS COCO

ADE20K

Results from the Paper

Ranked #132 on Semantic Segmentation on ADE20K

Methods

Absolute Position Encodings • Adam • BASE • BPE • ConvNeXt • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Stochastic Depth • Swin Transformer • Transformer • Vision Transformer
