TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	ViT-B/16	Top 1 Accuracy	88.6%	# 44
Image Classification	ImageNet	ViT-B/16	Number of params	86M	# 814
Image Classification	ImageNet	ViT-L/16 (384res, distilled from ViT-22B)	Top 1 Accuracy	89.6%	# 24
Image Classification	ImageNet	ViT-L/16 (384res, distilled from ViT-22B)	Number of params	307M	# 915
Zero-Shot Transfer Image Classification	ImageNet	LiT-22B	Accuracy (Private)	85.9	# 4
Zero-Shot Transfer Image Classification	ImageNet-A	LiT-22B	Accuracy (Private)	90.1	# 2
Zero-Shot Transfer Image Classification	ImageNet-R	LiT-22B	Accuracy	96.0	# 4
Zero-Shot Transfer Image Classification	ImageNet V2	LiT-22B	Accuracy (Private)	80.9	# 2
Action Classification	Kinetics-400	ViT-22B	Acc@1	88.0	# 22
Zero-Shot Transfer Image Classification	ObjectNet	LiT-22B	Accuracy (Private)	87.6	# 1

Scaling Vision Transformers to 22 Billion Parameters

10 Feb 2023 · Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
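
The abstract mentions evaluating with "a lightweight linear model on frozen features". As a rough illustration of that general idea only, and not the authors' evaluation code, the sketch below trains a logistic-regression probe on fixed feature vectors. The features here are synthetic Gaussian clusters standing in for backbone embeddings, and every name in it (`make_frozen_features`, `train_linear_probe`, the learning rate, the epoch count) is a hypothetical choice made for this example.

```python
import math
import random


def make_frozen_features(n_per_class, dim, rng):
    """Synthetic stand-in for embeddings from a frozen backbone (an assumption)."""
    feats, labels = [], []
    for label, center in enumerate((-2.0, 2.0)):
        for _ in range(n_per_class):
            feats.append([rng.gauss(center, 0.5) for _ in range(dim)])
            labels.append(label)
    return feats, labels


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(feats, labels, lr=0.1, epochs=200):
    """Fit binary logistic regression by full-batch gradient descent.

    Only the probe weights (w, b) are updated; the features are never changed,
    which is what "frozen" means in this sketch.
    """
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(logit(w, b, x)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, feats, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(int((logit(w, b, x) > 0) == (y == 1)) for x, y in zip(feats, labels))
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_frozen_features(50, 4, rng)
test_x, test_y = make_frozen_features(50, 4, rng)
w, b = train_linear_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"linear-probe accuracy on held-out synthetic features: {probe_accuracy:.2f}")
```

In a real setting the synthetic clusters would be replaced by embeddings computed once from a pretrained model, and the probe would typically be multi-class; the point of the sketch is only that the backbone stays fixed while a small linear layer is fit on top.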

Code

lucidrains/flash-cosine-sim-attenti…

192

Tasks

Action Classification

Fairness

Image Classification

Linear-Probe Classification

Zero-Shot Transfer Image Classification

Datasets

ImageNet

CelebA

Kinetics

Places

ADE20K

Kinetics 400

ImageNet-R

Perceptual Similarity

ImageNet-A

ObjectNet

Results from the Paper

Ranked #1 on Zero-Shot Transfer Image Classification on ObjectNet
