TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Cross-Modal Retrieval	COCO 2014	ALIGN	Image-to-text R@1	77	# 14
Cross-Modal Retrieval	COCO 2014	ALIGN	Image-to-text R@10	96.9	# 12
Cross-Modal Retrieval	COCO 2014	ALIGN	Image-to-text R@5	93.5	# 14
Cross-Modal Retrieval	COCO 2014	ALIGN	Text-to-image R@1	59.9	# 16
Cross-Modal Retrieval	COCO 2014	ALIGN	Text-to-image R@10	89.8	# 16
Cross-Modal Retrieval	COCO 2014	ALIGN	Text-to-image R@5	83.3	# 16
Zero-Shot Cross-Modal Retrieval	COCO 2014	ALIGN	Image-to-text R@1	58.6	# 12
Zero-Shot Cross-Modal Retrieval	COCO 2014	ALIGN	Image-to-text R@5	83.0	# 12
Zero-Shot Cross-Modal Retrieval	COCO 2014	ALIGN	Image-to-text R@10	89.7	# 11
Zero-Shot Cross-Modal Retrieval	COCO 2014	ALIGN	Text-to-image R@1	45.6	# 12
Zero-Shot Cross-Modal Retrieval	COCO 2014	ALIGN	Text-to-image R@5	69.8	# 13
Zero-Shot Cross-Modal Retrieval	COCO 2014	ALIGN	Text-to-image R@10	78.6	# 12
Zero-Shot Cross-Modal Retrieval	Flickr30k	ALIGN	Image-to-text R@1	88.6	# 12
Zero-Shot Cross-Modal Retrieval	Flickr30k	ALIGN	Image-to-text R@5	98.7	# 13
Zero-Shot Cross-Modal Retrieval	Flickr30k	ALIGN	Image-to-text R@10	99.7	# 7
Zero-Shot Cross-Modal Retrieval	Flickr30k	ALIGN	Text-to-image R@1	75.7	# 13
Zero-Shot Cross-Modal Retrieval	Flickr30k	ALIGN	Text-to-image R@5	93.8	# 10
Zero-Shot Cross-Modal Retrieval	Flickr30k	ALIGN	Text-to-image R@10	96.8	# 9
Cross-Modal Retrieval	Flickr30k	ALIGN	Image-to-text R@1	95.3	# 9
Cross-Modal Retrieval	Flickr30k	ALIGN	Image-to-text R@10	100	# 1
Cross-Modal Retrieval	Flickr30k	ALIGN	Image-to-text R@5	99.8	# 8
Cross-Modal Retrieval	Flickr30k	ALIGN	Text-to-image R@1	84.9	# 10
Cross-Modal Retrieval	Flickr30k	ALIGN	Text-to-image R@10	98.6	# 10
Cross-Modal Retrieval	Flickr30k	ALIGN	Text-to-image R@5	97.4	# 8
Image Classification	Flowers-102	ALIGN	Accuracy	99.65%	# 5
Fine-Grained Image Classification	Food-101	ALIGN	Accuracy	95.88	# 3
Image Classification	ImageNet	ALIGN (EfficientNet-L2)	Top 1 Accuracy	88.64%	# 42
Image Classification	ImageNet	ALIGN (EfficientNet-L2)	Number of params	480M	# 935
Image Classification	ImageNet	ALIGN (EfficientNet-L2)	Hardware Burden	None	# 1
Image Classification	ImageNet	ALIGN (EfficientNet-L2)	Operations per network pass	None	# 1
Zero-Shot Transfer Image Classification	ImageNet	ALIGN	Accuracy (Private)	76.4	# 17
Zero-Shot Transfer Image Classification	ImageNet	ALIGN	Accuracy (Public)	-	# 4
Zero-Shot Transfer Image Classification	ImageNet-A	ALIGN	Accuracy (Private)	75.8	# 11
Zero-Shot Transfer Image Classification	ImageNet-A	ALIGN	Accuracy (Public)	-	# 2
Zero-Shot Transfer Image Classification	ImageNet-R	ALIGN	Accuracy	92.2	# 9
Zero-Shot Transfer Image Classification	ImageNet V2	ALIGN	Accuracy (Private)	70.1	# 10
Zero-Shot Transfer Image Classification	ImageNet V2	ALIGN	Accuracy (Public)	-	# 2
Fine-Grained Image Classification	Oxford-IIIT Pet Dataset	ALIGN	Accuracy	96.19%	# 4
Fine-Grained Image Classification	Stanford Cars	ALIGN	Accuracy	96.13%	# 4
Image Classification	VTAB-1k	ALIGN (50 hypers/task)	Top-1 Accuracy	79.99	# 1
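
The R@1, R@5, and R@10 rows above are recall-at-K scores: the percentage of queries for which at least one correct match appears among the top K retrieved candidates. A minimal sketch of that computation follows, assuming a precomputed query-by-candidate similarity matrix; the function name `recall_at_k` and its arguments are hypothetical and not taken from the paper's code.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q (higher is better).
    ground_truth[q] is the set of candidate indices that are correct for query q,
    e.g. the several captions that describe one image.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example (made-up scores): 3 queries, 3 candidates, one correct match each.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
toy_ground_truth = [{0}, {1}, {2}]
```

Image-to-text and text-to-image scores would use the same function with the roles of query and candidate swapped, that is, with the similarity matrix transposed and the ground-truth sets rebuilt accordingly.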

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

11 Feb 2021 · Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
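
The abstract describes a dual encoder trained with a contrastive loss, and zero-shot classification on top of the aligned embeddings. The sketch below illustrates one common formulation of that idea, a symmetric softmax loss over in-batch image-text pairs. It is an assumption-laden illustration rather than the authors' implementation: the function names, the fixed `temperature` value, and the use of plain Python lists in place of encoder outputs are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs, temperature):
    """Temperature-scaled cosine similarity between every image and every text."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def diagonal_nll(logits):
    """Mean negative log-probability of the matched (diagonal) pair per row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.05):
    """Image-to-text loss plus text-to-image loss over one batch.

    Row i of image_embs is assumed to be paired with row i of text_embs;
    every other row in the batch acts as a negative.
    """
    sim = similarity_matrix(image_embs, text_embs, temperature)
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text-to-image direction
    return diagonal_nll(sim) + diagonal_nll(sim_t)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is closest to the image."""
    img = l2_normalize(image_emb)
    scores = [sum(a * b for a, b in zip(img, l2_normalize(t)))
              for t in class_text_embs]
    return max(range(len(scores)), key=lambda c: scores[c])
```

In this sketch the loss is small when each image embedding is closest to its own text embedding and large when pairs are mismatched. Zero-shot classification then reduces to a nearest-neighbor lookup against text embeddings of the class names, with no classifier trained on the target labels.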

Code

kakaobrain/coyo-dataset (1,069 stars)

facebookresearch/metaclip (991 stars)

MicPie/clasp (143 stars)

willard-yuan/video-text-retrieval-p…

Tasks

Cross-Modal Retrieval

Fine-Grained Image Classification

Image Classification

Representation Learning

Retrieval

Text Retrieval

Zero-Shot Cross-Modal Retrieval

Zero-Shot Image Classification

Zero-Shot Transfer Image Classification

Datasets

ImageNet

MS COCO

Oxford 102 Flower

Flickr30k

Stanford Cars

Food-101

ImageNet-R

ImageNet-A

Conceptual Captions

Multi30K

JFT-300M

Oxford-IIIT Pet Dataset

CxC

Results from the Paper

Ranked #1 on Image Classification on VTAB-1k (using extra training data)
