FLAVA: A Foundational Language And Vision Alignment Model

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
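As a rough illustration of the distinction the abstract draws, the sketch below contrasts a cross-modal (contrastive, dual-encoder) scoring path with a multi-modal (early-fusion) path. This is a schematic PyTorch sketch with placeholder module names and dimensions, not FLAVA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic sketch only: encoders and dimensions are placeholders,
# not FLAVA's actual modules or sizes.
class TwoPathwayModel(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.image_encoder = nn.Linear(dim, dim)   # stand-in for an image encoder
        self.text_encoder = nn.Linear(dim, dim)    # stand-in for a text encoder
        self.fusion_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )  # stand-in for a multimodal fusion transformer

    def contrastive_score(self, image_feats, text_feats):
        # Cross-modal path: encode each modality separately, then compare
        # with cosine similarity (CLIP-style contrastive scoring).
        img = F.normalize(self.image_encoder(image_feats), dim=-1)
        txt = F.normalize(self.text_encoder(text_feats), dim=-1)
        return img @ txt.T  # pairwise image-text similarity matrix

    def fused_representation(self, image_tokens, text_tokens):
        # Multi-modal path: concatenate token sequences from both modalities
        # and let a fusion transformer attend across them (early fusion).
        joint = torch.cat([image_tokens, text_tokens], dim=1)
        return self.fusion_encoder(joint)
```

A model that supports only the first path suits retrieval-style cross-modal tasks, while the second suits multi-modal reasoning tasks; the point of a foundation model in this sense is to handle both, plus unimodal vision and language tasks.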


Results from the Paper


Task                      Dataset   Model               Metric     Value   Global Rank
Image Retrieval           MS COCO   FLAVA (zero-shot)   Recall@1   38.38   #4
Image Retrieval           MS COCO   FLAVA (zero-shot)   Recall@5   67.47   #4
Image-to-Text Retrieval   MS COCO   FLAVA (zero-shot)   Recall@1   42.74   #5
Image-to-Text Retrieval   MS COCO   FLAVA (zero-shot)   Recall@5   76.76   #5
Image-to-Text Retrieval   MS COCO   CLIP (zero-shot)    Recall@1   37.12   #6
Image-to-Text Retrieval   MS COCO   CLIP (zero-shot)    Recall@5   69.48   #6
Image Retrieval           MS COCO   CLIP (zero-shot)    Recall@1   33.29   #5
Image Retrieval           MS COCO   CLIP (zero-shot)    Recall@5   62.47   #5
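
Zero-shot Recall@K numbers like those above are typically computed by ranking all candidates by embedding similarity and checking whether the ground-truth match appears among the top K. Below is a minimal, generic sketch of that computation over hypothetical precomputed embedding matrices; it assumes one caption per image and is not the evaluation script behind these results.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_embeds, text_embeds, k=1):
    """Zero-shot text->image Recall@K, assuming row i of each matrix
    belongs to the same ground-truth image-text pair."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sims = text_embeds @ image_embeds.T            # (num_texts, num_images)
    topk = sims.topk(k, dim=-1).indices            # top-K image indices per text query
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Example with random placeholder embeddings (not real FLAVA features):
imgs = torch.randn(100, 768)
txts = torch.randn(100, 768)
print(recall_at_k(imgs, txts, k=5))
```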

Methods