TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	MS COCO	RA-CM3 (2.7B)	CIDEr	89.1	# 8
Image Captioning	MS COCO	Vanilla CM3	CIDEr	71.9	# 12
Image Captioning	MS COCO	Flamingo (80B; 4-shot)	CIDEr	103	# 7
Image Captioning	MS COCO	Flamingo (3B; 4-shot)	CIDEr	85	# 9
Image Captioning	MS COCO	Parti	CIDEr	83.9	# 10
Image Captioning	MS COCO	X-LXMERT	CIDEr	55.8	# 13
Image Captioning	MS COCO	minDALL-E	CIDEr	48	# 14
Image Captioning	MS COCO	ruDALL-E-XL	CIDEr	38.7	# 15
Image Captioning	MS COCO	DALL-E	CIDEr	20.2	# 16
Text-to-Image Generation	MS COCO	RA-CM3 (2.7B)	FID	15.7	# 42
Text-to-Image Generation	MS COCO	Vanilla CM3	FID	29.5	# 59
Text-to-Image Generation	MS COCO	DALL-E (12B)	FID	28	# 57
Text-to-Image Generation	MS COCO	Stable Diffusion	FID	12.63	# 35

Retrieval-Augmented Multimodal Language Modeling

22 Nov 2022 · Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
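The retrieve-then-generate flow described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`cosine`, `retrieve_top_k`, `build_generator_input`), the toy three-dimensional embeddings, and the plain-text prompt format are all assumptions made for this example. In the paper, the embeddings would come from a pretrained CLIP encoder and the assembled input would be fed to a CM3 Transformer; neither model is loaded here.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical memory entry: (document text, embedding vector).
# The vectors below are hand-written stand-ins for CLIP embeddings so that
# the sketch runs without downloading any model.
MemoryEntry = Tuple[str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Sequence[float],
                   memory: List[MemoryEntry],
                   k: int = 2) -> List[str]:
    """Return the k memory documents most similar to the query embedding."""
    ranked = sorted(memory, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_generator_input(retrieved: List[str], prompt: str) -> str:
    """Prepend retrieved documents to the prompt (assumed format)."""
    return "\n".join(list(retrieved) + [prompt])


# Toy external memory; a real one would hold image+text documents.
memory: List[MemoryEntry] = [
    ("[doc] Eiffel Tower at night", [0.9, 0.1, 0.0]),
    ("[doc] A bowl of ramen", [0.0, 0.2, 0.9]),
    ("[doc] Paris skyline with the Eiffel Tower", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded query
retrieved = retrieve_top_k(query_vec, memory, k=2)
generator_input = build_generator_input(retrieved, "Caption: the Eiffel Tower in autumn")
print(generator_input)
```

The point of the design is that knowledge such as the appearance of a landmark lives in the external memory rather than in the generator's parameters, so the memory can be extended without retraining the generator.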

Code

No code implementations yet.

Tasks

Caption Generation

Image Captioning

Image Generation

In-Context Learning

Language Modelling

Retrieval

Text Generation

Text-to-Image Generation

Datasets

MS COCO

Results from the Paper

Ranked #7 on Image Captioning on MS COCO

Methods

Absolute Position Encodings • Adam • BASE • BPE • CLIP • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
