Retrieval-Augmented Multimodal Language Modeling

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring ever larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
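The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the embeddings below are hand-written toy vectors standing in for the output of a pretrained encoder such as CLIP, the helper names (`cosine`, `retrieve`, `build_generator_input`) are hypothetical, and prepending the retrieved documents to the prompt is an assumption about how a generator might consume them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory, k=2):
    """Return the k memory entries whose embeddings are most similar to the query."""
    ranked = sorted(memory, key=lambda doc: cosine(query_vec, doc["embedding"]), reverse=True)
    return ranked[:k]

def build_generator_input(retrieved, prompt):
    """Place retrieved documents before the prompt (assumed layout, for illustration)."""
    return [doc["content"] for doc in retrieved] + [prompt]

# Toy external memory; contents and embedding values are hypothetical.
memory = [
    {"content": "<image: Eiffel Tower at night> caption: the Eiffel Tower", "embedding": [0.9, 0.1, 0.0]},
    {"content": "<image: golden retriever> caption: a dog in a park", "embedding": [0.0, 0.2, 0.9]},
    {"content": "<image: Paris skyline> caption: view over Paris", "embedding": [0.7, 0.6, 0.1]},
]

prompt = "a photo of the Eiffel Tower"
query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the prompt

retrieved = retrieve(query_vec, memory, k=2)
generator_input = build_generator_input(retrieved, prompt)
for item in generator_input:
    print(item)
```

With these toy vectors the two Paris-related entries rank above the unrelated one, so the generator input carries the relevant image-caption pairs followed by the prompt. Keeping the memory outside the model is what makes the approach modular: entries can be added or replaced without retraining the generator.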

Results from the Paper


Task                     | Dataset | Model                  | Metric | Value | Global Rank
Image Captioning         | MS COCO | RA-CM3 (2.7B)          | CIDEr  | 89.1  | #8
Image Captioning         | MS COCO | Vanilla CM3            | CIDEr  | 71.9  | #12
Image Captioning         | MS COCO | Flamingo (80B; 4-shot) | CIDEr  | 103   | #7
Image Captioning         | MS COCO | Flamingo (3B; 4-shot)  | CIDEr  | 85    | #9
Image Captioning         | MS COCO | Parti                  | CIDEr  | 83.9  | #10
Image Captioning         | MS COCO | X-LXMERT               | CIDEr  | 55.8  | #13
Image Captioning         | MS COCO | minDALL-E              | CIDEr  | 48    | #14
Image Captioning         | MS COCO | ruDALL-E-XL            | CIDEr  | 38.7  | #15
Image Captioning         | MS COCO | DALL-E                 | CIDEr  | 20.2  | #16
Text-to-Image Generation | MS COCO | RA-CM3 (2.7B)          | FID    | 15.7  | #42
Text-to-Image Generation | MS COCO | Vanilla CM3            | FID    | 29.5  | #59
Text-to-Image Generation | MS COCO | DALL-E (12B)           | FID    | 28    | #57
Text-to-Image Generation | MS COCO | Stable Diffusion       | FID    | 12.63 | #35
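
As a quick check on how the abstract's "12 FID and 17 CIDEr improvements" relate to the rows above, the short script below recomputes the gaps between RA-CM3 and the two named baselines. All values are copied from the results listing; which baseline the abstract's rounded figures refer to is not stated here, so the pairing is only a reading of the numbers. Higher CIDEr is better, while lower FID is better, so the FID gain is computed as baseline minus RA-CM3.

```python
# Values copied from the results listing above (MS COCO).
cider = {"RA-CM3 (2.7B)": 89.1, "Vanilla CM3": 71.9, "DALL-E": 20.2}
fid = {"RA-CM3 (2.7B)": 15.7, "Vanilla CM3": 29.5, "DALL-E (12B)": 28.0}

# CIDEr: higher is better, so gain = RA-CM3 minus baseline.
cider_gain_vs_cm3 = round(cider["RA-CM3 (2.7B)"] - cider["Vanilla CM3"], 1)

# FID: lower is better, so gain = baseline minus RA-CM3.
fid_gain_vs_cm3 = round(fid["Vanilla CM3"] - fid["RA-CM3 (2.7B)"], 1)
fid_gain_vs_dalle = round(fid["DALL-E (12B)"] - fid["RA-CM3 (2.7B)"], 1)

print("CIDEr gain over Vanilla CM3:", cider_gain_vs_cm3)   # 17.2
print("FID gain over Vanilla CM3:", fid_gain_vs_cm3)       # 13.8
print("FID gain over DALL-E (12B):", fid_gain_vs_dalle)    # 12.3
```

The 17.2 CIDEr gap over Vanilla CM3 matches the abstract's "17 CIDEr". The abstract's "12 FID" is closest to the 12.3 gap over DALL-E (12B); the gap over Vanilla CM3 in this listing is 13.8.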

Methods