BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 Jan 2023 · Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
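The abstract names the Querying Transformer but does not spell out its mechanics, so the following is only a minimal pure-Python sketch of the general idea, not the authors' implementation. It assumes the bridge is a small set of trainable query vectors that cross-attend to the frozen image encoder's outputs and are then linearly projected to the language model's embedding width. All sizes, helper names (`rand_matrix`, `cross_attend`), and the single-head, single-layer attention are hypothetical simplifications chosen for illustration.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Small random matrix as a list of row lists (stand-in for real weights)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, image_feats):
    """Each query returns a softmax-weighted average of the image features.

    Simplification (assumption): no key/value projections, one head, one layer.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / scale for f in image_feats]
        weights = softmax(scores)
        out.append([sum(w * f[j] for w, f in zip(weights, image_feats))
                    for j in range(len(q))])
    return out

# Hypothetical toy sizes, far smaller than any real model.
NUM_PATCHES, VISION_DIM = 16, 8
NUM_QUERIES, LLM_DIM = 4, 12

image_feats = rand_matrix(NUM_PATCHES, VISION_DIM)  # output of the frozen image encoder
queries = rand_matrix(NUM_QUERIES, VISION_DIM)      # trainable query vectors
proj = rand_matrix(VISION_DIM, LLM_DIM)             # trainable projection to LLM width

# A fixed number of visual tokens, independent of how many patches the encoder emits.
visual_tokens = matmul(cross_attend(queries, image_feats), proj)

text_embeds = rand_matrix(5, LLM_DIM)               # stand-in for frozen LLM text embeddings
llm_input = visual_tokens + text_embeds             # visual tokens placed before the text

# Only the queries and the projection would be updated in this toy setup.
trainable = NUM_QUERIES * VISION_DIM + VISION_DIM * LLM_DIM
```

In this toy setup the image encoder output and the text embeddings are treated as fixed inputs, and the only parameters that would be trained are `queries` and `proj`. That mirrors, in spirit, why the approach can have far fewer trainable parameters than end-to-end training.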


Results from the Paper


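Task | Dataset | Model | Metric Name | Metric Value | Global Rank
Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | #4
Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 145.8 | #6
Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | #5
Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 145.2 | #8
Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4 | 42.4 | #8
Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 144.5 | #9
Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | #5
Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 100 | #1
Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@10 | 100 | #1
Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 98.1 | #1
Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@10 | 98.9 | #1
Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 89.7 | #1
Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | #2
Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 100 | #1
Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@10 | 100 | #1
Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 97.6 | #2
Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@10 | 98.9 | #1
Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 88.6 | #2
Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 44.2 | #9
Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 44.7 | #7
Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 33.9 | #13
Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 34.6 | #12
Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | #11
Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 44.4 | #8
Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Overall score | 19.31 | #12
Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Deductive | 2.76 | #14
Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Abductive | 18.96 | #12
Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Analogical | 7.5 | #12
Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Params | 3B | #1
Visual Question Answering (VQA) | InfoSeek | BLIP2 | Accuracy | 14.6 | #6
Visual Instruction Following | LLaVA-Bench | BLIP-2 | avg score | 38.1 | #7
Visual Question Answering | MM-Vet | BLIP-2-12B | GPT-4 score | 22.4±0.2 | #90
Visual Question Answering | MM-Vet | BLIP-2-12B | Params | 12B | #1
Image Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 92.6 | #3
Image Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@1 | 68.3 | #1
Image Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@5 | 87.7 | #2
Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 98.0 | #4
Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@1 | 83.5 | #3
Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@5 | 96.0 | #3
Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 98.5 | #2
Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@1 | 85.4 | #1
Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@5 | 97.0 | #1
Image Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 91.8 | #4
Image Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@1 | 66.3 | #3
Image Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@5 | 86.5 | #3
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 123.7 | #1
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.8 | #2
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 123.7 | #1
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 16.3 | #1
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123 | #3
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.8 | #2
Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 117.8 | #3
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.4 | #2
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 120.2 | #1
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.9 | #1
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 119.2 | #2
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.3 | #3
Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 124.4 | #2
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 14.8 | #3
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123.4 | #3
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.1 | #1
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 124.8 | #1
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.1 | #1
Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 121.0 | #2
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.3 | #3
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 121.6 | #1
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.8 | #1
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | #1
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 119.7 | #3
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.4 | #2
Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | #1
Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 30.2 | #32
Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 31.7 | #31
Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | #29
Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 39.4 | #28
Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 40.7 | #27
Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 45.9 | #22
Open Vocabulary Attribute Detection | OVAD-Box benchmark | BLIP 2 (pretrained) | mean average precision | 25.5 | #2
Medical Visual Question Answering | PMC-VQA | BLIP-2 | Accuracy | 24.3 | #4
Generative Visual Question Answering | PMC-VQA | BLIP-2 | BLEU-1 | 7.6 | #2
Visual Question Answering (VQA) | PMC-VQA | BLIP-2 | Accuracy | 24.3 | #4
Visual Question Answering | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.74 | #4
Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65 | #42
Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 49.7 | #54
Visual Question Answering | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.30 | #1
Visual Question Answering | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.66 | #5
Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 52.3 | #51
Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 52.6 | #50
Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.3 | #48
Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63 | #47
Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 53.5 | #6
Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65.2 | #1
Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 54.3 | #5
Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63.1 | #3
Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.6 | #4
Visual Question Answering | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.55 | #3
Visual Question Answering | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.59 | #2
Visual Question Answering | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.19 | #1
Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 50.1 | #7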

Methods