| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | OFA Large | BLEU-4 | 0 | #6 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | OFA Large | CIDEr | 0 | #6 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Text-only FT) | Specificity | 94 | #1 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Fine-tuned) | Specificity | 84 | #2 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XL (Fine-tuned) | Specificity | 81 | #3 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Zero-shot) | Specificity | 71 | #6 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP Large | Specificity | 77 | #4 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | CoCa ViT-L-14 MSCOCO | Specificity | 72 | #5 |
| Image-to-Text Retrieval | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | CLIP ViT-L/14 | Specificity | 70 | #7 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Text-only FT) | Exact Match | 4 | #6 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Text-only FT) | BEM | 24 | #6 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Fine-tuned) | Exact Match | 21 | #1 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Fine-tuned) | BEM | 57 | #1 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XL (Fine-tuned) | Exact Match | 20 | #2 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XL (Fine-tuned) | BEM | 55 | #2 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Zero-shot) | Exact Match | 15 | #3 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Zero-shot) | BEM | 55 | #2 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP Large | Exact Match | 6 | #5 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP Large | BEM | 39 | #4 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | OFA Large | Exact Match | 8 | #4 |
| Visual Question Answering (VQA) | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | OFA Large | BEM | 38 | #5 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Fine-tuned) | BLEU-4 | 42 | #1 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Fine-tuned) | CIDEr | 177 | #1 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XL (Fine-tuned) | BLEU-4 | 41 | #2 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XL (Fine-tuned) | CIDEr | 174 | #2 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Zero-shot) | BLEU-4 | 31 | #3 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Zero-shot) | CIDEr | 120 | #3 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP Large | BLEU-4 | 13 | #5 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP Large | CIDEr | 65 | #5 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | CoCa ViT-L-14 MSCOCO | BLEU-4 | 25 | #4 |
| Image Captioning | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | CoCa ViT-L-14 MSCOCO | CIDEr | 102 | #4 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Ground-truth Caption -> GPT3 (Oracle) | Human (%) | 68 | #1 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | Predicted Caption -> GPT3 | Human (%) | 33 | #2 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Fine-tuned) | Human (%) | 27 | #3 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XL (Fine-tuned) | Human (%) | 15 | #4 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | BLIP2 FlanT5-XXL (Zero-shot) | Human (%) | 0 | #5 |