Measuring Progress in Fine-grained Vision-and-Language Understanding

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has spurred increased interest in the community in developing new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms the other baselines, and that modelling innovations can impact performance more than scaling Web data, which sometimes even degrades performance. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics and discover that, for some tasks, performance peaks early in training or fluctuates significantly, never converging.

Results from the Paper


Task: Visual Reasoning, Dataset: Winoground
(each cell shows the score, with the model's global leaderboard rank in parentheses)

Model                    Text Score    Image Score   Group Score
X-VLM 16M                46.7 (#13)    24.5 (#38)    21.2 (#27)
BLIP-ViT/L 129M          34.7 (#53)    14.5 (#71)    12.2 (#57)
BLIP 129M (CapFilt/L)    34.7 (#53)    15.2 (#69)    12.2 (#57)
BLIP 129M                35.5 (#50)    15.0 (#70)    11.7 (#61)
PEVL 14M                 33.2 (#56)    15.7 (#66)    12.2 (#57)
BLIP 14M                 36.5 (#45)    18.5 (#55)    14.5 (#46)
ALBEF 14M                32.5 (#57)    16.2 (#62)    12.7 (#53)
X-VLM 4M                 44.0 (#20)    26.7 (#27)    21.5 (#25)
ALBEF 4M                 29.2 (#73)    15.5 (#67)    11.0 (#64)

Methods


No methods listed for this paper.