Prompting Large Vision-Language Models for Compositional Reasoning
Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address these issues, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains a further improvement of up to 10% in accuracy when enhanced with the optimal description.
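As a rough illustration of the "depict, then reason" idea the abstract describes, the sketch below first prompts a large vision-language model to describe an image, then prompts it to reason step by step over that description to pick the matching caption. The `query_lvlm` helper and both prompts are hypothetical placeholders, not the paper's actual KeyComp pipeline.

```python
# A minimal sketch of the two-stage generative matching idea, assuming a
# generic LVLM text-completion interface. Prompts are illustrative only.

def query_lvlm(prompt: str, image=None) -> str:
    """Hypothetical wrapper around a large vision-language model API."""
    raise NotImplementedError("plug in your LVLM client here")

def match_caption(image, captions: list[str]) -> int:
    # Stage 1: prompt the LVLM to depict the image in natural language.
    description = query_lvlm("Describe this image in detail.", image=image)

    # Stage 2: prompt the model to reason step by step over the description
    # and select the caption whose compositional structure matches it.
    options = "\n".join(f"{i}: {c}" for i, c in enumerate(captions))
    answer = query_lvlm(
        f"Image description:\n{description}\n\n"
        f"Candidate captions:\n{options}\n\n"
        "Reason step by step about which caption matches the description, "
        "then output only the index of the best caption."
    )
    return int(answer.strip().split()[-1])
```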
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Reasoning | Winoground | KeyComp* (GPT-4) | Text Score | 43.5 | #22 |
| Visual Reasoning | Winoground | KeyComp* (GPT-4) | Image Score | 28.7 | #21 |
| Visual Reasoning | Winoground | KeyComp* (GPT-4) | Group Score | 18.2 | #36 |
| Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Text Score | 42.7 | #26 |
| Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Image Score | 27.8 | #23 |
| Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Group Score | 17.4 | #38 |
| Visual Reasoning | Winoground | KeyComp (GPT-3.5) | Text Score | 30.3 | #66 |
| Visual Reasoning | Winoground | KeyComp (GPT-3.5) | Image Score | 24.6 | #37 |
| Visual Reasoning | Winoground | KeyComp (GPT-3.5) | Group Score | 12.4 | #54 |