TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Retrieval	ConQA Conceptual	CLIP	Recall@1	12.2	# 1
Image Retrieval	ConQA Conceptual	CLIP	Recall@5	30.6	# 1
Image Retrieval	ConQA Conceptual	CLIP	Recall@10	36.7	# 2
Image Retrieval	ConQA Conceptual	CLIP	R-precision	6.8	# 1
Image Retrieval	ConQA Conceptual	SGRAF	Recall@1	0.0	# 5
Image Retrieval	ConQA Conceptual	SGRAF	Recall@5	8.2	# 5
Image Retrieval	ConQA Conceptual	SGRAF	Recall@10	10.2	# 5
Image Retrieval	ConQA Conceptual	SGRAF	R-precision	1.3	# 5
Image Retrieval	ConQA Conceptual	NAAF	Recall@1	4.1	# 3
Image Retrieval	ConQA Conceptual	NAAF	Recall@5	12.2	# 4
Image Retrieval	ConQA Conceptual	NAAF	Recall@10	16.3	# 4
Image Retrieval	ConQA Conceptual	NAAF	R-precision	2.4	# 4
Image Retrieval	ConQA Conceptual	BLIP-2	Recall@1	8.2	# 2
Image Retrieval	ConQA Conceptual	BLIP-2	Recall@5	28.6	# 2
Image Retrieval	ConQA Conceptual	BLIP-2	Recall@10	36.7	# 2
Image Retrieval	ConQA Conceptual	BLIP-2	R-precision	5.4	# 2
Image Retrieval	ConQA Conceptual	BLIP	Recall@1	4.1	# 3
Image Retrieval	ConQA Conceptual	BLIP	Recall@5	28.6	# 2
Image Retrieval	ConQA Conceptual	BLIP	Recall@10	40.8	# 1
Image Retrieval	ConQA Conceptual	BLIP	R-precision	5.4	# 2
Image Retrieval	ConQA Descriptive	SGRAF	Recall@1	6.9	# 5
Image Retrieval	ConQA Descriptive	SGRAF	Recall@5	24.1	# 5
Image Retrieval	ConQA Descriptive	SGRAF	Recall@10	34.5	# 5
Image Retrieval	ConQA Descriptive	SGRAF	R-precision	7.9	# 5
Image Retrieval	ConQA Descriptive	NAAF	Recall@1	13.8	# 4
Image Retrieval	ConQA Descriptive	NAAF	Recall@5	34.5	# 4
Image Retrieval	ConQA Descriptive	NAAF	Recall@10	44.8	# 4
Image Retrieval	ConQA Descriptive	NAAF	R-precision	10.6	# 4
Image Retrieval	ConQA Descriptive	BLIP	Recall@1	20.7	# 1
Image Retrieval	ConQA Descriptive	BLIP	Recall@5	58.3	# 1
Image Retrieval	ConQA Descriptive	BLIP	Recall@10	62.1	# 2
Image Retrieval	ConQA Descriptive	BLIP	R-precision	15.3	# 2
Image Retrieval	ConQA Descriptive	BLIP-2	Recall@1	20.7	# 1
Image Retrieval	ConQA Descriptive	BLIP-2	Recall@5	51.7	# 3
Image Retrieval	ConQA Descriptive	BLIP-2	Recall@10	62.1	# 2
Image Retrieval	ConQA Descriptive	BLIP-2	R-precision	15.3	# 2
Image Retrieval	ConQA Descriptive	CLIP	Recall@1	20.7	# 1
Image Retrieval	ConQA Descriptive	CLIP	Recall@5	58.3	# 1
Image Retrieval	ConQA Descriptive	CLIP	Recall@10	65.5	# 1
Image Retrieval	ConQA Descriptive	CLIP	R-precision	16.5	# 1
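
One way to read the rows above is to compare each model's descriptive score with its conceptual score for the same metric. The following is a minimal sketch of that comparison, with the values copied from the table; the `RESULTS` layout and the `ratio` helper are illustrative names chosen here, not part of the ConQA code.

```python
from typing import Dict, Optional, Tuple

# (conceptual, descriptive) metric values copied from the table above.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "CLIP": {"Recall@1": (12.2, 20.7), "Recall@5": (30.6, 58.3),
             "Recall@10": (36.7, 65.5), "R-precision": (6.8, 16.5)},
    "BLIP": {"Recall@1": (4.1, 20.7), "Recall@5": (28.6, 58.3),
             "Recall@10": (40.8, 62.1), "R-precision": (5.4, 15.3)},
    "BLIP-2": {"Recall@1": (8.2, 20.7), "Recall@5": (28.6, 51.7),
               "Recall@10": (36.7, 62.1), "R-precision": (5.4, 15.3)},
    "NAAF": {"Recall@1": (4.1, 13.8), "Recall@5": (12.2, 34.5),
             "Recall@10": (16.3, 44.8), "R-precision": (2.4, 10.6)},
    "SGRAF": {"Recall@1": (0.0, 6.9), "Recall@5": (8.2, 24.1),
              "Recall@10": (10.2, 34.5), "R-precision": (1.3, 7.9)},
}


def ratio(conceptual: float, descriptive: float) -> Optional[float]:
    """Descriptive-to-conceptual ratio, or None when the conceptual score is 0."""
    if conceptual == 0.0:
        return None
    return descriptive / conceptual


# Print one line per (model, metric) pair.
for model, metrics in RESULTS.items():
    for metric, (conceptual, descriptive) in metrics.items():
        r = ratio(conceptual, descriptive)
        shown = "n/a" if r is None else f"{r:.2f}x"
        print(f"{model:7s} {metric:12s} {shown}")
```

The ratio is undefined for SGRAF at Recall@1, where the conceptual score is 0.0, so the sketch reports `n/a` there rather than dividing by zero.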

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/does-the-performance-of-text-to-image/image-retrieval-on-conqa-conceptual)](https://paperswithcode.com/sota/image-retrieval-on-conqa-conceptual?p=does-the-performance-of-text-to-image)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/does-the-performance-of-text-to-image/image-retrieval-on-conqa-descriptive)](https://paperswithcode.com/sota/image-retrieval-on-conqa-descriptive?p=does-the-performance-of-text-to-image)`

Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?

European Conference on Information Retrieval 2024 · Juan Manuel Rodriguez, Nima Tavassoli, Eliezer Levy, Gil Lederman, Dima Sivov, Matteo Lissandrini, Davide Mottin

Text-image retrieval (T2I) refers to the task of recovering all images relevant to a keyword query. Popular datasets for text-image retrieval, such as Flickr30k, VG, or MS-COCO, use annotated image captions, e.g., “a man playing with a kid”, as a surrogate for queries. With such surrogate queries, current multi-modal machine learning models, such as CLIP or BLIP, perform remarkably well. The main reason is the descriptive nature of captions, which detail the content of an image. Yet, T2I queries go beyond the mere descriptions in image-caption pairs. Thus, these datasets are ill-suited to test methods on more abstract or conceptual queries, e.g., “family vacations”. In such queries, the image content is implied rather than explicitly described. In this paper, we replicate the T2I results on descriptive queries and generalize them to conceptual queries. To this end, we perform new experiments on a novel T2I benchmark for the task of conceptual query answering, called ConQA. ConQA comprises 30 descriptive and 50 conceptual queries on 43k images with more than 100 manually annotated images per query. Our results on established measures show that both large pretrained models (e.g., CLIP, BLIP, and BLIP-2) and small models (e.g., SGRAF and NAAF) perform up to 4x better on descriptive queries than on conceptual ones. We also find that the models perform better on queries with more than 6 keywords, as in MS-COCO captions.
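
The results for this paper are reported as Recall@k and R-precision. As a rough illustration of how such per-query scores can be computed and averaged, here is a minimal sketch. It assumes that Recall@k counts a query as a hit when at least one relevant image appears in the top k results (a common convention in image-text retrieval; the paper may define it differently), and it uses the standard definition of R-precision: precision over the top R results, where R is the number of relevant images for the query. The function names and the toy image ids are hypothetical and do not come from the ConQA code or data.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant image is among the top k results, else 0.0."""
    return 1.0 if any(img in relevant for img in ranked[:k]) else 0.0


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision over the top R results, where R = number of relevant images."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for img in ranked[:r] if img in relevant)
    return hits / r


def mean_over_queries(scores: Sequence[float]) -> float:
    """Average per-query scores and express the result as a percentage."""
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


# Toy data: hypothetical image ids, not taken from ConQA.
rankings = {
    "family vacations": ["img3", "img7", "img1", "img9"],
    "a man playing with a kid": ["img2", "img5", "img8", "img4"],
}
relevant = {
    "family vacations": {"img1", "img9"},
    "a man playing with a kid": {"img2", "img4", "img6"},
}

recall_1 = mean_over_queries(
    [recall_at_k(rankings[q], relevant[q], 1) for q in rankings]
)
r_prec = mean_over_queries(
    [r_precision(rankings[q], relevant[q]) for q in rankings]
)
print(f"Recall@1: {recall_1:.1f}  R-precision: {r_prec:.1f}")
```

With this toy data the first query misses at rank 1 and the second hits, so `recall_1` is 50.0.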

Code

AU-DIS/ConQA

Tasks

Descriptive

Image Captioning

Image Retrieval

Retrieval

Datasets

Introduced in the Paper:

ConQA

Results from the Paper

Ranked #1 on Image Retrieval on ConQA Conceptual

Methods

BLIP • CLIP
