TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Composed Image Retrieval (ZS-CIR)	CIRCO	LinCIR (CLIP L/14)	mAP@10	13.58	# 4
Zero-Shot Composed Image Retrieval (ZS-CIR)	CIRCO	LinCIR (CLIP G/14)	mAP@10	21.01	# 2
Zero-Shot Composed Image Retrieval (ZS-CIR)	CIRR	LinCIR (CLIP L/14)	R@5	53.25	# 11
Zero-Shot Composed Image Retrieval (ZS-CIR)	CIRR	LinCIR (CLIP G/14)	R@5	64.72	# 3
Zero-Shot Composed Image Retrieval (ZS-CIR)	Fashion IQ	LinCIR (CLIP G/14)	(Recall@10+Recall@50)/2	55.40	# 1
Zero-Shot Composed Image Retrieval (ZS-CIR)	Fashion IQ	LinCIR (CLIP L/14)	(Recall@10+Recall@50)/2	36.39	# 8
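
The metric names in the table can be read as follows. This is a minimal sketch, not the benchmarks' official evaluation code: the function names are hypothetical, the `mAP@k` normalization by `min(k, number of targets)` is one common convention assumed here, and ranked lists are assumed to contain no duplicate ids.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Return 1.0 if any ground-truth id appears in the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(target_ids) else 0.0


def average_precision_at_k(ranked_ids, target_ids, k):
    """Average precision over the top-k results (assumed normalization)."""
    targets = set(target_ids)
    if not targets:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in targets:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(targets))


def mean_over_queries(metric, queries, k):
    """Average a per-query metric over (ranked_ids, target_ids) pairs, in percent."""
    return 100.0 * sum(metric(r, t, k) for r, t in queries) / len(queries)


def averaged_recall(queries, k_small=10, k_large=50):
    """(Recall@k_small + Recall@k_large) / 2, the form reported for Fashion IQ."""
    low = mean_over_queries(recall_at_k, queries, k_small)
    high = mean_over_queries(recall_at_k, queries, k_large)
    return (low + high) / 2
```

Under these assumptions, `mean_over_queries(recall_at_k, queries, 5)` corresponds to R@5 and `mean_over_queries(average_precision_at_k, queries, 10)` to mAP@10.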

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-only-efficient-training-of-zero-shot/zero-shot-composed-image-retrieval-zs-cir-on-2)](https://paperswithcode.com/sota/zero-shot-composed-image-retrieval-zs-cir-on-2?p=language-only-efficient-training-of-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-only-efficient-training-of-zero-shot/zero-shot-composed-image-retrieval-zs-cir-on)](https://paperswithcode.com/sota/zero-shot-composed-image-retrieval-zs-cir-on?p=language-only-efficient-training-of-zero-shot)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-only-efficient-training-of-zero-shot/zero-shot-composed-image-retrieval-zs-cir-on-1)](https://paperswithcode.com/sota/zero-shot-composed-image-retrieval-zs-cir-on-1?p=language-only-efficient-training-of-zero-shot)`

Language-only Efficient Training of Zero-shot Composed Image Retrieval

4 Dec 2023 · Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun

The composed image retrieval (CIR) task takes a composed query of an image and text and aims to retrieve images that satisfy both conditions. Conventional CIR approaches need a training dataset of triplets (query image, query text, target image), which is very expensive to collect. Several recent works have adopted the zero-shot (ZS) CIR paradigm to tackle this issue without using pre-collected triplets. However, existing ZS-CIR methods show limited backbone scalability and generalizability because the input texts seen during training lack diversity. We propose a novel CIR framework that uses only language for training. Our LinCIR (Language-only training for CIR) can be trained on text datasets alone using a novel self-supervision scheme named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text with the projected embedding. Then, we train the new and original texts to have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with a CLIP ViT-G backbone is trained in 48 minutes and achieves the best ZS-CIR performance on four CIR benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
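
The self-masking projection step described in the abstract can be illustrated with a toy example. This is a sketch under loud assumptions, not the authors' implementation: mean pooling stands in for the frozen CLIP text encoder, the projection is a plain linear map, the objective is assumed to be mean squared error (the abstract only says the two latent vectors should match), and all names (`encode`, `project`, `smp_loss`) are hypothetical.

```python
import random

DIM = 4  # toy embedding width (assumption)


def make_table(vocab, dim=DIM, seed=0):
    """Build a toy token-embedding table with seeded random vectors."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}


def encode(token_vectors):
    """Toy stand-in for a frozen text encoder: mean-pool the token embeddings."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def project(latent, weight):
    """Hypothetical linear projection from latent space to token-embedding space."""
    return [sum(w * x for w, x in zip(row, latent)) for row in weight]


def smp_loss(tokens, keywords, table, weight):
    """MSE between the latent of the original text and of the self-masked text."""
    original = [table[t] for t in tokens]
    latent = encode(original)
    pseudo = project(latent, weight)  # projected embedding replaces keywords
    masked = [pseudo if t in keywords else table[t] for t in tokens]
    new_latent = encode(masked)
    return sum((a - b) ** 2 for a, b in zip(new_latent, latent)) / len(latent)


tokens = ["a", "photo", "of", "a", "dog"]
table = make_table(set(tokens))
identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
loss = smp_loss(tokens, {"dog"}, table, identity)
```

In training, the projection weights would be updated to reduce this loss while the encoder stays fixed; the identity matrix above is only a placeholder starting point.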

Code

navervision/lincir official

Tasks

Image Retrieval

Retrieval

Zero-Shot Composed Image Retrieval (ZS-CIR)

Datasets

MS COCO

OpenWebText

Fashion IQ

CIRR

CIRCO

GeneCIS

Results from the Paper

Ranked #1 on Zero-Shot Composed Image Retrieval (ZS-CIR) on Fashion IQ

Methods

CLIP
