Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

13 Nov 2023 · Junyang Chen, Hanjiang Lai

Zero-shot composed image retrieval (ZS-CIR), which aims to retrieve a target image based on textual modifications to a reference image without triplet labeling, has attracted increasing attention. Current ZS-CIR research mainly relies on two models pre-trained on unlabeled data: a vision-language model, e.g., CLIP, and a Pic2Word/textual-inversion model. However, there is a substantial discrepancy between these pre-trained models and the CIR task: the pre-trained models learn similarities between vision and language, whereas CIR aims to learn text-guided modifications of an image. In this paper, we introduce a novel unlabeled, pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task. We first reformulate pre-trained vision-language contrastive learning as a CIR task, randomly masking input image patches to generate a $\langle$masked image, text, image$\rangle$ triplet from each image-text pair. We then propose masked tuning, which uses the text and the masked image to learn the modifications of the original image. With this simple design, the model learns to capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over baseline models on three ZS-CIR datasets: FashionIQ, CIRR, and CIRCO.
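To make the triplet construction concrete, below is a minimal PyTorch sketch of the masked-tuning idea described in the abstract: random image patches are zeroed out, and the (masked image, text) pair is trained contrastively to match the embedding of the original image. The helper names (`mask_patches`, `masked_tuning_loss`), the 0.75 mask ratio, the additive query fusion, and the InfoNCE-style loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_patches(images, patch_size=16, mask_ratio=0.75):
    """Randomly zero out a fraction of non-overlapping patches.

    images: (B, C, H, W) with H and W divisible by patch_size.
    mask_ratio of 0.75 is an assumption; the paper may use another value.
    """
    B, C, H, W = images.shape
    ph, pw = H // patch_size, W // patch_size
    num_patches = ph * pw
    num_masked = int(mask_ratio * num_patches)

    # Pick `num_masked` random patch indices per image.
    noise = torch.rand(B, num_patches, device=images.device)
    masked_idx = noise.argsort(dim=1)[:, :num_masked]   # (B, num_masked)
    keep = torch.ones(B, num_patches, device=images.device)
    keep.scatter_(1, masked_idx, 0.0)                   # 0 marks a masked patch

    # Broadcast the patch-level mask back to pixel resolution.
    keep = keep.view(B, 1, ph, pw)
    keep = keep.repeat_interleave(patch_size, dim=2)
    keep = keep.repeat_interleave(patch_size, dim=3)
    return images * keep

def masked_tuning_loss(image_encoder, text_encoder, images, texts, tau=0.07):
    """Contrastive loss: (masked image + text) query vs. original-image targets.

    image_encoder / text_encoder stand in for CLIP- or BLIP-style encoders
    that map their inputs to a shared embedding space.
    """
    masked = mask_patches(images)
    # Hypothetical fusion: sum the two embeddings; the paper's fusion may differ.
    query = F.normalize(image_encoder(masked) + text_encoder(texts), dim=-1)
    target = F.normalize(image_encoder(images), dim=-1)
    logits = query @ target.t() / tau                   # (B, B) similarity matrix
    labels = torch.arange(images.size(0), device=images.device)
    return F.cross_entropy(logits, labels)
```

Because both the masked image and the target come from the same unlabeled image-text pair, this objective mirrors the CIR inference setup (reference image + modifying text → target image) without requiring any triplet annotation.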

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRCO | MTCIR (BLIP B/16) | mAP@10 | 8.03 | #10 |
| Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRCO | MTCIR (CLIP L/14) | mAP@10 | 11.63 | #7 |
| Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRR | MTCIR (BLIP B/16) | R@5 | 58.87 | #5 |
| Zero-Shot Composed Image Retrieval (ZS-CIR) | CIRR | MTCIR (CLIP L/14) | R@5 | 54.58 | #8 |
| Zero-Shot Composed Image Retrieval (ZS-CIR) | Fashion IQ | MTCIR (CLIP L/14) | (Recall@10+Recall@50)/2 | 46.42 | #2 |
