TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Retrieval with Multi-Modal Query	Fashion200k	Css-Net	Recall@1	23.4	# 1
Image Retrieval with Multi-Modal Query	Fashion200k	Css-Net	Recall@10	52.0	# 3
Image Retrieval with Multi-Modal Query	Fashion200k	Css-Net	Recall@50	72.0	# 2
Image Retrieval	Fashion IQ	Css-Net	(Recall@10+Recall@50)/2	51.34	# 9
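
The Recall@K figures in the table can be read as the percentage of queries whose ground-truth target appears among the top K retrieved images. The following is a minimal sketch of that computation and of the averaged (Recall@10+Recall@50)/2 score; the function names and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target appears among the top-k retrieved items.

    ranked_ids: one ranked list of candidate ids per query (best match first).
    target_ids: the ground-truth target id for each query.
    """
    if not target_ids or len(ranked_ids) != len(target_ids):
        raise ValueError("need at least one query and exactly one target per query")
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids) if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


def mean_recall(ranked_ids, target_ids, ks=(10, 50)):
    """Average of Recall@K over the cutoffs, e.g. (Recall@10 + Recall@50) / 2."""
    return sum(recall_at_k(ranked_ids, target_ids, k) for k in ks) / len(ks)


# Toy example with made-up ids (not data from the paper).
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
targets = ["a", "e", "z", "l"]
print(recall_at_k(ranked, targets, 1))          # 25.0
print(recall_at_k(ranked, targets, 2))          # 50.0
print(mean_recall(ranked, targets, ks=(1, 3)))  # 50.0
```

Small cutoffs are used in the toy example only so the numbers can be checked by hand; the benchmark rows above use K = 1, 10 and 50.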

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/relieving-triplet-ambiguity-consensus-network/image-retrieval-with-multi-modal-query-on)](https://paperswithcode.com/sota/image-retrieval-with-multi-modal-query-on?p=relieving-triplet-ambiguity-consensus-network)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/relieving-triplet-ambiguity-consensus-network/image-retrieval-on-fashion-iq)](https://paperswithcode.com/sota/image-retrieval-on-fashion-iq?p=relieving-triplet-ambiguity-consensus-network)`

Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval

3 Jun 2023 · Xu Zhang, Zhedong Zheng, Xiaohan Wang, Yi Yang

Language-guided image retrieval enables users to search for images and interact with the retrieval system more naturally and expressively by using a reference image and a relative caption as a query. Most existing studies focus on designing image-text composition architectures to extract discriminative visual-linguistic relations. Despite great success, we identify an inherent problem that obstructs the extraction of discriminative features and considerably compromises model training: triplet ambiguity. This problem stems from the annotation process, wherein annotators view only one triplet at a time. As a result, they often describe simple attributes, such as color, while neglecting fine-grained details like location and style. This leads to multiple false-negative candidates matching the same modification text. We propose a novel Consensus Network (Css-Net) that self-adaptively learns from noisy triplets to minimize the negative effects of triplet ambiguity. Inspired by the psychological finding that groups perform better than individuals, Css-Net comprises 1) a consensus module featuring four distinct compositors that generate diverse fused image-text embeddings and 2) a Kullback-Leibler divergence loss, which fosters learning among the compositors, enabling them to reduce biases learned from noisy triplets and reach a consensus. The decisions from the four compositors are weighted during evaluation to further achieve consensus. Comprehensive experiments on three datasets demonstrate that Css-Net can alleviate triplet ambiguity, achieving competitive performance on benchmarks, such as +2.77% R@10 and +6.67% R@50 on FashionIQ.
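
The abstract describes two mechanisms: a Kullback-Leibler divergence loss that lets the compositors learn from one another, and a weighted combination of their decisions at evaluation time. The sketch below illustrates the general idea only. The abstract does not give the formulas, so the softmax over candidates, the symmetric pairwise averaging, the uniform weights and all function names are assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Turn raw candidate scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consensus_loss(all_scores):
    """Mean pairwise KL between the compositors' distributions (assumed form)."""
    probs = [softmax(s) for s in all_scores]
    n = len(probs)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(probs[i], probs[j]) for i, j in pairs) / len(pairs)


def weighted_decision(all_scores, weights):
    """Weighted sum of per-compositor scores for each candidate."""
    num_candidates = len(all_scores[0])
    return [
        sum(w * s[c] for w, s in zip(weights, all_scores))
        for c in range(num_candidates)
    ]


# Hypothetical scores from four compositors over three candidate images.
scores = [
    [2.0, 1.0, 0.1],
    [1.8, 1.2, 0.0],
    [0.5, 2.5, 0.3],  # this compositor disagrees with the other three
    [2.2, 0.9, 0.2],
]
weights = [0.25, 0.25, 0.25, 0.25]  # assumed uniform; the paper's weights may differ

fused = weighted_decision(scores, weights)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)  # 0
print(consensus_loss(scores) > 0.0)  # True: the compositors do not fully agree
```

In this toy case the dissenting third compositor is outvoted in the fused score, and the loss drops to zero only when all compositors produce the same distribution over candidates.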

Code

No code implementations yet.

Tasks

Image Retrieval

Image Retrieval with Multi-Modal Query

Retrieval

Datasets

Fashion IQ

Results from the Paper

Ranked #1 on Image Retrieval with Multi-Modal Query on Fashion200k

Methods

Focus
