TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK	REMOVE
Visual Reasoning	Winoground	ViT-B/16 + BERT base + ViLEM	Text Score	36.5	# 45
Visual Reasoning	Winoground	ViT-B/16 + BERT base	Text Score	31.2	# 61

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vilem-visual-language-error-modeling-for/visual-reasoning-on-winoground)](https://paperswithcode.com/sota/visual-reasoning-on-winoground?p=vilem-visual-language-error-modeling-for)`

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

CVPR 2023 · Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Weiming Hu, XiaoHu Qie, Jianping Wu ·

Dominant pre-training works for image-text retrieval adopt "dual-encoder" architecture to enable high efficiency, where two encoders are used to extract image and text representations and contrastive learning is employed for global alignment. However, coarse-grained global alignment ignores detailed semantic associations between image and text. In this work, we propose a novel proxy task, named Visual-Language Error Modeling (ViLEM), to inject detailed image-text association into "dual-encoder" model by "proofreading" each word in the text against the corresponding image. Specifically, we first edit the image-paired text to automatically generate diverse plausible negative texts with pre-trained language models. ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words via resorting to image information. Furthermore, we propose a multi-granularity interaction framework to perform ViLEM via interacting text features with both global and local image features, which associates local text semantics with both high-level visual context and multi-level local visual information. Our method surpasses state-of-the-art "dual-encoder" methods by a large margin on the image-text retrieval task and significantly improves discriminativeness to local textual semantics. Our model can also generalize well to video-text retrieval.

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Contrastive Learning

Retrieval

Text Retrieval

Video-Text Retrieval

Visual Reasoning

Datasets

MS COCO

MSR-VTT Winoground

Results from the Paper

Add Remove

Ranked #45 on Visual Reasoning on Winoground

Get a GitHub badge

Task	Dataset	Model	Metric Name	Metric Value	Global Rank	Benchmark
Visual Reasoning	Winoground	ViT-B/16 + BERT base + ViLEM	Text Score	36.5	# 45	Compare
Visual Reasoning	Winoground	ViT-B/16 + BERT base	Text Score	31.2	# 61	Compare

Methods

Add Remove

Contrastive Learning

Edit Social Preview

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit Add Remove

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Add Remove

Methods

Add Remove