Does Structural Attention Improve Compositional Representations in Vision-Language Models?
Although scaling self-supervised approaches has achieved widespread success in vision-language pre-training, a number of works that inject structural knowledge of visually grounded semantics have recently shown incremental performance gains. Past work hypothesizes that providing models with structural knowledge, in the form of scene graphs, syntax parses, etc., will improve structure alignment and thus preserve representational compositionality, a core feature of human cognition. We compare one such Structural Training model to a Structural Attention model that learns inter-modal structure alignment only implicitly, through a self-supervised attention regularizer. We find that the latter model yields a 52% improvement over its baseline on the Winoground evaluation dataset, establishing a new vision-language compositionality state of the art (Group = 16.00). We begin exploring why this self-supervised approach succeeds where a more strongly supervised approach fails, specifically analyzing what the auxiliary loss implicitly conveys about structural knowledge.
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Visual Reasoning | Winoground | CACR base | Text Score | 39.25 | #37
Visual Reasoning | Winoground | CACR base | Image Score | 17.75 | #58
Visual Reasoning | Winoground | CACR base | Group Score | 14.25 | #48
Visual Reasoning | Winoground | ROSITA (Flickr30k) | Text Score | 35.25 | #51
Visual Reasoning | Winoground | ROSITA (Flickr30k) | Image Score | 15.25 | #68
Visual Reasoning | Winoground | ROSITA (Flickr30k) | Group Score | 12.25 | #56
Visual Reasoning | Winoground | IAIS large (COCO) | Text Score | 41.75 | #32
Visual Reasoning | Winoground | IAIS large (COCO) | Image Score | 19.75 | #52
Visual Reasoning | Winoground | IAIS large (COCO) | Group Score | 15.50 | #43
Visual Reasoning | Winoground | IAIS large (Flickr30k) | Text Score | 42.50 | #27
Visual Reasoning | Winoground | IAIS large (Flickr30k) | Image Score | 19.75 | #52
Visual Reasoning | Winoground | IAIS large (Flickr30k) | Group Score | 16.00 | #42
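The Text, Image, and Group scores in the table follow Winoground's standard pairwise criteria: each example pairs two captions (C0, C1) with two images (I0, I1), and a model earns the text score when it prefers the correct caption for each image, the image score when it prefers the correct image for each caption, and the group score only when it satisfies both. A minimal sketch (the 2x2 score matrix and example values here are illustrative, not from the paper):

```python
def text_correct(s):
    # s[i][j] = model similarity score between caption i and image j
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]

def image_correct(s):
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]

def group_correct(s):
    return text_correct(s) and image_correct(s)

# Hypothetical per-example score matrices from some model.
examples = [
    [[0.9, 0.2], [0.1, 0.8]],  # both pairings matched correctly
    [[0.9, 0.8], [0.7, 0.6]],  # caption 0 preferred for both images
]

# Dataset-level metrics are the fraction of examples passing each criterion.
text_score = sum(text_correct(s) for s in examples) / len(examples)
image_score = sum(image_correct(s) for s in examples) / len(examples)
group_score = sum(group_correct(s) for s in examples) / len(examples)
```

Because the group criterion is the conjunction of the other two, the group score is always at most the minimum of the text and image scores, which is why it is the hardest column in the table.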