Does Structural Attention Improve Compositional Representations in Vision-Language Models?

Although scaling self-supervised approaches has seen widespread success in Vision-Language pre-training, a number of works that supply models with structural knowledge of visually grounded semantics have recently shown incremental performance gains. Past work hypothesizes that providing structural knowledge in the form of scene graphs, syntactic parses, etc. will yield better Structure Alignment and thus preserve representational compositionality, a core feature of human cognition. We compare one such Structural Training model to a Structural Attention model that learns inter-modal structure alignment only implicitly, through a self-supervised attention regularizer. We report that the latter model yields a 52% improvement over its baseline on the Winoground evaluation dataset, establishing a new vision-language compositionality state of the art (Group score = 16.00). We begin to explore why this self-supervised approach succeeds where the more strongly supervised approach fails, specifically analyzing what the auxiliary loss implicitly conveys about structural knowledge.
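
As a rough illustration of what a self-supervised attention regularizer of this kind can look like, the sketch below penalizes disagreement between a model's text self-attention and a text-to-text attention map induced by routing through the visual tokens. This is a minimal, assumed formulation in PyTorch; the function name, tensor shapes, and the choice of divergence are illustrative and are not taken from the paper.

import torch

def attention_congruence_loss(text_self_attn, cross_attn_t2v, cross_attn_v2t, eps=1e-8):
    """Hypothetical self-supervised attention regularizer (illustrative only).

    text_self_attn : (B, T, T) row-normalized text self-attention map.
    cross_attn_t2v : (B, T, V) text-to-vision cross-attention map.
    cross_attn_v2t : (B, V, T) vision-to-text cross-attention map.

    Composes the two cross-modal maps into a text-to-text attention map
    routed "through" the visual tokens, then penalizes its divergence from
    the model's own text self-attention.
    """
    # Induced text-to-text attention via the visual modality: (B, T, T).
    induced = torch.bmm(cross_attn_t2v, cross_attn_v2t)
    induced = induced / induced.sum(dim=-1, keepdim=True).clamp_min(eps)

    # Row-wise KL(text_self_attn || induced), averaged over batch and rows.
    p = text_self_attn.clamp_min(eps)
    q = induced.clamp_min(eps)
    return (p * (p.log() - q.log())).sum(dim=-1).mean()

In practice such a term would be added to the pre-training objective as an auxiliary loss, e.g. total_loss = task_loss + reg_weight * attention_congruence_loss(...), with reg_weight a tunable hyperparameter (again, an assumption rather than the paper's reported setup).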

Results from the Paper


Task               Dataset      Model                    Metric        Value   Global Rank
Visual Reasoning   Winoground   CACR base                Text Score    39.25   #37
Visual Reasoning   Winoground   CACR base                Image Score   17.75   #58
Visual Reasoning   Winoground   CACR base                Group Score   14.25   #48
Visual Reasoning   Winoground   ROSITA (Flickr30k)       Text Score    35.25   #51
Visual Reasoning   Winoground   ROSITA (Flickr30k)       Image Score   15.25   #68
Visual Reasoning   Winoground   ROSITA (Flickr30k)       Group Score   12.25   #56
Visual Reasoning   Winoground   IAIS large (COCO)        Text Score    41.75   #32
Visual Reasoning   Winoground   IAIS large (COCO)        Image Score   19.75   #52
Visual Reasoning   Winoground   IAIS large (COCO)        Group Score   15.50   #43
Visual Reasoning   Winoground   IAIS large (Flickr30k)   Text Score    42.50   #27
Visual Reasoning   Winoground   IAIS large (Flickr30k)   Image Score   19.75   #52
Visual Reasoning   Winoground   IAIS large (Flickr30k)   Group Score   16.00   #42
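
For readers unfamiliar with the metrics above: Winoground reports three accuracies per model, each computed from the model's match scores for the four caption-image pairings of an example (two captions c0, c1 and two images i0, i1). The sketch below restates those standard definitions in Python, with scores expressed as percentages to match the table; the input field names are an assumption made for illustration.

def winoground_scores(examples):
    """Compute Winoground text, image, and group scores (as percentages).

    `examples` is a list of dicts holding a model's match scores for the
    four caption-image pairings of one example:
        s_c0_i0, s_c0_i1, s_c1_i0, s_c1_i1
    where s_cX_iY is the score for caption X paired with image Y.
    (Field names here are illustrative.)
    """
    text_correct = image_correct = group_correct = 0
    for ex in examples:
        # Text score: each image must prefer its own caption.
        t = ex["s_c0_i0"] > ex["s_c1_i0"] and ex["s_c1_i1"] > ex["s_c0_i1"]
        # Image score: each caption must prefer its own image.
        i = ex["s_c0_i0"] > ex["s_c0_i1"] and ex["s_c1_i1"] > ex["s_c1_i0"]
        text_correct += t
        image_correct += i
        group_correct += t and i
    n = len(examples)
    return {"text": 100 * text_correct / n,
            "image": 100 * image_correct / n,
            "group": 100 * group_correct / n}

The group score requires both the text and image conditions to hold for the same example, so it is strictly the hardest of the three; it is the headline number (16.00 for IAIS large on Flickr30k) cited in the abstract above.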
