Does Structural Attention Improve Compositional Representations in Vision-Language Models?
Although scaling self-supervised approaches has achieved widespread success in vision-language pre-training, a number of works that inject structural knowledge of visually grounded semantics have recently shown incremental performance gains. Past work hypothesizes that providing models with structural knowledge, in the form of scene graphs, syntax parses, etc., will improve structure alignment and thus preserve representational compositionality, a core feature of human cognition. We compare one such Structural Training model to a Structural Attention model that learns inter-modal structure alignment only implicitly, through a self-supervised attention regularizer. We find that the latter model yields a 52% improvement over its baseline on the Winoground evaluation dataset, establishing a new vision-language compositionality state of the art (Group = 16.00). We begin exploring why this self-supervised approach succeeds where a more strongly supervised approach fails, specifically analyzing what the auxiliary loss implicitly conveys about structural knowledge.
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Visual Reasoning | Winoground | CACR base | Text Score | 39.25 | #37
Visual Reasoning | Winoground | CACR base | Image Score | 17.75 | #58
Visual Reasoning | Winoground | CACR base | Group Score | 14.25 | #48
Visual Reasoning | Winoground | ROSITA (Flickr30k) | Text Score | 35.25 | #51
Visual Reasoning | Winoground | ROSITA (Flickr30k) | Image Score | 15.25 | #68
Visual Reasoning | Winoground | ROSITA (Flickr30k) | Group Score | 12.25 | #56
Visual Reasoning | Winoground | IAIS large (COCO) | Text Score | 41.75 | #32
Visual Reasoning | Winoground | IAIS large (COCO) | Image Score | 19.75 | #52
Visual Reasoning | Winoground | IAIS large (COCO) | Group Score | 15.50 | #43
Visual Reasoning | Winoground | IAIS large (Flickr30k) | Text Score | 42.50 | #27
Visual Reasoning | Winoground | IAIS large (Flickr30k) | Image Score | 19.75 | #52
Visual Reasoning | Winoground | IAIS large (Flickr30k) | Group Score | 16.00 | #42
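The Text, Image, and Group scores in the table follow Winoground's standard pairwise criteria: each example pairs two captions (C0, C1) with two images (I0, I1), and a model earns the text score when it prefers the correct caption for each image, the image score when it prefers the correct image for each caption, and the group score only when it satisfies both. A minimal sketch (the 2x2 score matrix and example values here are illustrative, not from the paper):

```python
def text_correct(s):
    # s[i][j] = model similarity score between caption i and image j
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]

def image_correct(s):
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]

def group_correct(s):
    return text_correct(s) and image_correct(s)

# Hypothetical per-example score matrices from some model.
examples = [
    [[0.9, 0.2], [0.1, 0.8]],  # both pairings matched correctly
    [[0.9, 0.8], [0.7, 0.6]],  # caption 0 preferred for both images
]

# Dataset-level metrics are the fraction of examples passing each criterion.
text_score = sum(text_correct(s) for s in examples) / len(examples)
image_score = sum(image_correct(s) for s in examples) / len(examples)
group_score = sum(group_correct(s) for s in examples) / len(examples)
```

Because the group criterion is the conjunction of the other two, the group score is always at most the minimum of the text and image scores, which is why it is the hardest column in the table.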