Prompting Large Vision-Language Models for Compositional Reasoning
Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address these issues, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains a further improvement of up to 10% in accuracy when enhanced with the optimal description.
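As a rough illustration of the "depict, then reason" idea the abstract describes, the sketch below first prompts a large vision-language model to describe an image, then prompts it to reason step by step over that description to pick the matching caption. The `query_lvlm` helper and both prompts are hypothetical placeholders, not the paper's actual KeyComp pipeline.

```python
# A minimal sketch of the two-stage generative matching idea, assuming a
# generic LVLM text-completion interface. Prompts are illustrative only.

def query_lvlm(prompt: str, image=None) -> str:
    """Hypothetical wrapper around a large vision-language model API."""
    raise NotImplementedError("plug in your LVLM client here")

def match_caption(image, captions: list[str]) -> int:
    # Stage 1: prompt the LVLM to depict the image in natural language.
    description = query_lvlm("Describe this image in detail.", image=image)

    # Stage 2: prompt the model to reason step by step over the description
    # and select the caption whose compositional structure matches it.
    options = "\n".join(f"{i}: {c}" for i, c in enumerate(captions))
    answer = query_lvlm(
        f"Image description:\n{description}\n\n"
        f"Candidate captions:\n{options}\n\n"
        "Reason step by step about which caption matches the description, "
        "then output only the index of the best caption."
    )
    return int(answer.strip().split()[-1])
```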
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Reasoning | Winoground | KeyComp* (GPT-4) | Text Score | 43.5 | #22 |
| Visual Reasoning | Winoground | KeyComp* (GPT-4) | Image Score | 28.7 | #21 |
| Visual Reasoning | Winoground | KeyComp* (GPT-4) | Group Score | 18.2 | #36 |
| Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Text Score | 42.7 | #26 |
| Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Image Score | 27.8 | #23 |
| Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Group Score | 17.4 | #38 |
| Visual Reasoning | Winoground | KeyComp (GPT-3.5) | Text Score | 30.3 | #66 |
| Visual Reasoning | Winoground | KeyComp (GPT-3.5) | Image Score | 24.6 | #37 |
| Visual Reasoning | Winoground | KeyComp (GPT-3.5) | Group Score | 12.4 | #54 |