Prompting Large Vision-Language Models for Compositional Reasoning

20 Jan 2024 · Timothy Ossowski, Ming Jiang, Junjie Hu

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still struggle to match images and texts that share similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address these issues, we take an exploratory step with a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and achieves up to a further 10% accuracy improvement when enhanced with the optimal description.
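
The sketch below illustrates the general generative-matching idea described in the abstract: obtain a textual depiction of the image, then prompt a large language model to reason step by step about which candidate caption it matches. It is a minimal illustration, not the authors' exact KeyComp pipeline; `describe_image` and `llm` are hypothetical stand-ins for a captioning vision-language model and a chat LLM (e.g., GPT-4), and the prompt wording is assumed rather than taken from the paper.

```python
def describe_image(image_path: str) -> str:
    """Hypothetical: ask a vision-language model to depict the image in text."""
    raise NotImplementedError


def llm(prompt: str) -> str:
    """Hypothetical: query a large language model (e.g., GPT-4) and return its reply."""
    raise NotImplementedError


def match_caption(image_path: str, caption_a: str, caption_b: str) -> str:
    """Prompt the LLM to reason step by step about which caption fits the image.

    Returns 'A' or 'B' (Winoground-style: two candidate captions per image).
    """
    description = describe_image(image_path)
    prompt = (
        f"Image description: {description}\n"
        f"Caption A: {caption_a}\n"
        f"Caption B: {caption_b}\n"
        "Think step by step about the objects, their attributes, and their "
        "relationships, then answer with 'A' or 'B' for the caption that "
        "matches the image."
    )
    return llm(prompt).strip()
```

Replacing a single cosine-similarity score with this kind of explicit, text-based reasoning is what distinguishes the generative approach from embedding-based matching.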


Results from the Paper


Task             | Dataset    | Model              | Metric Name | Metric Value | Global Rank
Visual Reasoning | Winoground | KeyComp* (GPT-4)   | Text Score  | 43.5         | #22
Visual Reasoning | Winoground | KeyComp* (GPT-4)   | Image Score | 28.7         | #21
Visual Reasoning | Winoground | KeyComp* (GPT-4)   | Group Score | 18.2         | #36
Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Text Score  | 42.7         | #26
Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Image Score | 27.8         | #23
Visual Reasoning | Winoground | KeyComp* (GPT-3.5) | Group Score | 17.4         | #38
Visual Reasoning | Winoground | KeyComp (GPT-3.5)  | Text Score  | 30.3         | #66
Visual Reasoning | Winoground | KeyComp (GPT-3.5)  | Image Score | 24.6         | #37
Visual Reasoning | Winoground | KeyComp (GPT-3.5)  | Group Score | 12.4         | #54