The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

15 Nov 2023  ·  Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

This study explores whether the Chain-of-Thought approach, known for its proficiency in language tasks because it breaks them down into sub-tasks and intermediate steps, can improve vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy improves probing task performance by 50%, establishing the groundwork for future research on reasoning paradigms in complex vision-language tasks.
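The "Description then Decision" idea can be illustrated with a minimal two-stage sketch: first ask the model to describe the image, then ask it to decide between two candidate captions given that description. This is only an illustration under stated assumptions, not the authors' implementation. The `model` callable, the prompt wording, the function name `description_then_decision`, and the choice to pass the image again in the second stage are all hypothetical.

```python
from typing import Callable

# Hypothetical interface (an assumption, not from the paper): a callable that
# takes an image reference and a text prompt and returns the model's reply.
VisionLanguageModel = Callable[[str, str], str]


def description_then_decision(
    model: VisionLanguageModel,
    image: str,
    option_a: str,
    option_b: str,
) -> str:
    """Pick one of two captions for an image using two model calls."""
    # Stage 1 (description): elicit an explicit description of the image.
    description = model(image, "Describe the image in detail.")

    # Stage 2 (decision): choose between the two options, conditioning on
    # the description produced in stage 1. The prompt text is illustrative.
    prompt = (
        f"Image description: {description}\n"
        "Which caption matches the image?\n"
        f"A. {option_a}\n"
        f"B. {option_b}\n"
        "Answer with A or B."
    )
    answer = model(image, prompt).strip().upper()

    if answer.startswith("A"):
        return option_a
    if answer.startswith("B"):
        return option_b
    raise ValueError(f"Unrecognized answer from model: {answer!r}")


if __name__ == "__main__":
    # Stub model for demonstration only; a real system would call a
    # vision-language model here.
    def stub_model(image: str, prompt: str) -> str:
        if prompt.startswith("Describe"):
            return "A dog is chasing a cat across a lawn."
        return "B"

    print(description_then_decision(
        stub_model, "example.png", "a cat chases a dog", "a dog chases a cat"
    ))
```

Separating the two calls keeps the intermediate description inspectable, which makes it easier to tell whether an error came from perception (a wrong description) or from reasoning (a wrong decision).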


Results from the Paper


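Task              Dataset     Model                                   Metric Name   Metric Value   Global Rank
Visual Reasoning  Winoground  GPT-4V (CoT, pick b/w two options)      Text Score    75.25          #1
Visual Reasoning  Winoground  GPT-4V (CoT, pick b/w two options)      Image Score   68.75          #1
Visual Reasoning  Winoground  GPT-4V (CoT, pick b/w two options)      Group Score   58.75          #1
Visual Reasoning  Winoground  GPT-4V (pick b/w two options)           Text Score    69.25          #2
Visual Reasoning  Winoground  GPT-4V (pick b/w two options)           Image Score   46.25          #7
Visual Reasoning  Winoground  GPT-4V (pick b/w two options)           Group Score   39.25          #7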
