Visual Reasoning
211 papers with code • 12 benchmarks • 41 datasets
The ability to understand the actions and reasoning associated with visual images.
Latest papers with no code
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.
Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA
Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99% accuracy on these questions.
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning.
Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection
In the odd-one-out task and two held-out configurations, RAISE can leverage acquired latent concepts and atomic rules to find the rule-breaking image in a matrix and handle problems with unseen combinations of rules and attributes.
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, from simple pick-and-place to tasks requiring intent recognition and visual reasoning.
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
When exploring the development of Artificial General Intelligence (AGI), a critical task for these models involves interpreting and processing information from multiple image inputs.
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning.
ChartBench: A Benchmark for Complex Visual Reasoning in Charts
Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generating capabilities.
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG).
Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects
Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks -- from describing object semantics to physical properties.