Visual Reasoning

211 papers with code • 12 benchmarks • 41 datasets

The ability to understand the actions and reasoning associated with visual images.

Libraries

Use these libraries to find Visual Reasoning models and implementations
See all 7 libraries.

Latest papers with no code

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

no code yet • 9 Feb 2024

By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.

Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

no code yet • 29 Jan 2024

Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99% accuracy on these questions.

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

no code yet • 24 Jan 2024

Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human performance in human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning.

Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection

no code yet • 18 Jan 2024

In the odd-one-out task and two held-out configurations, RAISE can leverage acquired latent concepts and atomic rules to find the rule-breaking image in a matrix and handle problems with unseen combinations of rules and attributes.

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

no code yet • 8 Jan 2024

Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, from simple pick-and-place operations to tasks requiring intent recognition and visual reasoning.

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

no code yet • 5 Jan 2024

When exploring the development of Artificial General Intelligence (AGI), a critical task for these models involves interpreting and processing information from multiple image inputs.
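The contrastive chain-of-thought idea in the title can be illustrated with a prompt that asks the model to compare the images before answering. The function and wording below are a hypothetical sketch, not the paper's exact prompt format:

```python
# Hypothetical sketch of a contrastive chain-of-thought style prompt for a
# multimodal model given several image inputs: the model is asked to list
# similarities and differences among the images first, then answer the
# question. Placeholder tags like <image_1> stand in for actual image inputs.

def build_contrastive_prompt(question: str, num_images: int) -> str:
    image_tags = " ".join(f"<image_{i}>" for i in range(1, num_images + 1))
    return (
        f"{image_tags}\n"
        "Step 1: Describe the similarities among the images.\n"
        "Step 2: Describe the differences among the images.\n"
        f"Step 3: Using those observations, answer: {question}"
    )

prompt = build_contrastive_prompt("Which image shows a muffin?", 3)
print(prompt)
```

The contrastive steps encourage the model to ground its answer in cross-image comparison rather than describing each image in isolation.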

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

no code yet • 3 Jan 2024

Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning.
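The "LLMs as programmers" pattern referenced in the title has the language model emit a short program that composes perception primitives to answer a compositional query. The primitives below are illustrative stubs operating on a toy object list, not a real vision API:

```python
# Hypothetical sketch of the LLM-as-programmer pattern for compositional
# visual reasoning. A real system would back these primitives with detectors
# and a VLM; here they are stubs over a toy list of detected objects.

def find(image, label):
    # Stub detector: return the objects in `image` whose label matches.
    return [obj for obj in image if obj["label"] == label]

def count(objects):
    return len(objects)

def exists(objects):
    return len(objects) > 0

# A program an LLM might generate for "Are there more cats than dogs?"
def generated_program(image):
    cats = find(image, "cat")
    dogs = find(image, "dog")
    return count(cats) > count(dogs)

# Toy "image" represented as a list of detected objects.
image = [{"label": "cat"}, {"label": "cat"}, {"label": "dog"}]
print(generated_program(image))  # True
```

Because the reasoning is expressed as an explicit program over reusable primitives, the approach can generalize to novel compositions without task-specific training.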

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

no code yet • 26 Dec 2023

Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generation capabilities.

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

no code yet • 7 Dec 2023

Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG).

Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects

no code yet • 29 Nov 2023

Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks -- from describing object semantics to physical properties.