Visual Reasoning

211 papers with code • 12 benchmarks • 41 datasets

The ability to understand the actions and reasoning associated with visual images.

Benchmarks

These leaderboards are used to track progress in Visual Reasoning

Dataset	Best Model
Winoground	GPT-4V (CoT, pick b/w two options)
NLVR2 Dev	BEiT-3
NLVR2 Test	BEiT-3
WinoGAViL	Humans
Bongard-OpenWorld	Human
VSR	LXMERT
PHYRE-1B-Within	RPIN
PHYRE-1B-Cross	RPIN
VASR	Swin
NLVR	VisualBERT
IRFL: Image Recognition of Figurative Language	Humans
CLEVRER	AI Core

Libraries

Use these libraries to find Visual Reasoning models and implementations

huggingface/transformers • 5 papers • 124,793 stars

facebookresearch/multimodal • 4 papers • 1,289 stars

salesforce/lavis • 3 papers • 8,701 stars

kakao/DAFT • 3 papers

Subtasks

Visual Commonsense Reasoning

Latest papers with no code

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

no code yet • 9 Feb 2024

By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities.

Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

no code yet • 29 Jan 2024

Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99% accuracy on these questions.

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

no code yet • 24 Jan 2024

Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation, indicating substantial room for improvement in context-sensitive text-rich visual reasoning.

Towards Generative Abstract Reasoning: Completing Raven's Progressive Matrix via Rule Abstraction and Selection

no code yet • 18 Jan 2024

In the odd-one-out task and two held-out configurations, RAISE can leverage acquired latent concepts and atomic rules to find the rule-breaking image in a matrix and handle problems with unseen combinations of rules and attributes.

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

no code yet • 8 Jan 2024

Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, ranging from simple pick-and-place to tasks requiring intent recognition and visual reasoning.

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

no code yet • 5 Jan 2024

When exploring the development of Artificial General Intelligence (AGI), a critical task for these models involves interpreting and processing information from multiple image inputs.

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

no code yet • 3 Jan 2024

Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning.

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

no code yet • 26 Dec 2023

Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generating capabilities.

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

no code yet • 7 Dec 2023

Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG).

Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects

no code yet • 29 Nov 2023

Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks -- from describing object semantics to physical properties.
