Visual Reasoning

215 papers with code • 12 benchmarks • 41 datasets

Ability to understand actions and reasoning associated with any visual images

Benchmarks

Add a Result

These leaderboards are used to track progress in Visual Reasoning

Dataset	Best Model	Compare
Winoground	GPT-4V (CoT, pick b/w two options)	See all
NLVR2 Dev	BEiT-3	See all
NLVR2 Test	BEiT-3	See all
WinoGAViL	Humans	See all
Bongard-OpenWorld	Human	See all
VSR	LXMERT	See all
PHYRE-1B-Within	RPIN	See all
PHYRE-1B-Cross	RPIN	See all
VASR	Swin	See all
NLVR	VisualBERT	See all
IRFL: Image Recognition of Figurative Language	Humans	See all
CLEVRER	AI Core	See all

Show all 12 benchmarks

Collapse benchmarks

Libraries

Use these libraries to find Visual Reasoning models and implementations

huggingface/transformers

5 papers

125,478

facebookresearch/multimodal

4 papers

1,304

salesforce/lavis

3 papers

8,794

kakao/DAFT

3 papers

See all 7 libraries.

Datasets

Subtasks

Visual Commonsense Reasoning

Most implemented papers

Most implemented Social Latest No code

Learning by Abstraction: The Neural State Machine

stanfordnlp/mac-network • • NeurIPS 2019

We introduce the Neural State Machine, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning.

Paper
Code

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa • • 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.

Paper
Code

How is ChatGPT's behavior changing over time?

lchen001/llmdrift • 18 Jul 2023

We find that the performance and behavior of both GPT-3. 5 and GPT-4 can vary greatly over time.

Paper
Code

CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

ruotianluo/iep-ref • • CVPR 2019

Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.

Paper
Code

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

chuangg/CLEVRER • • ICLR 2020

While these models thrive on the perception-based task (descriptive), they perform poorly on the causal tasks (explanatory, predictive and counterfactual), suggesting that a principled approach for causal reasoning should incorporate the capability of both perceiving complex visual and language inputs, and understanding the underlying dynamics and causal relations.

Paper
Code

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

researchmm/soho • • CVPR 2021

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.

Paper
Code

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

Cuberick-Orion/CIRR • ICCV 2021

We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion.

Paper
Code

Visually Grounded Reasoning across Languages and Cultures

e-bug/volta • • EMNLP 2021

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet.

Paper
Code

FLAVA: A Foundational Language And Vision Alignment Model

facebookresearch/multimodal • • CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Paper
Code

Collaborative Transformers for Grounded Situation Recognition

jhcho99/coformer • • CVPR 2022

To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation.

Paper
Code

Visual Reasoning

Benchmarks Add a Result

Libraries

Datasets

Subtasks

Most implemented papers

Content

Benchmarks

Add a Result