Visual Reasoning
215 papers with code • 12 benchmarks • 41 datasets
The ability to understand and reason about actions depicted in visual images.
Latest papers
Interpreting and Controlling Vision Foundation Models via Text Explanations
Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks.
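As a concrete illustration of this backbone pattern, here is a minimal sketch using the Hugging Face transformers CLIP classes to score an image against candidate captions. The checkpoint name and file path are illustrative; downstream visual reasoning systems typically reuse the frozen image and text towers rather than the similarity head.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint (checkpoint name is illustrative).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical input image
texts = ["a dog chasing a ball", "a cat sleeping on a sofa"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity logits between the image and each caption;
# as a backbone, the image/text encoders would be reused and kept frozen.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```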
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
We also conceived a neuro-symbolic reasoning approach that combines LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
Deep network models are often purely inductive during both training and inference on unseen data.
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
In this paper, we address the limitation above by 1) introducing the Vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows a VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the VLM's in-context learning ability; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.
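To make the idea of a multi-modal context scheme concrete, here is a minimal, hypothetical sketch of assembling an interleaved image-text in-context prompt. The <image> placeholder token, file names, and build_icl_prompt helper are assumptions for illustration, not MMICL's actual API.

```python
# Hypothetical sketch: few-shot (image, question, answer) exemplars are
# interleaved with a query, following a fixed prompt scheme.

IMAGE_TOKEN = "<image>"  # placeholder a VLM would replace with visual features

def build_icl_prompt(exemplars, query_image, query_question):
    """exemplars: list of (image_path, question, answer) tuples."""
    parts, images = [], []
    for img, q, a in exemplars:
        images.append(img)
        parts.append(f"{IMAGE_TOKEN} Question: {q} Answer: {a}")
    # The query follows the same scheme but leaves the answer open.
    images.append(query_image)
    parts.append(f"{IMAGE_TOKEN} Question: {query_question} Answer:")
    return "\n".join(parts), images

prompt, image_list = build_icl_prompt(
    [("demo1.jpg", "How many dogs are there?", "Two"),
     ("demo2.jpg", "What color is the car?", "Red")],
    "query.jpg",
    "What is the person holding?",
)
print(prompt)
```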
Collecting Visually-Grounded Dialogue with A Game Of Sorts
We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts".
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs.
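As a rough illustration of what a consistency measurement can look like, the sketch below checks how often a VLM's direct answer agrees with its answer after chain-of-thought prompting. The query_vlm function is an assumed stand-in for any VLM API, and this is a generic sketch, not the CURE pipeline itself.

```python
def query_vlm(image_path, prompt):
    # Dummy response so the sketch runs end-to-end; replace with a real
    # VLM call (assumption, not part of the CURE benchmark).
    return "two"

def extract_final_answer(response):
    # Naive normalization; real pipelines parse the rationale more carefully.
    return response.strip().split()[-1].lower().rstrip(".")

def consistency_score(samples):
    agree = 0
    for image_path, question in samples:
        direct = query_vlm(image_path, f"Q: {question} Answer with one word.")
        cot = query_vlm(image_path, f"Q: {question} Think step by step, then answer.")
        agree += int(extract_final_answer(direct) == extract_final_answer(cot))
    return agree / len(samples)

print(consistency_score([("demo.jpg", "How many dogs are in the image?")]))
# 1.0 with the dummy model; a real VLM would reveal genuine inconsistencies
```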
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns.
An Examination of the Compositionality of Large Generative Vision-Language Models
A challenging new task is subsequently added to evaluate the robustness of GVLMs against their inherent inclination toward syntactic correctness.
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
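For context on the baselines named above, here is a minimal sketch of a LoRA-style parameter-efficient layer: a frozen pre-trained weight plus a trainable low-rank update. Dimensions and hyperparameters are illustrative, and this is the baseline family VL-PET is compared against, not VL-PET's own granularity-controlled module.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B @ A,
    scaled by alpha / r (a sketch of the LoRA baseline, not VL-PET)."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pre-trained weight
        self.base.bias.requires_grad_(False)    # freeze pre-trained bias
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

x = torch.randn(4, 768)
layer = LoRALinear(768, 768)
print(layer(x).shape)  # torch.Size([4, 768])
```

Only lora_A and lora_B receive gradients, which is what makes such modules lightweight relative to full fine-tuning.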