Multimodal Reasoning
38 papers with code • 3 benchmarks • 4 datasets
Reasoning over multimodal inputs.
Most implemented papers
Visual Goal-Step Inference using wikiHow
Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities.
MERLOT: Multimodal Neural Script Knowledge Models
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision
The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos.
PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes.
Fine-Grained Visual Entailment
In this paper, we propose an extension of the visual entailment task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Moreover, since the backbones are query-agnostic, it is difficult to completely avoid the inconsistency issue by training the visual backbone end-to-end in the visual grounding framework.
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g. the likelihood of black tail or red wing) to composite concepts (e.g. a red-winged blackbird).
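The composition step described above can be sketched as a small learned map from primitive concept scores to composite concept scores. This is a minimal illustrative sketch, not the paper's implementation: the concept names, shapes, and the choice of a linear map with softmax are all assumptions.

```python
import numpy as np

# Hypothetical sketch: a VL model (not shown) has produced activation
# scores for primitive concepts such as "black tail" and "red wing".
# A composition model then maps these to composite concepts such as
# "red-winged blackbird". All names and dimensions are illustrative.

rng = np.random.default_rng(0)

n_primitives = 4   # e.g. black tail, red wing, pointed beak, brown body
n_composites = 2   # e.g. red-winged blackbird, sparrow

# Primitive concept activations for one image (placeholder values).
primitive_acts = rng.random(n_primitives)

# A simple composition model: linear map followed by softmax.
W = rng.standard_normal((n_composites, n_primitives))
b = np.zeros(n_composites)

logits = W @ primitive_acts + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # one probability per composite concept
```

In practice the weights of such a composition model would be learned from labeled composite-concept data; the random weights here only show the data flow.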
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on.
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
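The chain-of-thought setup described above can be illustrated with a prompt that asks the model for a lecture and an explanation before the final answer. This is a hedged sketch of the general prompting pattern; the template wording and the helper name `build_cot_prompt` are assumptions, not the paper's exact prompt.

```python
# Illustrative sketch of a chain-of-thought prompt in the style the
# ScienceQA paper describes: the model is prompted to generate a
# lecture and an explanation before answering. Wording is hypothetical.

def build_cot_prompt(question: str, options: list) -> str:
    # Label options (A), (B), ... as multiple-choice answers.
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return (
        f"Question: {question}\n"
        f"Options:\n{opts}\n"
        "First write a short lecture on the relevant concept, "
        "then an explanation, then the final answer.\n"
        "Lecture:"
    )

prompt = build_cot_prompt(
    "Which of these materials is an electrical conductor?",
    ["rubber", "copper"],
)
print(prompt)
```

A model completion would then contain the lecture, explanation, and answer in sequence, mimicking the multi-hop reasoning process the blurb mentions.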
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks.