Multimodal Reasoning

38 papers with code • 3 benchmarks • 4 datasets

Reasoning over inputs that span multiple modalities, such as images, text, audio, and video.

Most implemented papers

Visual Goal-Step Inference using wikiHow

yueyang1996/wikihow-vgsi EMNLP 2021

Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities.

MERLOT: Multimodal Neural Script Knowledge Models

rowanz/merlot NeurIPS 2021

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

tgxs002/wikiscenes ICCV 2021

The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos.

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

samuelyu2002/pacs 21 Mar 2022

Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes.

Fine-Grained Visual Entailment

skrighyz/fgve 29 Mar 2022

In this paper, we propose an extension of the visual entailment task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

lukeforeveryoung/qrnet CVPR 2022

Since pretrained visual backbones are query-agnostic, the features they extract can be inconsistent with what a given language query needs, and this inconsistency is difficult to avoid completely even by training the visual backbone end-to-end in the visual grounding framework.
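
As an illustration of what "query-modulated" can mean in practice, here is a minimal FiLM-style sketch in which the language query predicts a per-channel scale and shift for the backbone features; this is our own reading of the idea, not QRNet's exact architecture.

```python
# Illustrative sketch of query-modulated visual features (FiLM style):
# the language query predicts per-channel scale/shift applied to the
# backbone feature map. Our reading of the idea, not QRNet's design.
import torch
import torch.nn as nn

class QueryModulation(nn.Module):
    def __init__(self, query_dim, channels):
        super().__init__()
        self.to_scale = nn.Linear(query_dim, channels)
        self.to_shift = nn.Linear(query_dim, channels)

    def forward(self, feats, query):       # feats: (B, C, H, W), query: (B, D)
        gamma = self.to_scale(query)[:, :, None, None]
        beta = self.to_shift(query)[:, :, None, None]
        return feats * (1 + gamma) + beta  # query-aware refinement

feats = torch.randn(2, 256, 20, 20)        # backbone feature map
query = torch.randn(2, 512)                # pooled language query embedding
refined = QueryModulation(512, 256)(feats, query)
```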

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

tttyuntian/vlm_primitive_concepts 31 Mar 2022

CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g., the likelihood of black tail or red wing) to composite concepts (e.g., a red-winged blackbird).
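
A minimal sketch of this two-stage recipe, assuming CLIP as the VL model and a linear composition head; the prompt wording, primitive list, and placeholder image path are illustrative, not the paper's exact setup.

```python
# Sketch of the CompMap two-stage idea: (1) score primitive concepts
# with a VL model's text prompts, (2) learn a composition model over
# those activations. CLIP and the linear head are our assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

primitives = ["black tail", "red wing", "yellow beak"]  # hypothetical list
prompts = [f"a photo of a bird with a {p}" for p in primitives]

image = Image.open("bird.jpg")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# Primitive concept activations: image-text similarity per primitive.
activations = out.logits_per_image.softmax(dim=-1)      # (1, num_primitives)

# Stage 2: a learned composition model maps primitive activations to
# composite concepts (e.g. "red-winged blackbird"); train with cross-entropy.
num_composites = 200
compose = torch.nn.Linear(len(primitives), num_composites)
composite_logits = compose(activations)                 # (1, num_composites)
```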

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

google-research/google-research 1 Apr 2022

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on.
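
A minimal sketch of the Socratic Models pattern, in which models from different domains exchange information through language: a vision model describes the image in text, and a language model reasons over that description. The specific checkpoints and prompt below are our own assumptions, not the paper's configuration.

```python
# Sketch of zero-shot multimodal composition via language: a captioner
# translates the image into text, which a language model then consumes.
# Checkpoints and prompt wording are illustrative assumptions.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
lm = pipeline("text-generation", model="gpt2")

caption = captioner("photo.jpg")[0]["generated_text"]  # vision -> language
prompt = f"Scene: {caption}\nQ: What might happen next?\nA:"
answer = lm(prompt, max_new_tokens=40)[0]["generated_text"]
print(answer)
```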

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

lupantech/ScienceQA 20 Sep 2022

We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
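
A hedged sketch of such a chain-of-thought prompt, asking the model to produce a lecture and explanation before committing to an answer; the wording below is our assumption, not the paper's exact template.

```python
# Sketch of a ScienceQA-style CoT prompt: the model is asked for a
# lecture and explanation before the final answer. Prompt wording is
# our own assumption.
def build_cot_prompt(question, choices, context=""):
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Options:\n{options}\n"
        "First write a short lecture on the relevant concept, then an "
        "explanation, and finally the answer in the form 'Answer: (X)'."
    )

prompt = build_cot_prompt(
    question="Which property do these objects have in common?",
    choices=["hard", "soft", "stretchy"],
    context="Objects: a rubber band, a balloon.",
)
print(prompt)  # send to a language model of your choice
```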

Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

mitjanikolaus/multimodal-predicate-noun-dependencies 21 Oct 2022

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks.