Multimodal Reasoning
38 papers with code • 3 benchmarks • 4 datasets
Reasoning over multimodal inputs.
Most implemented papers
Visual Goal-Step Inference using wikiHow
Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities.
MERLOT: Multimodal Neural Script Knowledge Models
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision
The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos.
PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes.
Fine-Grained Visual Entailment
In this paper, we propose an extension of the visual entailment task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Moreover, since the backbones are query-agnostic, it is difficult to completely avoid the inconsistency issue by training the visual backbone end-to-end in the visual grounding framework.
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g. the likelihood of black tail or red wing) to composite concepts (e.g. a red-winged blackbird).
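The composition step described above can be sketched as a small learned map from primitive concept scores to composite concept scores. This is a minimal illustrative sketch, not the paper's implementation: the concept names, shapes, and the choice of a linear map with softmax are all assumptions.

```python
import numpy as np

# Hypothetical sketch: a VL model (not shown) has produced activation
# scores for primitive concepts such as "black tail" and "red wing".
# A composition model then maps these to composite concepts such as
# "red-winged blackbird". All names and dimensions are illustrative.

rng = np.random.default_rng(0)

n_primitives = 4   # e.g. black tail, red wing, pointed beak, brown body
n_composites = 2   # e.g. red-winged blackbird, sparrow

# Primitive concept activations for one image (placeholder values).
primitive_acts = rng.random(n_primitives)

# A simple composition model: linear map followed by softmax.
W = rng.standard_normal((n_composites, n_primitives))
b = np.zeros(n_composites)

logits = W @ primitive_acts + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # one probability per composite concept
```

In practice the weights of such a composition model would be learned from labeled composite-concept data; the random weights here only show the data flow.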
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on.
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
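The chain-of-thought setup described above can be illustrated with a prompt that asks the model for a lecture and an explanation before the final answer. This is a hedged sketch of the general prompting pattern; the template wording and the helper name `build_cot_prompt` are assumptions, not the paper's exact prompt.

```python
# Illustrative sketch of a chain-of-thought prompt in the style the
# ScienceQA paper describes: the model is prompted to generate a
# lecture and an explanation before answering. Wording is hypothetical.

def build_cot_prompt(question: str, options: list) -> str:
    # Label options (A), (B), ... as multiple-choice answers.
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return (
        f"Question: {question}\n"
        f"Options:\n{opts}\n"
        "First write a short lecture on the relevant concept, "
        "then an explanation, then the final answer.\n"
        "Lecture:"
    )

prompt = build_cot_prompt(
    "Which of these materials is an electrical conductor?",
    ["rubber", "copper"],
)
print(prompt)
```

A model completion would then contain the lecture, explanation, and answer in sequence, mimicking the multi-hop reasoning process the blurb mentions.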
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks.