Visual Reasoning
211 papers with code • 12 benchmarks • 41 datasets
The ability to understand and reason about the actions and relationships depicted in visual images.
Latest papers
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
Compositional Chain-of-Thought Prompting for Large Multimodal Models
The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks.
Solving ARC visual analogies with neural embeddings and vector arithmetic: A generalized method
This project focuses on visual analogical reasoning and extends a generalized mechanism originally developed for solving verbal analogies to the visual domain.
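A minimal sketch of the word2vec-style vector arithmetic the abstract alludes to, using hand-set toy embeddings; in the paper's setting the vectors would come from a neural encoder over ARC grids, and all names and values below are illustrative, not from the paper:

```python
import numpy as np

# Toy embedding table standing in for learned neural embeddings.
emb = {
    "small_square": np.array([1.0, 0.0, 0.0]),
    "large_square": np.array([1.0, 1.0, 0.0]),
    "small_circle": np.array([0.0, 0.0, 1.0]),
    "large_circle": np.array([0.0, 1.0, 1.0]),
}

def solve_analogy(a, b, c, table):
    """A : B :: C : ?  ->  nearest neighbor of emb(B) - emb(A) + emb(C)."""
    target = table[b] - table[a] + table[c]
    candidates = {k: v for k, v in table.items() if k not in (a, b, c)}
    return min(candidates, key=lambda k: np.linalg.norm(candidates[k] - target))

print(solve_analogy("small_square", "large_square", "small_circle", emb))
# -> "large_circle": the "enlarge" transformation transfers across shapes.
```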
NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment
We present a loosely coupled neuro-symbolic visual understanding and reasoning framework: a DNN-based pipeline performs object detection and multi-modal pairwise relationship prediction to generate scene graphs, which are then enriched with common-sense knowledge from heterogeneous knowledge graphs to improve downstream reasoning.
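As a rough illustration of scene-graph enrichment with common-sense facts (the triple format and knowledge entries below are hypothetical stand-ins, not the paper's actual pipeline or KG schema):

```python
# Scene graph as (subject, predicate, object) triples from a detection pipeline.
scene_graph = [("person", "holding", "umbrella"), ("person", "on", "sidewalk")]

# Toy stand-in for a common-sense knowledge graph such as ConceptNet.
commonsense_kg = {
    "umbrella": [("umbrella", "used_for", "rain_protection")],
    "sidewalk": [("sidewalk", "part_of", "street")],
}

def enrich(graph, kg):
    """Append KG facts about every entity mentioned in the scene graph."""
    entities = {e for s, _, o in graph for e in (s, o)}
    enriched = list(graph)
    for entity in entities:
        enriched.extend(kg.get(entity, []))
    return enriched

for triple in enrich(scene_graph, commonsense_kg):
    print(triple)
```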
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
By conducting a comprehensive empirical study, we find that instructions focused on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs on evaluation benchmarks.
Weakly Supervised Semantic Parsing with Execution-based Spurious Program Filtering
The problem of spurious programs is a longstanding challenge when training a semantic parser from weak supervision.
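A toy sketch of the general idea behind execution-based filtering under weak supervision: many candidate programs happen to reach the labeled answer, and executing them on additional inputs exposes behavioral differences that can be used to filter likely-spurious ones. The mini-DSL below is invented for illustration and is not the paper's formalism:

```python
# Candidate "programs" mapping a list of numbers to an answer; under weak
# supervision we only know the gold answer for an input, not the gold program.
programs = {
    "max":     lambda xs: max(xs),
    "last":    lambda xs: xs[-1],
    "sum-3":   lambda xs: sum(xs) - 3,
    "first+3": lambda xs: xs[0] + 3,
}

labeled_input, gold_answer = [1, 4, 2, 4], 4
# Several programs are consistent with the single labeled example:
consistent = {n: p for n, p in programs.items() if p(labeled_input) == gold_answer}

# Executing on extra inputs yields per-program "execution signatures";
# programs whose signatures diverge from the rest can be down-weighted
# or filtered as likely spurious.
probe_inputs = [[5, 1, 2], [2, 2, 9]]
signatures = {n: tuple(p(x) for x in probe_inputs) for n, p in consistent.items()}
print(signatures)
```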
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English.
What's Left? Concept Grounding with Logic-Enhanced Foundation Models
We propose the Logic-Enhanced Foundation Model (LEFT), a unified framework that learns to ground and reason with concepts across domains with a differentiable, domain-independent, first-order logic-based program executor.
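A very small sketch of what a differentiable first-order-logic executor can look like, using product/max fuzzy semantics over per-object concept scores; the concept scores and operator choices here are illustrative assumptions, not LEFT's actual implementation:

```python
import numpy as np

# Soft concept scores in [0, 1] for three objects in a scene, e.g. produced
# by a neural grounding module.
red    = np.array([0.9, 0.1, 0.8])   # red(x) per object
round_ = np.array([0.2, 0.95, 0.7])  # round(x) per object

# Fuzzy-logic connectives: everything stays differentiable w.r.t. the scores.
AND    = lambda p, q: p * q          # conjunction as elementwise product
EXISTS = lambda p: np.max(p)         # existential quantifier as max over objects

# Execute the program: exists x. red(x) AND round(x)
score = EXISTS(AND(red, round_))
print(score)  # 0.8 * 0.7 = 0.56 -> soft truth of "there is a red round object"
```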
Interpreting and Controlling Vision Foundation Models via Text Explanations
Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks.
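The title suggests mapping internal activations to natural-language descriptions; one generic way to do this in a shared CLIP-style embedding space (a sketch under assumptions, not the paper's method) is to rank a vocabulary of text embeddings by similarity to a visual feature. The phrases and vectors below are made up for illustration:

```python
import numpy as np

# Hypothetical text embeddings in the same space as the visual feature,
# e.g. obtained from a CLIP text encoder.
text_emb = {
    "a furry animal": np.array([0.8, 0.1, 0.2]),
    "a city street":  np.array([0.1, 0.9, 0.3]),
    "a bowl of food": np.array([0.2, 0.2, 0.9]),
}

def explain(feature, vocab, top_k=1):
    """Rank candidate phrases by cosine similarity to a visual feature."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    ranked = sorted(vocab, key=lambda t: cos(feature, vocab[t]), reverse=True)
    return ranked[:top_k]

visual_feature = np.array([0.75, 0.2, 0.25])  # stand-in for a model activation
print(explain(visual_feature, text_emb))      # -> ['a furry animal']
```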
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
We also devise a neuro-symbolic reasoning approach that combines LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.