Visual Reasoning
214 papers with code • 12 benchmarks • 41 datasets
The ability to understand and reason about actions and relationships depicted in visual images.
Latest papers with no code
ChartBench: A Benchmark for Complex Visual Reasoning in Charts
Multimodal Large Language Models (MLLMs) demonstrate impressive image understanding and generation capabilities.
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG).
Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects
Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks -- from describing object semantics to physical properties.
From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation
Addressing the challenge of adapting pre-trained vision-language models for generating insightful explanations for visual reasoning tasks with limited annotations, we present ReVisE: a Recursive Visual Explanation algorithm.
SelfEval: Leveraging the discriminative nature of generative models for evaluation
In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner.
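The "inversion" idea above can be sketched generically: a generative model's likelihood of an image given a caption is used as a discriminative score, and the highest-scoring caption is taken as the model's answer. The `log_likelihood` callable and the toy word-overlap scorer below are illustrative assumptions, not the paper's actual implementation (real diffusion models approximate this likelihood, e.g. via an ELBO).

```python
import math

def select_caption(image, captions, log_likelihood):
    """Pick the caption the generative model scores highest under
    log p(image | caption) -- using a generative model as a classifier.

    log_likelihood is an assumed callable (image, caption) -> float.
    """
    scores = {c: log_likelihood(image, c) for c in captions}
    return max(scores, key=scores.get), scores

def toy_log_likelihood(image_words, caption):
    # Purely illustrative stand-in: score by word overlap with a
    # fake "image description" set, not a real generative model.
    overlap = len(image_words & set(caption.split()))
    return math.log1p(overlap)

image_words = {"dog", "grass"}
captions = ["a dog on grass", "a cat indoors"]
best, scores = select_caption(image_words, captions, toy_log_likelihood)
print(best)  # -> "a dog on grass"
```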
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
The study explores whether the Chain-of-Thought approach, known for its success on language tasks where it breaks problems into sub-tasks and intermediate steps, can also improve vision-language tasks that demand sophisticated perception and reasoning.
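The sub-task decomposition described above can be sketched as a prompt builder for a vision-language model. The perceive/relate/answer step list and the template wording are illustrative assumptions, not a template prescribed by the paper.

```python
def build_cot_prompt(question, steps=None):
    """Assemble a chain-of-thought prompt for a vision-language query.

    The default decomposition (perceive -> relate -> answer) is an
    illustrative assumption about how a visual question might be
    broken into intermediate steps.
    """
    steps = steps or [
        "List the objects visible in the image.",
        "Describe the relationships between the relevant objects.",
        "Combine the observations to answer the question.",
    ]
    lines = [f"Question: {question}", "Let's reason step by step:"]
    lines += [f"{i}. {s}" for i, s in enumerate(steps, 1)]
    return "\n".join(lines)

prompt = build_cot_prompt("Is the mug left of the laptop?")
print(prompt)
```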
Adaptive recurrent vision performs zero-shot computation scaling to unseen difficulty levels
In this study, we investigate a critical functional role of adaptive processing in recurrent neural networks: dynamically scaling computational resources to match input requirements, which enables zero-shot generalization to difficulty levels not seen during training. We evaluate this on two challenging visual reasoning tasks: PathFinder and Mazes.
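The input-conditional compute scaling described above can be sketched with a halting-style recurrent loop: harder inputs accumulate halting mass more slowly, so they receive more iterations before stopping. The state update and halting rule below are toy assumptions, not the paper's trained networks.

```python
def adaptive_recurrent_steps(difficulty, max_steps=50, threshold=0.99):
    """Run a toy recurrent update until cumulative halting mass
    crosses a threshold -- more difficult inputs take more steps.

    `difficulty` is an assumed scalar; real models learn a halting
    score from the hidden state instead of this fixed rule.
    """
    h = 0.0     # toy recurrent hidden state
    halt = 0.0  # cumulative halting probability
    for step in range(1, max_steps + 1):
        h = 0.5 * h + 0.1                   # toy state update
        halt += 1.0 / (difficulty + 1.0)    # harder -> slower halting
        if halt >= threshold:
            break
    return step

easy = adaptive_recurrent_steps(1.0)   # halts after few iterations
hard = adaptive_recurrent_steps(10.0)  # needs many more iterations
```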
Visual Commonsense based Heterogeneous Graph Contrastive Learning
Specifically, our model contains two key components: the Commonsense-based Contrastive Learning and the Graph Relation Network.
Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both.
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
If no existing module fits the task, we initialize a new module and specify its inputs and outputs.
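The grow-and-reuse workflow above can be sketched as a module registry: reuse a stored module when one matches the task's signature, otherwise build and register a new one. The signature keying and dict-based "module" are illustrative assumptions, not GENOME's actual code.

```python
class ModuleLibrary:
    """Registry sketch for growing and reusing visual-reasoning modules.

    A module is keyed by (name, inputs, outputs); a miss triggers
    initialization of a new module, a hit reuses the stored one.
    """
    def __init__(self):
        self.modules = {}

    def get_or_create(self, name, inputs, outputs, build_fn):
        key = (name, tuple(inputs), tuple(outputs))
        if key not in self.modules:        # grow: no match, make one
            self.modules[key] = build_fn(inputs, outputs)
        return self.modules[key]           # reuse on later tasks

lib = ModuleLibrary()
make = lambda ins, outs: {"inputs": ins, "outputs": outs}
first = lib.get_or_create("count", ["image", "object"], ["number"], make)
again = lib.get_or_create("count", ["image", "object"], ["number"], make)
print(first is again)  # -> True: the second task reuses the module
```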