Visual Reasoning
215 papers with code • 12 benchmarks • 41 datasets
The ability to understand and reason about actions depicted in visual images.
Latest papers
Interpreting and Controlling Vision Foundation Models via Text Explanations
Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks.
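As a concrete illustration of this backbone pattern, here is a minimal sketch using the Hugging Face transformers CLIP classes to score an image against candidate captions. The checkpoint name and file path are illustrative; downstream visual reasoning systems typically reuse the frozen image and text towers rather than the similarity head.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint (checkpoint name is illustrative).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical input image
texts = ["a dog chasing a ball", "a cat sleeping on a sofa"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity logits between the image and each caption;
# as a backbone, the image/text encoders would be reused and kept frozen.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```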
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
We also conceived a neuro-symbolic reasoning approach that combines LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis
Deep network models are often purely inductive during both training and inference on unseen data.
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
In this paper, we address the limitation above by 1) introducing the Vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows a VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the VLM's in-context learning ability; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.
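To make the idea of a multi-modal context scheme concrete, here is a minimal, hypothetical sketch of assembling an interleaved image-text in-context prompt. The <image> placeholder token, file names, and build_icl_prompt helper are assumptions for illustration, not MMICL's actual API.

```python
# Hypothetical sketch: few-shot (image, question, answer) exemplars are
# interleaved with a query, following a fixed prompt scheme.

IMAGE_TOKEN = "<image>"  # placeholder a VLM would replace with visual features

def build_icl_prompt(exemplars, query_image, query_question):
    """exemplars: list of (image_path, question, answer) tuples."""
    parts, images = [], []
    for img, q, a in exemplars:
        images.append(img)
        parts.append(f"{IMAGE_TOKEN} Question: {q} Answer: {a}")
    # The query follows the same scheme but leaves the answer open.
    images.append(query_image)
    parts.append(f"{IMAGE_TOKEN} Question: {query_question} Answer:")
    return "\n".join(parts), images

prompt, image_list = build_icl_prompt(
    [("demo1.jpg", "How many dogs are there?", "Two"),
     ("demo2.jpg", "What color is the car?", "Red")],
    "query.jpg",
    "What is the person holding?",
)
print(prompt)
```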
Collecting Visually-Grounded Dialogue with A Game Of Sorts
We address these concerns by introducing a collaborative image ranking task, a grounded agreement game we call "A Game Of Sorts".
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs.
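As a rough illustration of what a consistency measurement can look like, the sketch below checks how often a VLM's direct answer agrees with its answer after chain-of-thought prompting. The query_vlm function is an assumed stand-in for any VLM API, and this is a generic sketch, not the CURE pipeline itself.

```python
def query_vlm(image_path, prompt):
    # Dummy response so the sketch runs end-to-end; replace with a real
    # VLM call (assumption, not part of the CURE benchmark).
    return "two"

def extract_final_answer(response):
    # Naive normalization; real pipelines parse the rationale more carefully.
    return response.strip().split()[-1].lower().rstrip(".")

def consistency_score(samples):
    agree = 0
    for image_path, question in samples:
        direct = query_vlm(image_path, f"Q: {question} Answer with one word.")
        cot = query_vlm(image_path, f"Q: {question} Think step by step, then answer.")
        agree += int(extract_final_answer(direct) == extract_final_answer(cot))
    return agree / len(samples)

print(consistency_score([("demo.jpg", "How many dogs are in the image?")]))
# 1.0 with the dummy model; a real VLM would reveal genuine inconsistencies
```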
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns.
An Examination of the Compositionality of Large Generative Vision-Language Models
A challenging new task is subsequently added to evaluate the robustness of GVLMs against their inherent inclination toward syntactic correctness.
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
In particular, our VL-PET-large with lightweight PET module designs significantly outperforms VL-Adapter by 2.92% (3.41%) and LoRA by 3.37% (7.03%) with BART-base (T5-base) on image-text tasks.
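For context on the baselines named above, here is a minimal sketch of a LoRA-style parameter-efficient layer: a frozen pre-trained weight plus a trainable low-rank update. Dimensions and hyperparameters are illustrative, and this is the baseline family VL-PET is compared against, not VL-PET's own granularity-controlled module.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B @ A,
    scaled by alpha / r (a sketch of the LoRA baseline, not VL-PET)."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pre-trained weight
        self.base.bias.requires_grad_(False)    # freeze pre-trained bias
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

x = torch.randn(4, 768)
layer = LoRALinear(768, 768)
print(layer(x).shape)  # torch.Size([4, 768])
```

Only lora_A and lora_B receive gradients, which is what makes such modules lightweight relative to full fine-tuning.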