Visual Reasoning
211 papers with code • 12 benchmarks • 41 datasets
The ability to understand actions and reasoning associated with visual images.
Libraries
Use these libraries to find Visual Reasoning models and implementations.

Latest papers
MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems
Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts.
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
When visual tables serve as standalone visual representations, our model can closely match or even beat the SOTA MLLMs that are built on CLIP visual embeddings.
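A "visual table" here is a structured, text-based scene representation that can stand in for (or complement) CLIP embeddings. The sketch below is an illustrative assumption about what such a structure might look like and how it could be serialized for an LLM; the paper defines its own schema.

```python
# Illustrative sketch of a "visual table": a structured, text-based scene
# representation with scene-level and object-level entries. The schema is
# an assumption for illustration, not the paper's specification.
visual_table = {
    "scene": "A kitchen counter with breakfast items in morning light.",
    "objects": [
        {
            "category": "mug",
            "attributes": ["ceramic", "red", "steaming"],
            "knowledge": "Mugs hold hot drinks such as coffee or tea.",
        },
        {
            "category": "toaster",
            "attributes": ["stainless steel", "two-slot"],
            "knowledge": "A toaster browns slices of bread with heat.",
        },
    ],
}

def to_prompt(table):
    """Serialize the table into plain text an LLM can consume."""
    lines = [f"Scene: {table['scene']}"]
    for obj in table["objects"]:
        lines.append(
            f"- {obj['category']}: {', '.join(obj['attributes'])}. {obj['knowledge']}"
        )
    return "\n".join(lines)

print(to_prompt(visual_table))
```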
How Far Are We from Intelligent Visual Deductive Reasoning?
Vision-Language Models (VLMs) such as GPT-4V have recently made remarkable strides on diverse vision-language tasks.
Slot Abstractors: Toward Scalable Abstract Visual Reasoning
Abstract visual reasoning is a characteristically human ability, allowing the identification of relational patterns that are abstracted away from object features, and the systematic generalization of those patterns to unseen problems.
What Is Missing in Multilingual Visual Reasoning and How to Fix It
NLP models today strive to support multiple languages and modalities, improving accessibility for diverse users.
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension.
Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning
This paper further investigates whether disentangled representations are necessary for downstream applications.
PALO: A Polyglot Large Multimodal Model for 5B People
PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population).
Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach
These models use several kinds of attention mechanisms to segregate the individual objects in a scene from the background and from one another.
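A widely used mechanism of this kind is slot attention (Locatello et al., 2020), in which a fixed set of slot vectors compete for image features. The snippet below is a minimal, simplified sketch of a single slot-attention iteration (projections, LayerNorm, and the GRU update of the original method are omitted); it is not the implementation from any of the compared models.

```python
import torch
import torch.nn.functional as F

def slot_attention_step(slots, inputs, temperature=1.0):
    """One simplified slot-attention iteration (after Locatello et al., 2020).

    slots:  (num_slots, d)  -- current slot vectors (queries)
    inputs: (num_inputs, d) -- encoded image features (keys/values)
    """
    d = slots.shape[-1]
    logits = inputs @ slots.t() / (d ** 0.5)        # (num_inputs, num_slots)
    # Softmax over slots, so slots compete to explain each input feature.
    attn = F.softmax(logits / temperature, dim=-1)
    attn = attn / attn.sum(dim=0, keepdim=True)     # normalize per slot
    updates = attn.t() @ inputs                     # weighted mean per slot
    return updates

slots = torch.randn(4, 64)     # 4 object slots
inputs = torch.randn(196, 64)  # e.g. a 14x14 feature map, flattened
slots = slot_attention_step(slots, inputs)
print(slots.shape)  # torch.Size([4, 64])
```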
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
Vision-Language Models (VLMs) have demonstrated broad viability thanks to extensive training that aligns visual instructions with answers.
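CogCoM's "chain of manipulations" lets the model reach an answer through intermediate image operations rather than a single forward pass. The loop below is a hypothetical sketch of that idea; the helper names (`vlm_step`, `crop`, `zoom`) and the action format are assumptions for illustration, not CogCoM's actual interface or manipulation set.

```python
from PIL import Image

# Hypothetical manipulation primitives; CogCoM defines its own set
# (e.g., grounding, cropping, zooming) and learns when to invoke them.
def crop(image, box):
    return image.crop(box)

def zoom(image, factor):
    w, h = image.size
    return image.resize((int(w * factor), int(h * factor)))

MANIPULATIONS = {"crop": crop, "zoom": zoom}

def answer_with_manipulations(vlm_step, image, question, max_steps=4):
    """Iteratively let the model request image manipulations before answering.

    `vlm_step` is an assumed callable: (image, question) -> action dict,
    either {"answer": str} or {"op": name, "args": {...}}.
    """
    for _ in range(max_steps):
        action = vlm_step(image, question)
        if "answer" in action:
            return action["answer"]
        op = MANIPULATIONS[action["op"]]
        image = op(image, **action["args"])  # apply the requested manipulation
    return vlm_step(image, question).get("answer", "unanswered")
```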