Visual Reasoning

215 papers with code • 12 benchmarks • 41 datasets

The ability to understand and reason about actions, objects, and relationships depicted in visual images.


Most implemented papers

Learning by Abstraction: The Neural State Machine

stanfordnlp/mac-network NeurIPS 2019

We introduce the Neural State Machine, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning.
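As a rough intuition for what the paper means by a "state machine" over a scene, here is a minimal sketch of attention-based traversal over a scene graph. All sizes and feature values are toy placeholders, and the update rule is a simplification of the paper's probabilistic formulation, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 5, 8
node_feats = rng.normal(size=(num_nodes, dim))   # one vector per object in the scene
adjacency = rng.random((num_nodes, num_nodes))   # soft edge weights between objects
instructions = rng.normal(size=(3, dim))         # reasoning steps decoded from the question

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Start with uniform attention over objects, then shift it along edges
# according to how well each object matches the current instruction.
attention = np.full(num_nodes, 1.0 / num_nodes)
for ins in instructions:
    relevance = softmax(node_feats @ ins)                     # objects the step points to
    attention = softmax((adjacency.T @ attention) * relevance)  # redistribute along edges

answer_state = attention @ node_feats   # attention-weighted readout for answering
print("final attention over objects:", attention.round(3))
```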

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
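The unifying idea is that every task reduces to the same interface: an image plus an instruction in, text out. The sketch below illustrates that framing with a stub model so it runs standalone; the `generate` method and `StubModel` are hypothetical placeholders, and OFA's real entry points live in the ofa-sys/ofa repository.

```python
class StubModel:
    """Placeholder standing in for a unified sequence-to-sequence model."""
    def generate(self, image, prompt):
        # A real model would encode the image and prompt, then decode an answer.
        return f"<answer to: {prompt!r}>"

model = StubModel()
image = "path/to/image.jpg"  # placeholder input

# The same model handles different tasks purely through the instruction text:
for task, instruction in {
    "captioning": "what does the image describe?",
    "vqa":        "how many people are in the picture?",
    "grounding":  'which region does the text "red umbrella" describe?',
}.items():
    print(task, "->", model.generate(image, instruction))
```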

How is ChatGPT's behavior changing over time?

lchen001/llmdrift 18 Jul 2023

We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
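The evaluation protocol is simple to reproduce in outline: query the same prompts against two snapshots of a model taken at different dates and compare accuracy. In the sketch below the responses are hard-coded stand-ins rather than real API outputs; a real run would call the model at each point in time and store what it returns.

```python
prompts = ["Is 17077 a prime number? Answer yes or no.",
           "Is 100 a prime number? Answer yes or no."]
gold = ["yes", "no"]

# Stand-in outputs from two snapshots of the same model (toy values).
responses = {
    "2023-03": ["yes", "no"],
    "2023-06": ["no", "no"],
}

for snapshot, outputs in responses.items():
    acc = sum(o == g for o, g in zip(outputs, gold)) / len(gold)
    print(f"{snapshot}: accuracy = {acc:.0%}")
```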

CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

ruotianluo/iep-ref CVPR 2019

Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process.

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

chuangg/CLEVRER ICLR 2020

While these models thrive on the perception-based (descriptive) task, they perform poorly on the causal tasks (explanatory, predictive, and counterfactual). This suggests that a principled approach to causal reasoning should combine the capability to perceive complex visual and language inputs with an understanding of the underlying dynamics and causal relations.
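Because the gap between perception and causal reasoning is the point of the benchmark, CLEVRER-style evaluation reports accuracy per question type rather than a single number. A minimal sketch of that breakdown, with toy predictions rather than results from any real model:

```python
from collections import defaultdict

examples = [  # toy predictions, one per question type
    {"type": "descriptive",    "pred": "2",   "gold": "2"},
    {"type": "explanatory",    "pred": "a",   "gold": "b"},
    {"type": "predictive",     "pred": "yes", "gold": "no"},
    {"type": "counterfactual", "pred": "no",  "gold": "no"},
]

correct, total = defaultdict(int), defaultdict(int)
for ex in examples:
    total[ex["type"]] += 1
    correct[ex["type"]] += ex["pred"] == ex["gold"]

for qtype in total:
    print(f"{qtype:15s} accuracy: {correct[qtype] / total[qtype]:.0%}")
```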

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

researchmm/soho CVPR 2021

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
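SOHO's alternative to region features is to take dense grid features from the whole image and quantize each cell to its nearest entry in a learned visual dictionary, yielding discrete "visual words" that align more naturally with language tokens. Below is a minimal sketch of that nearest-neighbor assignment; sizes and values are toy placeholders, and the paper's momentum-based dictionary updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_feats = rng.normal(size=(49, 32))    # 7x7 feature map from a CNN backbone
dictionary = rng.normal(size=(256, 32))   # 256 learnable visual-word embeddings

# Nearest-neighbor assignment: each grid cell becomes one visual token id.
dists = ((grid_feats[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
token_ids = dists.argmin(axis=1)          # shape (49,), indices into the dictionary

quantized = dictionary[token_ids]         # embeddings fed to the cross-modal transformer
print("visual token ids:", token_ids[:10])
```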

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

Cuberick-Orion/CIRR ICCV 2021

We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion.
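The task CIRR targets is composed image retrieval: a reference image plus a modification sentence are fused into one query, and gallery images are ranked against it. The sketch below uses a plain sum as the fusion step purely for illustration; CIRPLANT learns this fusion with a pre-trained vision-and-language transformer, and all embeddings here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
ref_image_emb = rng.normal(size=64)          # embedding of the reference image
mod_text_emb  = rng.normal(size=64)          # embedding of "make the dog face left"
candidates    = rng.normal(size=(1000, 64))  # gallery of candidate image embeddings

query = ref_image_emb + mod_text_emb         # toy fusion of image and text
query /= np.linalg.norm(query)
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

scores = candidates @ query                  # cosine similarity to every candidate
print("top-5 retrieved indices:", np.argsort(-scores)[:5])
```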

Visually Grounded Reasoning across Languages and Cultures

e-bug/volta EMNLP 2021

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet.

FLAVA: A Foundational Language And Vision Alignment Model

facebookresearch/multimodal CVPR 2022

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
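FLAVA checkpoints are published through the facebookresearch/multimodal repo and the Hugging Face transformers library. A hedged usage sketch assuming the transformers integration and the "facebook/flava-full" checkpoint; exact output field names may differ across library versions.

```python
import torch
from PIL import Image
from transformers import FlavaProcessor, FlavaModel

processor = FlavaProcessor.from_pretrained("facebook/flava-full")
model = FlavaModel.from_pretrained("facebook/flava-full")

image = Image.new("RGB", (224, 224))  # placeholder; load a real image instead
inputs = processor(text=["a photo of a cat"], images=[image],
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# One model exposes unimodal and fused multimodal representations.
print(outputs.image_embeddings.shape)       # per-patch image features
print(outputs.text_embeddings.shape)        # per-token text features
print(outputs.multimodal_embeddings.shape)  # fused cross-modal features
```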

Collaborative Transformers for Grounded Situation Recognition

jhcho99/coformer CVPR 2022

To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation.
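A structural sketch of the Glance-Gaze split in PyTorch: one transformer classifies the activity (verb) from the image tokens, and a second attends with per-role queries to estimate the entities (nouns). All dimensions, heads, and the role-query mechanism below are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GlanceGazeSketch(nn.Module):
    def __init__(self, dim=64, num_verbs=10, num_nouns=20, num_roles=6):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.glance = nn.TransformerEncoder(layer(), num_layers=2)  # verb branch
        self.gaze = nn.TransformerEncoder(layer(), num_layers=2)    # role/noun branch
        self.role_queries = nn.Parameter(torch.randn(num_roles, dim))
        self.verb_head = nn.Linear(dim, num_verbs)
        self.noun_head = nn.Linear(dim, num_nouns)

    def forward(self, image_tokens):                        # (batch, tokens, dim)
        glanced = self.glance(image_tokens)
        verb_logits = self.verb_head(glanced.mean(dim=1))   # activity classification
        roles = self.role_queries.expand(image_tokens.size(0), -1, -1)
        gazed = self.gaze(torch.cat([glanced, roles], dim=1))
        noun_logits = self.noun_head(gazed[:, -roles.size(1):])  # one noun per role
        return verb_logits, noun_logits

model = GlanceGazeSketch()
verbs, nouns = model(torch.randn(2, 49, 64))
print(verbs.shape, nouns.shape)  # (2, 10) and (2, 6, 20)
```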