Visual Commonsense Reasoning

29 papers with code • 7 benchmarks • 7 datasets

Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
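ViLBERT is known for processing the two modalities in separate transformer streams that exchange information through co-attentional layers. A minimal sketch of that exchange (simplified; the dimensions and layer structure are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Cross-modal attention in the spirit of ViLBERT's co-attentional
    layers: each stream queries the OTHER modality (simplified sketch)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # Queries come from one stream; keys/values from the other.
        txt_out, _ = self.txt_attends_img(txt, img, img)
        img_out, _ = self.img_attends_txt(img, txt, txt)
        return txt_out, img_out

# Toy usage: a batch of 12 text tokens and 36 image regions.
txt, img = torch.randn(1, 12, 768), torch.randn(1, 36, 768)
txt_out, img_out = CoAttention()(txt, img)
```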

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Unlike previous work, which applies joint random masking to both modalities, we use conditional masking on the pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
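A rough sketch of the conditional-masking idea in PyTorch (tensor names, shapes, and constants are hypothetical, not UNITER's actual code): at each pre-training step only one modality is corrupted, so the masked tokens or regions can be reconstructed from a fully observed partner modality.

```python
import torch

MASK_ID = 103        # hypothetical [MASK] token id
MASK_PROB = 0.15     # BERT-style masking rate

def conditional_mask(text_ids, region_feats, mask_text=True):
    """Corrupt ONE modality; the other stays fully observed.

    text_ids:     (batch, n_tokens)  integer token ids
    region_feats: (batch, n_regions, dim)  visual region features
    """
    text_ids, region_feats = text_ids.clone(), region_feats.clone()
    if mask_text:
        # Masked language modeling: mask words, keep every region visible.
        mask = torch.rand(text_ids.shape) < MASK_PROB
        text_ids[mask] = MASK_ID
    else:
        # Masked region modeling: zero regions, keep every word visible.
        mask = torch.rand(region_feats.shape[:2]) < MASK_PROB
        region_feats[mask] = 0.0
    return text_ids, region_feats, mask   # mask marks positions to predict
```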

From Recognition to Cognition: Visual Commonsense Reasoning

rowanz/r2c CVPR 2019

While visual commonsense reasoning is easy for humans, it remains tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world.
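For context, each VCR item pairs an image with a question, four candidate answers, and four candidate rationales, and models are evaluated on Q→A, QA→R, and the combined Q→AR setting. A hypothetical container for one item (field names are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VCRExample:
    """One Visual Commonsense Reasoning item (illustrative schema)."""
    image_path: str
    question: str
    answers: List[str]        # 4 candidate answers
    rationales: List[str]     # 4 candidate rationales
    answer_label: int         # index of the correct answer
    rationale_label: int      # index of the correct rationale

def q2ar_correct(ex: VCRExample, pred_answer: int, pred_rationale: int) -> bool:
    # Q->AR gives credit only when both the answer AND the rationale are right.
    return pred_answer == ex.answer_label and pred_rationale == ex.rationale_label
```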

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

jackroos/VL-BERT ICLR 2020

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
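In contrast to ViLBERT's two-stream design, VL-BERT is a single-stream model: words and image regions share one transformer. A simplified sketch of that fusion (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Single-stream fusion in the spirit of VL-BERT (simplified sketch):
# word and region embeddings are concatenated into one sequence and
# processed by a single transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2)

txt = torch.randn(1, 12, 768)   # word embeddings
img = torch.randn(1, 36, 768)   # projected region embeddings
fused = encoder(torch.cat([txt, img], dim=1))   # one joint stream
```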

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
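The core idea is to perturb in the embedding space rather than at the pixel or token level. A single-step sketch of embedding-space adversarial training (simplified: VILLA itself uses "free" multi-step adversarial training with additional regularization, and `model` here is an assumed callable mapping embeddings to logits):

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, txt_emb, img_emb, labels, eps=1e-3):
    """One-step adversarial perturbation on text embeddings (sketch)."""
    txt_emb = txt_emb.detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(txt_emb, img_emb), labels)
    # Gradient of the loss w.r.t. the embeddings gives the attack direction.
    grad, = torch.autograd.grad(clean_loss, txt_emb, retain_graph=True)
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    # Train on both the clean and the perturbed inputs.
    adv_loss = F.cross_entropy(model(txt_emb + delta.detach(), img_emb), labels)
    return clean_loss + adv_loss
```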

Unifying Vision-and-Language Tasks via Text Generation

j-min/VL-T5 4 Feb 2021

On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning (most of which have previously been modeled as discriminative tasks), our generative approach with a single unified architecture reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
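The unifying trick is to cast every task, classification included, as conditional text generation with task-specific prefixes. A minimal sketch with Hugging Face's `t5-small` (illustrative only: VL-T5 additionally prepends projected visual features to the encoder input, which is omitted here):

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes "prefix + text in -> text out"; no task-specific heads.
#   VQA:       "vqa: question: ..."     -> an answer string
#   Grounding: "visual grounding: ..."  -> a region token
#   VCR:       "vcr qa: question: ..."  -> the chosen answer text
prompt = "vqa: question: What is the man holding?"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```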

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

yehli/xmodaler 18 Aug 2021

Nevertheless, there has been no open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

jshilong/gpt4roi 7 Jul 2023

Before being sent to the LLM, each region reference is replaced by its RoI features and interleaved with the language embeddings as a single sequence.
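A minimal sketch of that interleaving step (function and tensor names are hypothetical; the real system extracts RoI features with a vision backbone and projects them to the LLM's embedding width):

```python
import torch

def splice_roi_features(token_embs, roi_feats, roi_positions):
    """Replace <region> placeholder embeddings with RoI features.

    token_embs:    (seq_len, dim) language embeddings of the prompt
    roi_feats:     (n_regions, dim) RoI features projected to `dim`
    roi_positions: indices of the placeholder tokens in the prompt
    """
    out = token_embs.clone()
    for feat, pos in zip(roi_feats, roi_positions):
        out[pos] = feat   # vision interleaved with language embeddings
    return out

# e.g. "What is <region> doing near <region>?" with placeholders at 3 and 7
embs = splice_roi_features(torch.randn(10, 768), torch.randn(2, 768), [3, 7])
```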

Think Visually: Question Answering through Virtual Imagery

umich-vl/think_visually ACL 2018

In this paper, we study the problem of geometric reasoning in the context of question-answering.

Fusion of Detected Objects in Text for Visual Question Answering

google-research/language IJCNLP 2019

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language.