Visual Commonsense Reasoning

29 papers with code • 7 benchmarks • 7 datasets

Latest papers with no code

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

no code yet • 24 May 2022

Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning.

Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks

no code yet • 25 Apr 2022

To evaluate our results, we use a dataset focusing on visual commonsense reasoning in time.

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code yet • 22 Apr 2022

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

Attention Mechanism based Cognition-level Scene Understanding

no code yet • 17 Apr 2022

Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with a corresponding rationale, a task that requires inference grounded in real-world knowledge.
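
The VCR task is conventionally decomposed into two multiple-choice stages: question-to-answer (Q→A) and question-plus-answer-to-rationale (QA→R). A minimal inference sketch, where the hypothetical `score` callable stands in for any trained scorer:

```python
def vcr_predict(score, image, question, answers, rationales):
    """Two-stage VCR inference: pick an answer (Q->A), then pick a
    rationale conditioned on that answer (QA->R).

    `score(image, text, candidate) -> float` stands in for any trained
    multiple-choice scorer; VCR provides four candidates per stage.
    """
    a_idx = max(range(len(answers)), key=lambda i: score(image, question, answers[i]))
    qa = f"{question} {answers[a_idx]}"  # condition the rationale on the chosen answer
    r_idx = max(range(len(rationales)), key=lambda i: score(image, qa, rationales[i]))
    return a_idx, r_idx
```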

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code yet • 15 Jan 2022

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
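
Both this entry and MAD above center on distilling knowledge from pretrained teachers (CLIP or unimodal encoders) into a vision-language student. For reference, a minimal sketch of the standard distillation objective; the adaptive/targeted weighting these papers propose is more involved, and all names here are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Standard knowledge distillation: task cross-entropy plus a KL term
    pulling the student toward the frozen teacher's softened predictions."""
    task = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return alpha * task + (1 - alpha) * kd

# Toy usage: 4 examples, 3 classes.
s = torch.randn(4, 3); t = torch.randn(4, 3); y = torch.randint(0, 3, (4,))
loss = distillation_loss(s, t, y)
```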

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

no code yet • CVPR 2022

Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
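
The contrastive masked-snippet objective can be sketched in a few lines: the model's output at each MASK position is scored against a pool of candidate text/audio snippet embeddings and trained with cross-entropy on the true snippet. This is a minimal illustration assuming generic encoders, not the paper's actual architecture; all tensor names are hypothetical:

```python
import torch
import torch.nn.functional as F

def masked_snippet_loss(mask_pred, candidates, target_idx, temperature=0.05):
    """Contrastive selection of the masked-out snippet.

    mask_pred:  (batch, dim)    model output at each MASK position
    candidates: (batch, n, dim) embeddings of n candidate text/audio snippets
    target_idx: (batch,)        index of the true masked-out snippet
    """
    mask_pred = F.normalize(mask_pred, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    # Similarity of the mask prediction to every candidate snippet.
    logits = torch.einsum("bd,bnd->bn", mask_pred, candidates) / temperature
    return F.cross_entropy(logits, target_idx)
```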

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

no code yet • 16 Dec 2021

As for pre-training, a scene-graph-aware pre-training method is proposed to leverage the structural knowledge extracted from the visual scene graph.
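
One simple way to expose scene-graph structure to an image-text model, shown purely as an illustration and not SGEITL's actual mechanism, is to linearize parser-extracted (subject, relation, object) triples into extra input text:

```python
def linearize_scene_graph(triples):
    """Linearize (subject, relation, object) triples into text that can be
    appended to the image-text input of a transformer."""
    return " ; ".join(f"{s} {r} {o}" for s, r, o in triples)

# e.g. triples from an off-the-shelf scene-graph parser:
print(linearize_scene_graph([("person", "holding", "guitar"),
                             ("guitar", "on", "stage")]))
# -> "person holding guitar ; guitar on stage"
```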

Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues

no code yet • ACL 2022

It is a common practice for recent works in vision-language cross-modal reasoning to adopt a binary or multiple-choice classification formulation, taking as input a set of source image(s) and a textual query.
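
The multiple-choice formulation the authors refer to typically encodes the image, the query, and each candidate jointly, then softmaxes over per-candidate scores. A minimal sketch, assuming a hypothetical cross-modal `encoder`:

```python
import torch
import torch.nn as nn

class MultiChoiceHead(nn.Module):
    """Scores each (image, query, candidate) pairing with a shared
    cross-modal `encoder` (a stand-in for any backbone that returns a
    pooled (batch, dim) vector)."""
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder
        self.score = nn.Linear(dim, 1)

    def forward(self, image, query, choices):
        # One logit per candidate; train with cross-entropy over choices.
        logits = [self.score(self.encoder(image, query, c)) for c in choices]
        return torch.cat(logits, dim=-1)  # (batch, n_choices)
```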

Playing Lottery Tickets with Vision and Language

no code yet • 23 Apr 2021

However, we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy.
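
Winning tickets are usually found by (iterative) magnitude pruning: keep the largest-magnitude weights, rewind the rest to their initial values, and retrain. A minimal global-magnitude-pruning sketch with illustrative names (the paper's exact pruning schedule may differ):

```python
import torch

def magnitude_prune_masks(model, sparsity=0.6):
    """Binary masks keeping the largest-magnitude weights of all matrix
    parameters, pruning the bottom `sparsity` fraction globally. In
    lottery-ticket training, the surviving weights are rewound to their
    initialization and the masked subnetwork is retrained."""
    flat = torch.cat([p.detach().abs().flatten()
                      for p in model.parameters() if p.dim() > 1])
    threshold = flat.kthvalue(int(sparsity * flat.numel())).values
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}

# Toy usage on a small model.
mlp = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
masks = magnitude_prune_masks(mlp, sparsity=0.6)
```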