Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Latest papers with no code
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning.
Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks
To evaluate our results, we use a dataset focusing on visual commonsense reasoning in time.
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
Attention Mechanism based Cognition-level Scene Understanding
Given a question-image input, a Visual Commonsense Reasoning (VCR) model predicts an answer together with a supporting rationale, which requires commonsense inference about the real world.
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
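For readers unfamiliar with distillation, the generic objective underlying approaches like this is a KL divergence between the teacher's and student's temperature-softened output distributions. The sketch below shows that standard loss only; CLIP-TD's specific targeted-token weighting is not reproduced here, and all function names are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax (numerically stabilized)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over softened distributions -- the standard
    knowledge-distillation objective (a sketch, not CLIP-TD's exact loss)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
```

The temperature `T` flattens both distributions so the student also learns from the teacher's relative confidence over non-target classes; the loss is zero exactly when the two softened distributions match.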
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
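At inference time, this masked-snippet objective reduces to scoring each candidate snippet embedding against the context of the masked position and picking the best match. A minimal sketch with toy embeddings (the function name and cosine-similarity scoring are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def select_masked_snippet(context_emb, candidate_embs):
    """Return the index of the candidate snippet whose embedding best
    matches the context embedding of the masked position, by cosine
    similarity (a simplified, hypothetical scoring scheme)."""
    c = context_emb / np.linalg.norm(context_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = cands @ c  # cosine similarity of each candidate to the context
    return int(np.argmax(scores))

# Toy example: the second candidate points the same way as the context.
context = np.array([1.0, 0.0, 0.0])
candidates = np.array([[0.0, 1.0, 0.0],
                       [0.9, 0.1, 0.0],
                       [0.0, 0.0, 1.0]])
print(select_masked_snippet(context, candidates))  # -> 1
```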
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
For pre-training, a scene-graph-aware method is proposed to leverage the structural knowledge extracted in the visual scene graph.
Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues
It is common practice for recent work in vision-language cross-modal reasoning to adopt a binary or multiple-choice classification formulation, taking a set of source image(s) and a textual query as input.
Playing Lottery Tickets with Vision and Language
However, we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy.