Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Most implemented papers
Joint Answering and Explanation for Visual Commonsense Reasoning
Given that our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset.
All in One: Exploring Unified Video-Language Pre-training
In this work, we introduce, for the first time, an end-to-end video-language model, the all-in-one Transformer, which embeds raw video and textual signals into joint representations using a unified backbone architecture.
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Bootstrapping from pre-trained language models has proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering.
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
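The QCM-to-ALE chain-of-thought format this paper describes (read the Question and Choices, then emit the Answer followed by a Lecture and Explanation) can be sketched as a simple training-pair builder. This is a hypothetical, simplified helper (it omits the image/context field and uses made-up function and field names), not the paper's actual code:

```python
def build_scienceqa_cot_example(question, choices, answer_idx, lecture, explanation):
    """Hypothetical sketch of a QCM -> ALE chain-of-thought training pair:
    the model sees the question and options, and learns to produce the
    answer followed by the lecture and explanation as its reasoning chain."""
    letters = "ABCDE"
    options = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    source = f"Question: {question}\nOptions: {options}\nAnswer:"
    target = (f" The answer is ({letters[answer_idx]}). "
              f"BECAUSE: {lecture} {explanation}")
    return source, target

src, tgt = build_scienceqa_cot_example(
    "Which property do these objects have in common?",
    ["hard", "soft"], 1,
    "A property is something you can observe about an object.",
    "All of the objects are soft.")
print(src)
print(tgt)
```

Placing the lecture and explanation after the answer token lets the same pair be used either to supervise the full reasoning chain or, at inference, to stop decoding once the answer letter is produced.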
VASR: Visual Analogies of Situation Recognition
We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies.
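Ranking candidate completions B' for a visual analogy A : A' :: B : B' can be done with classic embedding arithmetic (target = A' - A + B, then nearest neighbor). The sketch below uses toy stand-in vectors where VASR would use CLIP image features; the function name and the toy embeddings are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np

def top_analogy_candidates(embeddings, a, a_prime, b, k=1):
    """Rank candidate images B' for the analogy A : A' :: B : B' by cosine
    similarity to the arithmetic target A' - A + B. `embeddings` maps image
    names to feature vectors (toy stand-ins for CLIP features here)."""
    target = embeddings[a_prime] - embeddings[a] + embeddings[b]
    scores = {}
    for name, vec in embeddings.items():
        if name in {a, a_prime, b}:
            continue  # exclude the query images themselves
        cos = float(np.dot(target, vec) /
                    (np.linalg.norm(target) * np.linalg.norm(vec) + 1e-8))
        scores[name] = cos
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy embeddings encoding two attributes (time of day, scene type):
emb = {
    "day_beach":   np.array([1.0, 0.0, 1.0, 0.0]),
    "night_beach": np.array([0.0, 1.0, 1.0, 0.0]),
    "day_city":    np.array([1.0, 0.0, 0.0, 1.0]),
    "night_city":  np.array([0.0, 1.0, 0.0, 1.0]),
}
# day_beach : night_beach :: day_city : ?  → the day→night change applied to city
print(top_analogy_candidates(emb, "day_beach", "night_beach", "day_city"))  # → ['night_city']
```

With real CLIP features the top-k list would serve as the pool of candidate analogies to be filtered against the situation recognition annotations.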
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity.
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.