Visual Commonsense Reasoning

29 papers with code • 7 benchmarks • 7 datasets

Most implemented papers

Joint Answering and Explanation for Visual Commonsense Reasoning

sdlzy/arc 25 Feb 2022

Given that our framework is model-agnostic, we apply it to the existing popular baselines and validate its effectiveness on the benchmark dataset.

All in One: Exploring Unified Video-Language Pre-training

showlab/all-in-one CVPR 2023

In this work, we for the first time introduce an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.
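
A minimal sketch of the general idea of a unified backbone that processes video and text in one shared sequence; this is a toy PyTorch model for illustration, not the authors' architecture, and all dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class ToyUnifiedBackbone(nn.Module):
    """Toy joint video-text encoder: both modalities share one transformer."""
    def __init__(self, vocab_size=30522, dim=256, patch_dim=3 * 16 * 16, depth=4, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)   # token ids -> vectors
        self.video_proj = nn.Linear(patch_dim, dim)       # flattened video patches -> vectors
        self.type_embed = nn.Embedding(2, dim)            # 0 = video token, 1 = text token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, video_patches, text_ids):
        # video_patches: (B, Nv, patch_dim); text_ids: (B, Nt)
        v = self.video_proj(video_patches) + self.type_embed.weight[0]
        t = self.text_embed(text_ids) + self.type_embed.weight[1]
        joint = torch.cat([v, t], dim=1)      # one sequence for both modalities
        return self.encoder(joint)            # joint video-text representation

model = ToyUnifiedBackbone()
out = model(torch.randn(2, 8, 3 * 16 * 16), torch.randint(0, 30522, (2, 12)))
print(out.shape)  # torch.Size([2, 20, 256])
```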

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

intellabs/vl-interpret CVPR 2022

Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

thunlp/pevl 23 May 2022

We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
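
A minimal sketch of how object positions can be discretized into text-like tokens so that a language-modeling framework can handle grounding; the bin count and token format below are illustrative assumptions, not PEVL's exact scheme.

```python
def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Map a pixel-space bounding box to discrete position tokens."""
    # box = (x_min, y_min, x_max, y_max) in pixels
    coords = [box[0] / image_w, box[1] / image_h, box[2] / image_w, box[3] / image_h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [f"<bin_{b}>" for b in bins]

print(box_to_position_tokens((48, 30, 320, 410), image_w=640, image_h=480))
# ['<bin_38>', '<bin_32>', '<bin_256>', '<bin_437>']
```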

ILLUME: Rationalizing Vision-Language Models through Human Interactions

ml-research/ILLUME 17 Aug 2022

Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering.

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

lupantech/ScienceQA 20 Sep 2022

We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
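
A minimal sketch of a "lecture + explanation, then answer" prompt in the spirit of the chain-of-thought format described above; the exact template used for ScienceQA may differ, and the question text here is purely illustrative.

```python
def build_cot_prompt(question, choices, context=""):
    """Assemble a prompt that asks for a lecture and explanation before the answer."""
    opts = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Options: {opts}\n"
        "First write a short lecture and an explanation (the chain of thought), "
        "then state the final answer as a single option letter.\n"
        "Lecture:"
    )

print(build_cot_prompt(
    "Which property do these objects have in common?",
    ["hard", "flexible", "transparent"],
    context="The objects are a glass jar and a ceramic mug.",
))
```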

VASR: Visual Analogies of Situation Recognition

vasr-dataset/vasr 8 Dec 2022

We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies.
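
A minimal sketch of the candidate-retrieval step using CLIP image embeddings via the Hugging Face transformers API; the model checkpoint and image paths are illustrative, and the actual VASR pipeline additionally filters candidates with situation recognition annotations.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["query.jpg", "cand1.jpg", "cand2.jpg"]  # hypothetical image files
images = [Image.open(p) for p in paths]
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    feats = model.get_image_features(**inputs)
feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings

# Cosine similarity of each candidate to the query image (index 0).
sims = feats[1:] @ feats[0]
print(sims)  # higher score = visually closer candidate analogy
```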

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

jiwanchung/esper CVPR 2023

Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity.

A Survey on Interpretable Cross-modal Reasoning

ZuyiZhou/Awesome-Interpretable-Cross-modal-Reasoning 5 Sep 2023

In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.