Visual Commonsense Reasoning

29 papers with code • 7 benchmarks • 7 datasets

A Survey on Interpretable Cross-modal Reasoning

ZuyiZhou/Awesome-Interpretable-Cross-modal-Reasoning 5 Sep 2023

In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

jshilong/gpt4roi 7 Jul 2023

Before being sent to the LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence.
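
To make the interleaving concrete, here is a minimal sketch, not the GPT4RoI implementation: a region placeholder in the tokenized instruction is swapped for a projected RoI feature so that visual and language embeddings form one input sequence for the LLM. The module, class, and variable names are illustrative assumptions.

```python
# Sketch only: replace region placeholder tokens with projected RoI features.
import torch
import torch.nn as nn

class RegionPromptBuilder(nn.Module):  # hypothetical helper, not from the repo
    def __init__(self, vis_dim: int, lm_dim: int):
        super().__init__()
        # project pooled RoI features into the LLM embedding space
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, token_embeds, roi_feats, region_positions):
        # token_embeds: (seq_len, lm_dim) language embeddings of the instruction
        # roi_feats: (num_regions, vis_dim) pooled features, one per referenced box
        # region_positions: index of each region placeholder token in the sequence
        roi_embeds = self.proj(roi_feats)
        out = token_embeds.clone()
        for i, pos in enumerate(region_positions):
            out[pos] = roi_embeds[i]  # swap the placeholder for the RoI feature
        return out  # interleaved vision-language sequence fed to the LLM
```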

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

jiwanchung/esper CVPR 2023

Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity.

VASR: Visual Analogies of Situation Recognition

vasr-dataset/vasr 8 Dec 2022

We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies.
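
As a rough illustration of how CLIP embeddings can be used to rank candidate images for an A:A' :: B:? analogy, here is a sketch of the general idea rather than the VASR generation pipeline; it assumes the openai/CLIP package, and the file names and scoring rule are made up.

```python
# Sketch only: pick the candidate whose CLIP embedding best completes B + (A' - A).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)  # unit-normalized image feature

a, a_prime, b = embed("A.jpg"), embed("A_prime.jpg"), embed("B.jpg")
target = b + (a_prime - a)  # analogy direction in CLIP space
candidates = {p: embed(p) for p in ["c1.jpg", "c2.jpg", "c3.jpg", "c4.jpg"]}
best = max(candidates, key=lambda p: (candidates[p] @ target.T).item())
print("predicted analogy completion:", best)
```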

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

lupantech/ScienceQA 20 Sep 2022

We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
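
A minimal sketch of what such a chain-of-thought example might look like: the prompt carries the question, context, and options, and the target is the answer followed by a lecture and explanation. The exact string format is an assumption based on the description above, and the model call itself is omitted.

```python
# Sketch only: build a (prompt, target) pair where the target is answer + lecture + explanation.
def build_cot_example(question, context, choices, answer, lecture, explanation):
    options = " ".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"Question: {question}\nContext: {context}\nOptions: {options}\nAnswer:"
    target = f" The answer is {answer}. BECAUSE: {lecture} {explanation}"
    return prompt, target

prompt, target = build_cot_example(
    question="Which property do these objects have in common?",
    context="Select the best answer.",
    choices=["hard", "soft", "stretchy"],
    answer="(A)",
    lecture="A property is something you can observe about an object.",
    explanation="All three objects are hard.",
)
print(prompt + target)
```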

ILLUME: Rationalizing Vision-Language Models through Human Interactions

ml-research/ILLUME 17 Aug 2022

Bootstrapping from pre-trained language models has proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering.

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

thunlp/pevl 23 May 2022

We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
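
A small sketch of the general idea of position-enhanced text: bounding boxes are discretized into a vocabulary of position tokens appended to the corresponding phrase, so a standard language-modeling objective can learn cross-modal alignment. The bin count and token format below are illustrative assumptions, not PEVL's exact scheme.

```python
# Sketch only: turn a bounding box into discretized position tokens after a phrase.
def box_to_position_tokens(box, image_w, image_h, num_bins=32):
    """box = (x1, y1, x2, y2) in pixels -> four discretized position tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return [f"<pos_{min(int(v * num_bins), num_bins - 1)}>" for v in norm]

def add_position_to_phrase(phrase, box, image_w, image_h):
    toks = " ".join(box_to_position_tokens(box, image_w, image_h))
    return f"{phrase} {toks}"

# e.g. "a dog <pos_3> <pos_11> <pos_20> <pos_29>"
print(add_position_to_phrase("a dog", (60, 190, 330, 470), 512, 512))
```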

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

intellabs/vl-interpret CVPR 2022

Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.

All in One: Exploring Unified Video-Language Pre-training

showlab/all-in-one CVPR 2023

In this work, we introduce, for the first time, an end-to-end video-language model, the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.
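
A toy sketch of such a unified backbone: video patch embeddings and text token embeddings are concatenated into one sequence and processed by a single transformer encoder. The dimensions, layer counts, and patch size are arbitrary assumptions, not the paper's architecture.

```python
# Sketch only: one shared transformer encoder over concatenated video and text tokens.
import torch
import torch.nn as nn

class UnifiedVideoTextEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_heads=8, vocab_size=30522):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # flatten each 3x16x16 video patch into a token embedding
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video_patches, text_ids):
        # video_patches: (batch, num_patches, 3*16*16); text_ids: (batch, seq_len)
        vid = self.patch_embed(video_patches)
        txt = self.text_embed(text_ids)
        joint = torch.cat([vid, txt], dim=1)  # one joint video-text sequence
        return self.encoder(joint)            # shared backbone produces joint representations

model = UnifiedVideoTextEncoder()
out = model(torch.randn(2, 64, 3 * 16 * 16), torch.randint(0, 30522, (2, 12)))
print(out.shape)  # (2, 76, 256)
```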

Joint Answering and Explanation for Visual Commonsense Reasoning

sdlzy/arc 25 Feb 2022

Given that our framework is model-agnostic, we apply it to the existing popular baselines and validate its effectiveness on the benchmark dataset.
