Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Latest papers
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Before being sent to the LLM, each region reference is replaced by RoI features and interleaved with the language embeddings as a single sequence.
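A minimal sketch of this interleaving, assuming hypothetical names (a `<region>` placeholder id, a `roi_proj` projection layer) rather than the paper's actual identifiers:

```python
# Sketch: replace <region> placeholder embeddings with projected RoI features
# so the LLM receives an interleaved vision-language sequence.
import torch
import torch.nn as nn

REGION_TOKEN_ID = 32000            # hypothetical id of a "<region>" placeholder token
hidden_dim, roi_dim = 4096, 1024

roi_proj = nn.Linear(roi_dim, hidden_dim)   # map RoI features into the LLM embedding space

def build_inputs(input_ids: torch.LongTensor,
                 token_embedding: nn.Embedding,
                 roi_feats: torch.Tensor) -> torch.Tensor:
    """Swap each <region> placeholder embedding for a projected RoI feature."""
    embeds = token_embedding(input_ids)                      # (seq_len, hidden_dim)
    region_pos = (input_ids == REGION_TOKEN_ID).nonzero(as_tuple=True)[0]
    embeds[region_pos] = roi_proj(roi_feats)                 # one RoI feature per placeholder
    return embeds                                            # pass to the LLM as inputs_embeds
```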
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity.
VASR: Visual Analogies of Situation Recognition
We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies.
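As an illustration of the general idea (not VASR's exact candidate-generation pipeline, which relies on situation recognition annotations), CLIP image embeddings can be used to retrieve a candidate D for an analogy A:B :: C:? by simple vector arithmetic:

```python
# Sketch: retrieve an analogy candidate with CLIP embeddings and vector arithmetic.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

a, b, c = [Image.open(p) for p in ("a.jpg", "b.jpg", "c.jpg")]   # A : B :: C : ?
pool = [Image.open(p) for p in ("cand1.jpg", "cand2.jpg")]       # candidate D images

e_a, e_b, e_c = embed([a, b, c])
query = e_c + (e_b - e_a)            # analogy arithmetic in embedding space
scores = embed(pool) @ query         # cosine similarity against each candidate
best = scores.argmax().item()
```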
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
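A rough sketch of such a chain-of-thought training format, where the target interleaves a lecture and an explanation before the final answer (the field layout and wording here are illustrative, not the paper's exact template):

```python
# Sketch: prompt/target pair in which the rationale precedes the answer,
# mimicking multi-hop reasoning on a ScienceQA-style question.
def format_example(question, choices, lecture, explanation, answer_idx):
    options = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"Question: {question}\nOptions: {options}\nAnswer:"
    target = (
        f"Lecture: {lecture}\n"
        f"Explanation: {explanation}\n"
        f"The answer is ({chr(65 + answer_idx)})."
    )
    return prompt, target
```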
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLM) for tasks such as image captioning or visual question answering.
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
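One simple way to make text inputs position-aware, sketched below with an illustrative bin count and token format (not PEVL's exact vocabulary), is to discretize a bounding box into position tokens attached to the phrase it grounds:

```python
# Sketch: discretize a bounding box into position tokens appended to a phrase.
def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Map (x1, y1, x2, y2) pixel coordinates to discrete position tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return " ".join(f"<pos_{b}>" for b in bins)

phrase = "a dog chasing a ball"
text_with_position = f"{phrase} {box_to_position_tokens((34, 50, 210, 180), 640, 480)}"
# -> "a dog chasing a ball <pos_27> <pos_53> <pos_168> <pos_192>"
```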
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.
All in One: Exploring Unified Video-Language Pre-training
In this work, we introduce for the first time an end-to-end video-language model, the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.
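A minimal sketch of such a unified backbone, with illustrative dimensions and a simple patch embedding standing in for the paper's actual design, might look like:

```python
# Sketch: one shared Transformer encoder over a joint video-patch + text-token sequence.
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, depth=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # frame -> patch tokens
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)    # single shared backbone

    def forward(self, frames, input_ids):
        # frames: (B, T, 3, H, W); input_ids: (B, L)
        b, t, _, _, _ = frames.shape
        vis = self.patch_embed(frames.flatten(0, 1))      # (B*T, dim, H/16, W/16)
        vis = vis.flatten(2).transpose(1, 2)              # (B*T, patches, dim)
        vis = vis.reshape(b, -1, vis.size(-1))            # (B, T*patches, dim)
        txt = self.token_embed(input_ids)                 # (B, L, dim)
        joint = torch.cat([vis, txt], dim=1)              # joint video-text sequence
        return self.encoder(joint)                        # shared joint representation
```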
Joint Answering and Explanation for Visual Commonsense Reasoning
Given that our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset.