Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Most implemented papers
Joint Answering and Explanation for Visual Commonsense Reasoning
Given that our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset.
All in One: Exploring Unified Video-Language Pre-training
In this work, we introduce, for the first time, an end-to-end video-language model, the all-in-one Transformer, which embeds raw video and textual signals into joint representations using a unified backbone architecture.
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Bootstrapping from pre-trained language models has proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering.
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
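The QCM-to-ALE chain-of-thought format this paper describes (read the Question and Choices, then emit the Answer followed by a Lecture and Explanation) can be sketched as a simple training-pair builder. This is a hypothetical, simplified helper (it omits the image/context field and uses made-up function and field names), not the paper's actual code:

```python
def build_scienceqa_cot_example(question, choices, answer_idx, lecture, explanation):
    """Hypothetical sketch of a QCM -> ALE chain-of-thought training pair:
    the model sees the question and options, and learns to produce the
    answer followed by the lecture and explanation as its reasoning chain."""
    letters = "ABCDE"
    options = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    source = f"Question: {question}\nOptions: {options}\nAnswer:"
    target = (f" The answer is ({letters[answer_idx]}). "
              f"BECAUSE: {lecture} {explanation}")
    return source, target

src, tgt = build_scienceqa_cot_example(
    "Which property do these objects have in common?",
    ["hard", "soft"], 1,
    "A property is something you can observe about an object.",
    "All of the objects are soft.")
print(src)
print(tgt)
```

Placing the lecture and explanation after the answer token lets the same pair be used either to supervise the full reasoning chain or, at inference, to stop decoding once the answer letter is produced.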
VASR: Visual Analogies of Situation Recognition
We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies.
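Ranking candidate completions B' for a visual analogy A : A' :: B : B' can be done with classic embedding arithmetic (target = A' - A + B, then nearest neighbor). The sketch below uses toy stand-in vectors where VASR would use CLIP image features; the function name and the toy embeddings are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np

def top_analogy_candidates(embeddings, a, a_prime, b, k=1):
    """Rank candidate images B' for the analogy A : A' :: B : B' by cosine
    similarity to the arithmetic target A' - A + B. `embeddings` maps image
    names to feature vectors (toy stand-ins for CLIP features here)."""
    target = embeddings[a_prime] - embeddings[a] + embeddings[b]
    scores = {}
    for name, vec in embeddings.items():
        if name in {a, a_prime, b}:
            continue  # exclude the query images themselves
        cos = float(np.dot(target, vec) /
                    (np.linalg.norm(target) * np.linalg.norm(vec) + 1e-8))
        scores[name] = cos
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy embeddings encoding two attributes (time of day, scene type):
emb = {
    "day_beach":   np.array([1.0, 0.0, 1.0, 0.0]),
    "night_beach": np.array([0.0, 1.0, 1.0, 0.0]),
    "day_city":    np.array([1.0, 0.0, 0.0, 1.0]),
    "night_city":  np.array([0.0, 1.0, 0.0, 1.0]),
}
# day_beach : night_beach :: day_city : ?  → the day→night change applied to city
print(top_analogy_candidates(emb, "day_beach", "night_beach", "day_city"))  # → ['night_city']
```

With real CLIP features the top-k list would serve as the pool of candidate analogies to be filtered against the situation recognition annotations.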
Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 manifest broad commonsense reasoning capacity.
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.