Visual Commonsense Reasoning
29 papers with code • 7 benchmarks • 7 datasets
Latest papers
Towards artificial general intelligence via a multimodal foundation model
To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained on large-scale multimodal data, which can be quickly adapted to a variety of downstream cognitive tasks.
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
Commonsense is defined as the knowledge that is shared by everyone.
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Nevertheless, there has been no open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
Interpretable Visual Understanding with Cognitive Attention Network
While recognition-level image understanding has achieved remarkable advances, reliable visual scene understanding requires comprehensive image understanding at not only the recognition level but also the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge.
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory
Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
MERLOT: Multimodal Neural Script Knowledge Models
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.
Unifying Vision-and-Language Tasks via Text Generation
On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models.
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights.
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
TAB-VCR: Tags and Attributes based VCR Baselines
Despite impressive recent progress reported on tasks that require reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets.