Variational Causal Inference Network for Explanatory Visual Question Answering

ICCV 2023 · Dizhan Xue, Shengsheng Qian, Changsheng Xu

Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations of the reasoning process. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations that enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and the explanation separately, which ignores the causal correlation between them. Moreover, they neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual features and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference module to establish the target causal structure and predict the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.
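The abstract describes a three-stage pipeline: pretrained vision-and-language feature extraction, a gating transformer that fuses question and visual context while decoding the explanation, and a variational head that predicts the answer conditioned on the generated explanation. Below is a minimal PyTorch sketch of that flow, assuming hypothetical module names, feature dimensions, and vocabulary/answer sizes; the gating and variational details are simplified illustrations and are not the authors' implementation.

```python
# Hypothetical sketch of the VCIN pipeline described in the abstract.
# All dimensions, names, and design details here are illustrative assumptions.
import torch
import torch.nn as nn


class GatedExplanationDecoder(nn.Module):
    """Transformer decoder whose states are gated between question- and region-attended context."""

    def __init__(self, dim=768, vocab_size=30522, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.gate = nn.Linear(2 * dim, dim)      # fuses the two attended streams
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, expl_tokens, question_feats, visual_feats):
        h = self.embed(expl_tokens)
        h_q = self.decoder(h, question_feats)    # attend to question words
        h_v = self.decoder(h, visual_feats)      # attend to visual regions
        g = torch.sigmoid(self.gate(torch.cat([h_q, h_v], dim=-1)))
        fused = g * h_q + (1.0 - g) * h_v        # cross-modal gating
        return self.lm_head(fused), fused


class VariationalAnswerHead(nn.Module):
    """Predicts the answer from a latent variable conditioned on the explanation states."""

    def __init__(self, dim=768, num_answers=1843):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, expl_summary):
        mu, logvar = self.to_mu(expl_summary), self.to_logvar(expl_summary)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.classifier(z), kl


if __name__ == "__main__":
    B, Lq, Nv, Le, dim = 2, 12, 36, 20, 768
    question_feats = torch.randn(B, Lq, dim)   # stand-in for pretrained VL question features
    visual_feats = torch.randn(B, Nv, dim)     # stand-in for pretrained VL region features
    expl_tokens = torch.randint(0, 30522, (B, Le))

    decoder = GatedExplanationDecoder(dim=dim)
    answer_head = VariationalAnswerHead(dim=dim)

    expl_logits, fused = decoder(expl_tokens, question_feats, visual_feats)
    answer_logits, kl = answer_head(fused.mean(dim=1))  # explanation summary -> answer
    print(expl_logits.shape, answer_logits.shape, kl.item())
```

In this sketch the answer is computed from the explanation states rather than in a separate branch, which mirrors the paper's stated goal of modeling a causal correlation from explanation to answer; the actual causal structure and training objective in VCIN differ in detail.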

Results

Task: Explanatory Visual Question Answering · Dataset: GQA-REX · Model: VCIN

Metric      Value    Global Rank
BLEU-4      58.65    #1
METEOR      41.57    #1
ROUGE-L     81.45    #1
CIDEr       519.23   #1
SPICE       54.63    #1
Grounding   77.33    #1
GQA-val     81.80    #1
GQA-test    60.61    #1
