Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text).
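The distinction can be sketched in a few lines: under conditional masking, only one modality is corrupted at a time while the other is fully observed, whereas joint masking corrupts both simultaneously. The function names and the `[MASK]` placeholder below are illustrative, not the paper's actual implementation.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, p, rng):
    """Replace each token with [MASK] independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

def conditional_mask(text_tokens, image_regions, p=0.15, seed=0):
    """Conditional masking (sketch): mask the text while the full set of
    image regions stays observed. A second pass would mask regions with
    the full text observed. Hypothetical helper, not UNITER's code."""
    rng = random.Random(seed)
    masked_text = mask_tokens(text_tokens, p, rng)
    return masked_text, list(image_regions)  # regions untouched

def joint_mask(text_tokens, image_regions, p=0.15, seed=0):
    """Joint masking (sketch): both modalities are corrupted at once."""
    rng = random.Random(seed)
    return mask_tokens(text_tokens, p, rng), mask_tokens(image_regions, p, rng)
```

With conditional masking, the model never has to reconstruct a text token and an image region at the same position simultaneously, which is the potential misalignment the paper's design avoids.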
Ranked #1 on Visual Reasoning on NLVR2 Test
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
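The core idea of adversarial training can be illustrated with a toy FGSM-style step: perturb an input in the sign of the loss gradient to locally increase the loss, then train against the perturbed input. This is a generic sketch with hypothetical names; VILLA applies such perturbations in the embedding space of each modality, not to raw inputs as shown here.

```python
def loss(x, target):
    """Toy squared-error loss on a vector (stand-in for the model loss)."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def grad(x, target):
    """Analytic gradient of the toy loss with respect to x."""
    return [2 * (xi - ti) for xi, ti in zip(x, target)]

def fgsm_perturb(x, g, eps=0.01):
    """FGSM-style step: move x by eps in the sign of the gradient,
    producing an adversarial example that increases the loss."""
    sign = lambda v: 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]
```

An adversarial training loop would then minimize the loss on both the clean and the perturbed inputs, encouraging embeddings that are robust to small perturbations.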
We evaluate various existing VQA baselines and build an Explainable Visual Entailment (EVE) system to address the VE task.
Ranked #1 on Visual Entailment on SNLI-VE val
We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is an image rather than a natural language sentence.