Unified Vision-Language Pre-Training for Image Captioning and VQA

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.
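
Concretely, the same transformer stack can serve both regimes by changing only the self-attention mask: a fully bidirectional mask for understanding tasks such as VQA, and a sequence-to-sequence mask for caption generation in which text tokens attend to the image region features and to earlier text tokens only. The PyTorch sketch below illustrates this idea under those assumptions; the class name, mask layout, and dimensions are illustrative and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SharedTransformer(nn.Module):
    """One transformer stack reused for both understanding and generation;
    the task regime is switched purely by the self-attention mask."""

    def __init__(self, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.stack = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x, attn_mask=None):
        # attn_mask: (seq_len, seq_len) additive mask, 0 = attend, -inf = block.
        # attn_mask=None gives full bidirectional attention (understanding mode).
        return self.stack(x, mask=attn_mask)

def seq2seq_mask(n_img, n_txt):
    """Generation-mode mask: image region tokens attend among themselves,
    text tokens attend to all image tokens and to earlier text tokens only."""
    n = n_img + n_txt
    mask = torch.full((n, n), float("-inf"))
    mask[:n_img, :n_img] = 0.0                        # image <-> image
    mask[n_img:, :n_img] = 0.0                        # text -> image
    causal = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))
    mask[n_img:, n_img:][causal] = 0.0                # text -> earlier text
    return mask

# Hypothetical usage: 100 image region features followed by 20 text tokens.
model = SharedTransformer()
x = torch.randn(2, 120, 768)
h_generation = model(x, attn_mask=seq2seq_mask(100, 20))   # captioning-style
h_understanding = model(x)                                  # VQA-style, bidirectional
```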

TASK                       DATASET                    MODEL        METRIC    VALUE   GLOBAL RANK
Image Captioning           COCO Captions (test)       Unified VLP  BLEU-4     36.5   #1
Image Captioning           COCO Captions (test)       Unified VLP  CIDEr     116.9   #1
Image Captioning           COCO Captions (test)       Unified VLP  METEOR     28.4   #1
Image Captioning           COCO Captions (test)       Unified VLP  SPICE      21.2   #1
Image Captioning           Flickr30k Captions (test)  Unified VLP  BLEU-4     30.1   #1
Image Captioning           Flickr30k Captions (test)  Unified VLP  CIDEr      67.4   #1
Image Captioning           Flickr30k Captions (test)  Unified VLP  METEOR     23     #1
Image Captioning           Flickr30k Captions (test)  Unified VLP  SPICE      17     #1
Visual Question Answering  VQA v2 (test-std)          Unified VLP  overall    70.7   #44

Methods used in the Paper

METHOD                        TYPE
Residual Connection           Skip Connections
BPE                           Subword Segmentation
Dense Connections             Feedforward Networks
Label Smoothing               Regularization
ReLU                          Activation Functions
Adam                          Stochastic Optimization
Softmax                       Output Functions
Dropout                       Regularization
Multi-Head Attention          Attention Modules
Layer Normalization           Normalization
Scaled Dot-Product Attention  Attention Mechanisms
Transformer                   Transformers
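
Several of the entries above (Scaled Dot-Product Attention, Multi-Head Attention, Softmax) are the standard Transformer building blocks used by the shared network. As a quick reference, here is a minimal sketch of scaled dot-product attention; it illustrates the generic operation and is not code from the paper.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores + mask        # additive mask: 0 = keep, -inf = block
    return torch.softmax(scores, dim=-1) @ v

# Example: batch of 2, 8 heads, 10 positions, head dimension 64.
q = k = v = torch.randn(2, 8, 10, 64)
out = scaled_dot_product_attention(q, k, v)   # shape (2, 8, 10, 64)
```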