VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

28 Sep 2020 · Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

It is highly desirable yet challenging to generate image captions that describe novel objects unseen in caption-labeled training data, a capability evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training, so conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pre-training (VIVO), which performs pre-training in the absence of caption annotations. By breaking VLP's dependency on paired image-caption training data, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we analyze the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model achieves new state-of-the-art results on nocaps and surpasses the human CIDEr score.
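The core pre-training ingredient described above is a set-based loss: because image tags are unordered, the masked-tag predictions are matched to the ground-truth tags with the Hungarian algorithm before computing cross-entropy. Below is a minimal sketch of how such a loss can be written, assuming PyTorch and SciPy; the function name `hungarian_tag_loss` and all variable names are illustrative and not taken from the authors' released code.

```python
# Minimal sketch of a Hungarian matching loss for masked tag prediction.
# Assumption: `logits` are the Transformer's scores at the M masked tag
# positions; the tag set is treated as unordered.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def hungarian_tag_loss(logits: torch.Tensor, target_tag_ids: torch.Tensor) -> torch.Tensor:
    """
    logits:         (M, V) scores over a tag vocabulary of size V,
                    one row per masked tag position.
    target_tag_ids: (M,) ground-truth tag ids for the masked tags,
                    treated as an unordered set.
    """
    log_probs = F.log_softmax(logits, dim=-1)            # (M, V)

    # Cost of assigning ground-truth tag j to masked position i:
    # the negative log-probability of that tag at that position.
    cost = -log_probs[:, target_tag_ids]                 # (M, M)

    # Hungarian algorithm: one-to-one assignment with minimal total cost,
    # so the loss is invariant to the (arbitrary) order of the input tags.
    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    row_idx = torch.as_tensor(row_idx, device=logits.device)
    col_idx = torch.as_tensor(col_idx, device=logits.device)

    # Standard cross-entropy against the matched targets.
    return F.cross_entropy(logits[row_idx], target_tag_ids[col_idx])


# Toy usage: 3 masked positions, tag vocabulary of size 10.
logits = torch.randn(3, 10, requires_grad=True)
tags = torch.tensor([2, 7, 5])
loss = hungarian_tag_loss(logits, tags)
loss.backward()
```

The matching step is what lets the model learn from tag sets rather than tag sequences: any permutation of the ground-truth tags yields the same loss value.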

Task: Image Captioning · Model: Microsoft Cognitive Services team. Each cell shows the metric value with the global rank on the benchmark in parentheses.

| Benchmark | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | SPICE |
|---|---|---|---|---|---|---|---|---|
| nocaps entire | 114.25 (#4) | 85.62 (#3) | 71.36 (#3) | 53.62 (#3) | 34.65 (#3) | 61.2 (#3) | 31.27 (#3) | 14.85 (#4) |
| nocaps in-domain | 112.82 (#6) | 86.33 (#5) | 72.83 (#5) | 55.94 (#5) | 37.97 (#5) | 62.48 (#5) | 32.7 (#5) | 15.22 (#5) |
| nocaps near-domain | 115.54 (#5) | 86.48 (#5) | 72.6 (#5) | 55.26 (#5) | 36.31 (#5) | 61.9 (#5) | 31.8 (#5) | 15.06 (#6) |
| nocaps out-of-domain | 110.14 (#5) | 81.73 (#5) | 65.48 (#5) | 45.58 (#5) | 25.78 (#5) | 57.57 (#5) | 28.17 (#5) | 13.74 (#10) |
| nocaps-XD entire | 100.12 (#5) | 82.27 (#5) | 66.04 (#5) | 47.48 (#5) | 28.95 (#5) | 58.26 (#5) | 29.47 (#5) | 14.04 (#6) |
| nocaps-XD in-domain | 100.62 (#4) | 82.94 (#4) | 67.56 (#4) | 49.66 (#4) | 32.07 (#4) | 59.43 (#4) | 30.62 (#4) | 14.7 (#5) |
| nocaps-XD near-domain | 101.2 (#4) | 82.88 (#4) | 67.01 (#4) | 48.73 (#4) | 30.21 (#4) | 58.76 (#4) | 30.0 (#4) | 14.27 (#5) |
| nocaps-XD out-of-domain | 95.5 (#3) | 79.44 (#4) | 61.15 (#3) | 41.03 (#3) | 21.79 (#3) | 55.49 (#3) | 26.56 (#4) | 12.66 (#5) |
