Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. We also provide extensive experiments and a benchmark of different downstream tasks, including the largest human-verified image-text test set to date. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
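The abstract mentions token-wise similarity in contrastive learning alongside standard global contrastive pre-training. As a rough illustration only, not the paper's implementation, the NumPy sketch below shows a global CLIP-style symmetric contrastive loss and a FILIP-style token-wise (late-interaction) similarity; all function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def token_wise_similarity(img_tokens, txt_tokens):
    """FILIP-style late interaction (illustrative): for each text token,
    take its max similarity over all image tokens, then average.
    Inputs are assumed to be L2-normalized token embeddings."""
    sim = img_tokens @ txt_tokens.T   # (n_img_tokens, n_txt_tokens)
    return sim.max(axis=0).mean()

def contrastive_logits(image_emb, text_emb, temperature=0.07):
    """Cosine-similarity logits between a batch of image and text embeddings."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    return image_emb @ text_emb.T / temperature

def clip_loss(logits):
    """Symmetric cross-entropy over image-to-text and text-to-image directions,
    with matched pairs on the diagonal as positives."""
    n = logits.shape[0]
    labels = np.arange(n)
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))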


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@1 | 67.0 | # 8 |
| Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@5 | 91.4 | # 8 |
| Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@10 | 96.7 | # 9 |
| Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@1 | 74.0 | # 7 |
| Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@5 | 94.4 | # 6 |
| Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@10 | 98.1 | # 6 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@1 | 49.2 | # 11 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@5 | 79.4 | # 12 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@10 | 87.9 | # 12 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@1 | 53.4 | # 10 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@5 | 80.2 | # 11 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@10 | 90.1 | # 11 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@1 | 45.7 | # 13 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@5 | 73.8 | # 13 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@10 | 82.2 | # 13 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@1 | 51.7 | # 11 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@5 | 78.9 | # 11 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@10 | 86.3 | # 11 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@1 | 77.4 | # 9 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@5 | 94.5 | # 9 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@10 | 97.0 | # 7 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@1 | 67.6 | # 10 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@5 | 89.6 | # 10 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@10 | 94.2 | # 10 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@1 | 33.4 | # 8 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@5 | 59.3 | # 8 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@10 | 69.7 | # 8 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | Mean Recall | 54.1 | # 8 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@1 | 39.2 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@5 | 66.9 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@10 | 77.4 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | Mean Recall | 61.2 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@1 | 52.7 | # 6 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@5 | 77.9 | # 6 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@10 | 85.6 | # 6 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | Mean Recall | 72.1 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@1 | 42.7 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@5 | 69.0 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@10 | 78.0 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | Mean Recall | 63.2 | # 6 |
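The Mean Recall figures on the MUGE rows are simply the arithmetic mean of R@1, R@5, and R@10. For instance, the zero-shot Wukong (ViT-B/32) row works out as:

```python
# Mean Recall = average of R@1, R@5, R@10
# (values from the zero-shot MUGE row for Wukong (ViT-B/32))
recalls = {"R@1": 33.4, "R@5": 59.3, "R@10": 69.7}
mean_recall = sum(recalls.values()) / len(recalls)
print(round(mean_recall, 1))  # 54.1, matching the table
```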
