Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. We also provide extensive experiments and a benchmark of different downstream tasks, including the largest human-verified image-text test set to date. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
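The abstract mentions token-wise similarity in contrastive learning alongside standard global contrastive pre-training. As a rough illustration only, not the paper's implementation, the NumPy sketch below shows a global CLIP-style symmetric contrastive loss and a FILIP-style token-wise (late-interaction) similarity; all function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def token_wise_similarity(img_tokens, txt_tokens):
    """FILIP-style late interaction (illustrative): for each text token,
    take its max similarity over all image tokens, then average.
    Inputs are assumed to be L2-normalized token embeddings."""
    sim = img_tokens @ txt_tokens.T   # (n_img_tokens, n_txt_tokens)
    return sim.max(axis=0).mean()

def contrastive_logits(image_emb, text_emb, temperature=0.07):
    """Cosine-similarity logits between a batch of image and text embeddings."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    return image_emb @ text_emb.T / temperature

def clip_loss(logits):
    """Symmetric cross-entropy over image-to-text and text-to-image directions,
    with matched pairs on the diagonal as positives."""
    n = logits.shape[0]
    labels = np.arange(n)
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))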


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@1 | 67.0 | # 8 |
| Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@5 | 91.4 | # 8 |
| Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@10 | 96.7 | # 9 |
| Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@1 | 74.0 | # 7 |
| Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@5 | 94.4 | # 6 |
| Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@10 | 98.1 | # 6 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@1 | 49.2 | # 11 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@5 | 79.4 | # 12 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-B/32) | R@10 | 87.9 | # 12 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@1 | 53.4 | # 10 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@5 | 80.2 | # 11 |
| Zero-shot Image Retrieval | COCO-CN | Wukong (ViT-L/14) | R@10 | 90.1 | # 11 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@1 | 45.7 | # 13 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@5 | 73.8 | # 13 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@10 | 82.2 | # 13 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@1 | 51.7 | # 11 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@5 | 78.9 | # 11 |
| Zero-shot Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@10 | 86.3 | # 11 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@1 | 77.4 | # 9 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@5 | 94.5 | # 9 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-L/14) | R@10 | 97.0 | # 7 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@1 | 67.6 | # 10 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@5 | 89.6 | # 10 |
| Image Retrieval | Flickr30k-CN | Wukong (ViT-B/32) | R@10 | 94.2 | # 10 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@1 | 33.4 | # 8 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@5 | 59.3 | # 8 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@10 | 69.7 | # 8 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | Mean Recall | 54.1 | # 8 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@1 | 39.2 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@5 | 66.9 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | R@10 | 77.4 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-B/32) | Mean Recall | 61.2 | # 9 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@1 | 52.7 | # 6 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@5 | 77.9 | # 6 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@10 | 85.6 | # 6 |
| Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | Mean Recall | 72.1 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@1 | 42.7 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@5 | 69.0 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | R@10 | 78.0 | # 6 |
| Zero-shot Image Retrieval | MUGE Retrieval | Wukong (ViT-L/14) | Mean Recall | 63.2 | # 6 |
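The Mean Recall figures on the MUGE rows are simply the arithmetic mean of R@1, R@5, and R@10. For instance, the zero-shot Wukong (ViT-B/32) row works out as:

```python
# Mean Recall = average of R@1, R@5, R@10
# (values from the zero-shot MUGE row for Wukong (ViT-B/32))
recalls = {"R@1": 33.4, "R@5": 59.3, "R@10": 69.7}
mean_recall = sum(recalls.values()) / len(recalls)
print(round(mean_recall, 1))  # 54.1, matching the table
```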
