M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Vision-language foundation models such as CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) that support multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enabling multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands and yields a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest model, $M^2$-Encoder-10B, achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, and we are making it available to the research community for further exploration and development.
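
The abstract refers to an image-text contrastive loss but does not spell out the grouped aggregation scheme, so the sketch below does not attempt to reproduce it. It only illustrates, under our own assumptions, the standard CLIP-style symmetric contrastive loss on a single device, which is the kind of quantity such a scheme would compute across GPUs. The function names and the `temperature` default of 0.07 are illustrative choices, not values taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (hypothetical helper).

    Assumes image_embs[i] and text_embs[i] form a matched pair and that
    every other pairing in the batch is a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    embeddings = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(embeddings, embeddings))        # matched pairs: close to zero
    print(contrastive_loss(embeddings, embeddings[::-1]))  # swapped pairs: large
```

In this single-device form the full batch-by-batch similarity matrix is materialized at once, which is the part that becomes expensive in communication and memory at large batch sizes.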
Results from the Paper
Ranked #1 on Zero-shot Image Retrieval on Flickr30k-CN (using extra training data)
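| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | M2-Encoder | Image-to-text R@1 | 72.8 | # 2 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | M2-Encoder | Image-to-text R@5 | 92.3 | # 1 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | M2-Encoder | Image-to-text R@10 | 96.3 | # 1 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | M2-Encoder | Text-to-image R@1 | 56.5 | # 2 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | M2-Encoder | Text-to-image R@5 | 81.6 | # 1 |
| Zero-Shot Cross-Modal Retrieval | COCO 2014 | M2-Encoder | Text-to-image R@10 | 88.8 | # 1 |
| Zero-shot Text-to-Image Retrieval | COCO-CN | M2-Encoder | Recall@1 | 78.7 | # 1 |
| Zero-shot Text-to-Image Retrieval | COCO-CN | M2-Encoder | Recall@5 | 96.0 | # 1 |
| Zero-shot Text-to-Image Retrieval | COCO-CN | M2-Encoder | Recall@10 | 98.7 | # 1 |
| Zero-shot Image Retrieval | COCO-CN | M2-Encoder | R@1 | 78.7 | # 1 |
| Zero-shot Image Retrieval | COCO-CN | M2-Encoder | R@5 | 96.0 | # 1 |
| Zero-shot Image Retrieval | COCO-CN | M2-Encoder | R@10 | 98.7 | # 1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | M2-Encoder | Image-to-text R@1 | 91.2 | # 6 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | M2-Encoder | Image-to-text R@5 | 99.2 | # 7 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | M2-Encoder | Image-to-text R@10 | 99.6 | # 11 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | M2-Encoder | Text-to-image R@1 | 92.2 | # 1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | M2-Encoder | Text-to-image R@5 | 99.5 | # 1 |
| Zero-Shot Cross-Modal Retrieval | Flickr30k | M2-Encoder | Text-to-image R@10 | 99.7 | # 1 |
| Zero-shot Image Retrieval | Flickr30k-CN | M2-Encoder | R@1 | 81.5 | # 1 |
| Zero-shot Image Retrieval | Flickr30k-CN | M2-Encoder | R@5 | 96.2 | # 1 |
| Zero-shot Image Retrieval | Flickr30k-CN | M2-Encoder | R@10 | 98.5 | # 1 |
| Zero-Shot Transfer Image Classification | ImageNet | M2-Encoder | Accuracy (Private) | 88.5 | # 1 |
| Zero-Shot Learning | ImageNet_CN | $M^2$-Encoder | Accuracy | 80.7 | # 1 |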
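
The retrieval rows above report R@1, R@5, and R@10. As a rough illustration of how such a metric is typically computed, the sketch below scores a query as a hit when its ground-truth candidate appears among the K most similar candidates. It assumes exactly one correct candidate per query, located at the same index as the query; benchmarks such as COCO and Flickr30k pair each image with several captions, so the official evaluation protocols differ in that detail. The function name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j;
    candidate i is assumed to be the single correct match for query i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


if __name__ == "__main__":
    # Toy scores: queries 0 and 2 rank their match first, query 1 ranks it second.
    toy_similarity = [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.1, 0.3, 0.7],
    ]
    print(recall_at_k(toy_similarity, 1))
    print(recall_at_k(toy_similarity, 2))
```

In the toy matrix, two of the three queries are hits at K=1 and all three are hits at K=2, which shows why recall can only stay the same or rise as K grows.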