InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to ViT-22B. We hope that our research can contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.


Results from the Paper


 Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)

Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data | Benchmark
Zero-Shot Transfer Image Classification CN-ImageNet InternVL-C Accuracy (Private) 64.5 # 1
Zero-Shot Cross-Modal Retrieval COCO 2014 InternVL-G Image-to-text R@1 74.9 # 1
Image-to-text R@5 91.3 # 2
Image-to-text R@10 95.2 # 3
Text-to-image R@1 58.6 # 1
Text-to-image R@5 81.3 # 2
Text-to-image R@10 88.0 # 2
Zero-Shot Cross-Modal Retrieval COCO 2014 InternVL-C Image-to-text R@1 70.6 # 4
Image-to-text R@5 89.0 # 6
Image-to-text R@10 93.5 # 6
Text-to-image R@1 54.1 # 3
Text-to-image R@5 77.3 # 4
Text-to-image R@10 84.6 # 4
Zero-shot Image Retrieval COCO-CN InternVL-C R@1 68.9 # 5
R@5 91.9 # 3
R@10 96.5 # 4
Zero-shot Image Retrieval COCO-CN InternVL-G R@1 73.8 # 2
R@5 94.4 # 2
R@10 98.1 # 2
Image-to-Text Retrieval Flickr30k InternVL-C-FT (finetuned, w/o ranking) Recall@1 97.2 # 4
Recall@5 100 # 1
Recall@10 100 # 1
Zero-Shot Cross-Modal Retrieval Flickr30k InternVL-C Image-to-text R@1 94.7 # 3
Image-to-text R@5 99.6 # 3
Image-to-text R@10 99.9 # 2
Text-to-image R@1 81.7 # 4
Text-to-image R@5 96.0 # 4
Text-to-image R@10 98.2 # 3
Zero-Shot Cross-Modal Retrieval Flickr30k InternVL-G Image-to-text R@1 95.7 # 1
Image-to-text R@5 99.7 # 2
Image-to-text R@10 99.9 # 2
Text-to-image R@1 85.0 # 3
Text-to-image R@5 97.0 # 2
Text-to-image R@10 98.6 # 2
Image-to-Text Retrieval Flickr30k InternVL-G-FT (finetuned, w/o ranking) Recall@1 97.9 # 1
Recall@5 100 # 1
Recall@10 100 # 1
Image Retrieval Flickr30k-CN InternVL-G-FT R@1 85.9 # 1
R@5 98.7 # 1
R@10 97.1 # 6
Zero-shot Image Retrieval Flickr30k-CN InternVL-C R@1 75.1 # 3
R@5 92.9 # 3
R@10 96.4 # 3
Zero-shot Image Retrieval Flickr30k-CN InternVL-G R@1 77.7 # 2
R@5 94.8 # 2
R@10 97.3 # 2
Image Retrieval Flickr30k-CN InternVL-C-FT R@1 85.2 # 2
R@5 98.5 # 2
R@10 97.0 # 7
Zero-Shot Transfer Image Classification Food-101 InternVL-C Top 1 Accuracy 95.3 # 3
Zero-Shot Transfer Image Classification ImageNet InternVL-C Accuracy (Private) 83.2 # 11
Zero-Shot Transfer Image Classification ImageNet-A InternVL-C Accuracy (Private) 83.8 # 7
Zero-Shot Transfer Image Classification ImageNet-Sketch InternVL-C Accuracy (Private) 73.9 # 5
Zero-Shot Transfer Image Classification ImageNet V2 InternVL-C Accuracy (Private) 77.3 # 8
Zero-Shot Video Retrieval MSR-VTT-full InternVL-G text-to-video R@1 46.3 # 1
text-to-video R@5 70.5 # 1
text-to-video R@10 79.6 # 1
video-to-text R@1 42.4 # 2
video-to-text R@5 65.9 # 2
video-to-text R@10 75.4 # 2
Zero-Shot Video Retrieval MSR-VTT-full InternVL-C text-to-video R@1 44.7 # 2
text-to-video R@5 68.2 # 2
text-to-video R@10 78.4 # 2
video-to-text R@1 40.2 # 3
video-to-text R@5 63.1 # 3
video-to-text R@10 74.1 # 3
Zero-Shot Transfer Image Classification ObjectNet InternVL-C Accuracy (Private) 80.6 # 6
Zero-shot Image Retrieval XTD10 InternVL-G EN-Recall@10 98.6 # 1
ES-Recall@10 97.7 # 1
FR-Recall@10 96.5 # 1
ZH-Recall@10 96.7 # 1
KO-Recall@10 95.1 # 1
RU-Recall@10 94.8 # 1
JA-Recall@10 96.1 # 1
IT-Recall@10 96.9 # 1
Zero-shot Image Retrieval XTD10 InternVL-C EN-Recall@10 97.3 # 2
ES-Recall@10 95.7 # 2
FR-Recall@10 95.1 # 2
ZH-Recall@10 95.6 # 2
KO-Recall@10 92.2 # 3
RU-Recall@10 93.3 # 2
JA-Recall@10 95.5 # 2
IT-Recall@10 96.0 # 2
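
Most rows above report Recall@K (R@1, R@5, R@10) for retrieval. As a reading aid, here is a minimal sketch of how such a metric is commonly computed from a query-to-candidate score matrix. The function name `recall_at_k`, the toy scores, and the ground-truth sets are illustrative assumptions; this is not the evaluation code used for InternVL.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    similarity[q][c] is a hypothetical query-to-candidate score (higher is better);
    ground_truth[q] is the set of candidate indices counted as correct for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending score and keep the first k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if any(c in ground_truth[q] for c in top_k):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries against 4 images (made-up scores, not from the paper).
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.4, 0.2, 0.8, 0.5],  # correct image 1 is ranked last
    [0.1, 0.3, 0.6, 0.7],  # correct image 2 is ranked second
]
ground_truth = [{0}, {1}, {2}]

r1 = recall_at_k(similarity, ground_truth, 1)
r2 = recall_at_k(similarity, ground_truth, 2)
r4 = recall_at_k(similarity, ground_truth, 4)
```

Because the top-k set only grows with k, Recall@K under this definition can never decrease as K increases. That makes it a handy sanity check when reading rows such as the fine-tuned Flickr30k-CN entries, where the listed R@10 is lower than R@5.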
