TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Transfer Image Classification	CN-ImageNet	InternVL-C	Accuracy (Private)	64.5	# 1
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-G	Image-to-text R@1	74.9	# 1
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-G	Image-to-text R@5	91.3	# 2
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-G	Image-to-text R@10	95.2	# 3
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-G	Text-to-image R@1	58.6	# 1
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-G	Text-to-image R@5	81.3	# 2
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-G	Text-to-image R@10	88.0	# 2
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-C	Image-to-text R@1	70.6	# 4
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-C	Image-to-text R@5	89.0	# 6
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-C	Image-to-text R@10	93.5	# 6
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-C	Text-to-image R@1	54.1	# 3
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-C	Text-to-image R@5	77.3	# 4
Zero-Shot Cross-Modal Retrieval	COCO 2014	InternVL-C	Text-to-image R@10	84.6	# 4
Zero-shot Image Retrieval	COCO-CN	InternVL-C	R@1	68.9	# 5
Zero-shot Image Retrieval	COCO-CN	InternVL-C	R@5	91.9	# 3
Zero-shot Image Retrieval	COCO-CN	InternVL-C	R@10	96.5	# 4
Zero-shot Image Retrieval	COCO-CN	InternVL-G	R@1	73.8	# 2
Zero-shot Image Retrieval	COCO-CN	InternVL-G	R@5	94.4	# 2
Zero-shot Image Retrieval	COCO-CN	InternVL-G	R@10	98.1	# 2
Image-to-Text Retrieval	Flickr30k	InternVL-C-FT (finetuned, w/o ranking)	Recall@1	97.2	# 4
Image-to-Text Retrieval	Flickr30k	InternVL-C-FT (finetuned, w/o ranking)	Recall@5	100	# 1
Image-to-Text Retrieval	Flickr30k	InternVL-C-FT (finetuned, w/o ranking)	Recall@10	100	# 1
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-C	Image-to-text R@1	94.7	# 3
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-C	Image-to-text R@5	99.6	# 3
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-C	Image-to-text R@10	99.9	# 2
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-C	Text-to-image R@1	81.7	# 4
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-C	Text-to-image R@5	96.0	# 4
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-C	Text-to-image R@10	98.2	# 3
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-G	Image-to-text R@1	95.7	# 1
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-G	Image-to-text R@5	99.7	# 2
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-G	Image-to-text R@10	99.9	# 2
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-G	Text-to-image R@1	85.0	# 3
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-G	Text-to-image R@5	97.0	# 2
Zero-Shot Cross-Modal Retrieval	Flickr30k	InternVL-G	Text-to-image R@10	98.6	# 2
Image-to-Text Retrieval	Flickr30k	InternVL-G-FT (finetuned, w/o ranking)	Recall@1	97.9	# 1
Image-to-Text Retrieval	Flickr30k	InternVL-G-FT (finetuned, w/o ranking)	Recall@5	100	# 1
Image-to-Text Retrieval	Flickr30k	InternVL-G-FT (finetuned, w/o ranking)	Recall@10	100	# 1
Image Retrieval	Flickr30k-CN	InternVL-G-FT	R@1	85.9	# 1
Image Retrieval	Flickr30k-CN	InternVL-G-FT	R@5	98.7	# 1
Image Retrieval	Flickr30k-CN	InternVL-G-FT	R@10	97.1	# 6
Zero-shot Image Retrieval	Flickr30k-CN	InternVL-C	R@1	75.1	# 3
Zero-shot Image Retrieval	Flickr30k-CN	InternVL-C	R@5	92.9	# 3
Zero-shot Image Retrieval	Flickr30k-CN	InternVL-C	R@10	96.4	# 3
Zero-shot Image Retrieval	Flickr30k-CN	InternVL-G	R@1	77.7	# 2
Zero-shot Image Retrieval	Flickr30k-CN	InternVL-G	R@5	94.8	# 2
Zero-shot Image Retrieval	Flickr30k-CN	InternVL-G	R@10	97.3	# 2
Image Retrieval	Flickr30k-CN	InternVL-C-FT	R@1	85.2	# 2
Image Retrieval	Flickr30k-CN	InternVL-C-FT	R@5	98.5	# 2
Image Retrieval	Flickr30k-CN	InternVL-C-FT	R@10	97.0	# 7
Zero-Shot Transfer Image Classification	Food-101	InternVL-C	Top 1 Accuracy	95.3	# 3
Zero-Shot Transfer Image Classification	ImageNet	InternVL-C	Accuracy (Private)	83.2	# 11
Zero-Shot Transfer Image Classification	ImageNet-A	InternVL-C	Accuracy (Private)	83.8	# 7
Zero-Shot Transfer Image Classification	ImageNet-Sketch	InternVL-C	Accuracy (Private)	73.9	# 5
Zero-Shot Transfer Image Classification	ImageNet V2	InternVL-C	Accuracy (Private)	77.3	# 8
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-G	text-to-video R@1	46.3	# 1
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-G	text-to-video R@5	70.5	# 1
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-G	text-to-video R@10	79.6	# 1
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-G	video-to-text R@1	42.4	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-G	video-to-text R@5	65.9	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-G	video-to-text R@10	75.4	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-C	text-to-video R@1	44.7	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-C	text-to-video R@5	68.2	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-C	text-to-video R@10	78.4	# 2
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-C	video-to-text R@1	40.2	# 3
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-C	video-to-text R@5	63.1	# 3
Zero-Shot Video Retrieval	MSR-VTT-full	InternVL-C	video-to-text R@10	74.1	# 3
Zero-Shot Transfer Image Classification	ObjectNet	InternVL-C	Accuracy (Private)	80.6	# 6
Zero-shot Image Retrieval	XTD10	InternVL-G	EN-Recall@10	98.6	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	ES-Recall@10	97.7	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	FR-Recall@10	96.5	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	ZH-Recall@10	96.7	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	KO-Recall@10	95.1	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	RU-Recall@10	94.8	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	JA-Recall@10	96.1	# 1
Zero-shot Image Retrieval	XTD10	InternVL-G	IT-Recall@10	96.9	# 1
Zero-shot Image Retrieval	XTD10	InternVL-C	EN-Recall@10	97.3	# 2
Zero-shot Image Retrieval	XTD10	InternVL-C	ES-Recall@10	95.7	# 2
Zero-shot Image Retrieval	XTD10	InternVL-C	FR-Recall@10	95.1	# 2
Zero-shot Image Retrieval	XTD10	InternVL-C	ZH-Recall@10	95.6	# 2
Zero-shot Image Retrieval	XTD10	InternVL-C	KO-Recall@10	92.2	# 3
Zero-shot Image Retrieval	XTD10	InternVL-C	RU-Recall@10	93.3	# 2
Zero-shot Image Retrieval	XTD10	InternVL-C	JA-Recall@10	95.5	# 2
Zero-shot Image Retrieval	XTD10	InternVL-C	IT-Recall@10	96.0	# 2
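
Most of the retrieval rows above report Recall@K (R@K). As a rough illustration only, the sketch below computes R@K under the common definition: the percentage of queries for which at least one relevant item appears among the K highest-scoring candidates. The function name `recall_at_k`, the similarity scores, and the ground-truth sets are all hypothetical and are not taken from the InternVL code base; each benchmark's exact protocol (for example, several captions per image) may differ.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    similarity[i][j] is the score between query i and candidate j;
    relevant[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up): 3 queries, 4 candidates each.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # relevant candidate 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # relevant candidate 1 is ranked 3rd
    [0.1, 0.2, 0.3, 0.7],  # relevant candidate 3 is ranked 1st
]
gold = [{0}, {1}, {3}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(sim, gold, k):.1f}")
```

Under this definition, the value cannot decrease as K grows, because the top-K set for a larger K always contains the top-K set for a smaller one.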

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-cn)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-cn?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-coco-2014?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-cross-modal-retrieval-on-flickr30k)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/image-to-text-retrieval-on-flickr30k)](https://paperswithcode.com/sota/image-to-text-retrieval-on-flickr30k?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-video-retrieval-on-msr-vtt-full)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt-full?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-xtd10)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-xtd10?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-coco-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-coco-cn?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-image-retrieval-on-flickr30k-cn)](https://paperswithcode.com/sota/zero-shot-image-retrieval-on-flickr30k-cn?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-17)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-17?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-8)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-8?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-6)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-6?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-5)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-5?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-3)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-3?p=internvl-scaling-up-vision-foundation-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/internvl-scaling-up-vision-foundation-models/zero-shot-transfer-image-classification-on-1)](https://paperswithcode.com/sota/zero-shot-transfer-image-classification-on-1?p=internvl-scaling-up-vision-foundation-models)`

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

21 Dec 2023 · Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multimodal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval; it can also be linked with LLMs to create multimodal dialogue systems. It has powerful visual capabilities and can be a good alternative to ViT-22B. We hope that our research can contribute to the development of multimodal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.

Code

opengvlab/internvl (official, 895 stars)

opengvlab/internvl-mmdetseg

Tasks

Image Retrieval

Image-to-Text Retrieval

Language Modelling

Large Language Model

Retrieval

Text Retrieval

Video Classification

Video-Text Retrieval

Zero-Shot Cross-Modal Retrieval

Zero-shot Image Retrieval

Zero-Shot Transfer Image Classification

Zero-Shot Video Retrieval

Datasets

CIFAR-10

ImageNet

MS COCO

CIFAR-100

MNIST

Visual Question Answering

Oxford 102 Flower

ADE20K

STL-10

Flickr30k

Stanford Cars

DTD

Food-101

Caltech-101

MSR-VTT

EuroSAT

GQA

FGVC-Aircraft

ImageNet-R

ImageNet-A

GTSRB

OK-VQA

ImageNet-Sketch

TextVQA

VisDial

NoCaps

VizWiz

FER2013

LAION-5B

ObjectNet

RESISC45

CC12M

A-OKVQA

Kinetics-700

ST-VQA

ChartQA

Birdsnap

TextCaps

AI2D

JFT-3B

IconQA

COCO-CN

Wukong

Flickr30k-CNA

XTD10

Results from the Paper

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)
