TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	ActivityNet-QA	GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)	Accuracy	61.2	# 1
Video Question Answering	ActivityNet-QA	GPT-2 + CLIP-32 (Zero-Shot)	Accuracy	58.4	# 2
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M (fine-tuned, BS=5)	Accuracy	18.3	# 139
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M (fine-tuned, BS=5)	Parameters (Billion)	0.355	# 2
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M (BS=5)	Accuracy	12.2	# 146
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M (BS=5)	Parameters (Billion)	0.355	# 2
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M + question-solution classifier (BS=5)	Accuracy	20.8	# 137
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M + question-solution classifier (BS=5)	Parameters (Billion)	0.355	# 2
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M + question-solution classifier (BS=1)	Accuracy	16.8	# 144
Arithmetic Reasoning	GSM8K	GPT-2-Medium 355M + question-solution classifier (BS=1)	Parameters (Billion)	0.355	# 2
Image Generation	ImageNet 64x64	GLIDE + CLS	Inception Score	22.077	# 8
Image Generation	ImageNet 64x64	GLIDE + CLS	FID	30.871	# 17
Image Generation	ImageNet 64x64	GLIDE + CLIP + CLS + CLS-FREE	Inception Score	34.952	# 5
Image Generation	ImageNet 64x64	GLIDE + CLIP + CLS + CLS-FREE	FID	29.184	# 14
Image Generation	ImageNet 64x64	GLIDE + CLIP + CLS + CLS-FREE	KID	3.766	# 1
Image Generation	ImageNet 64x64	GLIDE + CLS-FREE	Inception Score	25.926	# 6
Image Generation	ImageNet 64x64	GLIDE + CLS-FREE	FID	29.219	# 15
Image Generation	ImageNet 64x64	GLIDE + CLS-FREE	KID	5.325	# 2
Image Generation	ImageNet 64x64	GLIDE + CLS	KID	7.952	# 4
Image Generation	ImageNet 64x64	GLIDE + CLIP	Inception Score	25.017	# 7
Image Generation	ImageNet 64x64	GLIDE + CLIP	FID	30.462	# 16
Image Generation	ImageNet 64x64	GLIDE + CLIP	KID	6.174	# 3

Composing Ensembles of Pre-trained Models via Iterative Consensus

20 Oct 2022 · Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Igor Mordatch

Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic photos but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. Project page: https://energy-based-model.github.io/composing-pretrained-models.
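The generator/scorer loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the character-level `toy_generator`, the two hand-written scorers, the unweighted sum used as the "consensus", and the beam-style selection are all assumptions chosen to keep the example self-contained. In the paper's setting, the generator and scorers would be pre-trained models such as GPT-2 and CLIP.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]
Generator = Callable[[str], List[str]]


def consensus_score(candidate: str, scorers: Sequence[Scorer]) -> float:
    """Combine scorer feedback by an unweighted sum (assumed; higher is better)."""
    return sum(scorer(candidate) for scorer in scorers)


def iterative_consensus(
    propose: Generator,
    scorers: Sequence[Scorer],
    start: str,
    steps: int,
    beam_size: int,
) -> str:
    """Repeatedly extend proposals and keep those the scorers jointly prefer."""
    beam = [start]
    for _ in range(steps):
        # Generator step: every kept proposal is extended into new candidates.
        candidates = [c for partial in beam for c in propose(partial)]
        if not candidates:
            break
        # Scorer step: rank candidates by the ensemble's combined score.
        # sorted() is stable, so ties keep their generation order.
        ranked = sorted(
            candidates,
            key=lambda c: consensus_score(c, scorers),
            reverse=True,
        )
        beam = ranked[:beam_size]
    return beam[0]


# Hypothetical stand-ins for pre-trained models.
def toy_generator(partial: str) -> List[str]:
    """Propose one-character extensions over a two-letter vocabulary."""
    return [partial + ch for ch in "ab"]


def likes_a(text: str) -> float:
    """Scorer 1: rewards each occurrence of 'a'."""
    return float(text.count("a"))


def dislikes_repeats(text: str) -> float:
    """Scorer 2: penalizes each pair of identical adjacent characters."""
    return -2.0 * sum(1 for x, y in zip(text, text[1:]) if x == y)


single_result = iterative_consensus(
    toy_generator, [likes_a], start="", steps=3, beam_size=2
)
ensemble_result = iterative_consensus(
    toy_generator, [likes_a, dislikes_repeats], start="", steps=3, beam_size=2
)
print(single_result, ensemble_result)
```

With a single scorer the loop converges on whatever that scorer alone prefers, while adding the second scorer steers the result toward a proposal both accept. This is the intuition behind the claim that an ensemble of scorers can outperform a single scorer.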

Tasks

Arithmetic Reasoning

Image Generation

Math

Mathematical Reasoning

Question Answering

Video Question Answering

Datasets

ImageNet

GSM8K

WebText

ActivityNet-QA

Results from the Paper

Ranked #1 on Video Question Answering on ActivityNet-QA

Methods

Adam • Attention Dropout • BPE • Cosine Annealing • Dense Connections • Dropout • Fixed Factorized Attention • GELU • GPT-3 • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Strided Attention • Weight Decay
