TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering	MMBench	DreamLLM-7B	GPT-3.5 score	49.9	# 1
Visual Question Answering	MM-Vet	DreamLLM-7B	GPT-4 score	35.9	# 57
Visual Question Answering	MM-Vet	DreamLLM-7B	Params	7B	# 1

DreamLLM: Synergistic Multimodal Comprehension and Creation

20 Sep 2023 · Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, HongYu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi

This paper presents DreamLLM, a learning framework that is the first to achieve versatile Multimodal Large Language Models (MLLMs) empowered by the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first is the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, yielding a more thorough multimodal understanding. The second is the generation of raw, interleaved documents, modeling both text and image content along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, which benefits from the enhanced learning synergy. Project page: https://dreamllm.github.io.

Code

RunpeiDong/DreamLLM official

312

Tasks

multimodal generation

Visual Question Answering

Zero-Shot Learning

Zero-Shot Text-to-Image Generation

Datasets

MS COCO

Visual Question Answering

MMLU

HellaSwag

BoolQ

PIQA

WinoGrande

OK-VQA

TextVQA

VizWiz

MMBench

MM-Vet

SIQA

Image Paragraph Captioning

MMC4

LAION COCO

Results from the Paper

Ranked #1 on Visual Question Answering on MMBench (GPT-3.5 score metric)

Methods

CLIP
