LAFITE: Towards Language-Free Training for Text-to-Image Generation

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is both time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tune pre-trained models, saving both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, with only around 1% of the model size and training data size of the recently proposed large DALL-E model.
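As a rough illustration of this idea, the sketch below follows the paper's fixed-perturbation scheme for language-free training: a CLIP image feature is normalized onto the unit hypersphere and perturbed with scaled Gaussian noise to stand in for a text feature. The function name, the noise_level value, and the final renormalization step are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def pseudo_text_feature(image_features: torch.Tensor, noise_level: float = 0.1) -> torch.Tensor:
    """Synthesize pseudo text features from CLIP image features
    (fixed-perturbation variant; noise_level is an illustrative choice)."""
    # Normalize image features onto the unit hypersphere, as CLIP features are compared by cosine similarity.
    h = image_features / image_features.norm(dim=-1, keepdim=True)
    # Add unit-norm Gaussian noise scaled by noise_level.
    eps = torch.randn_like(h)
    eps = eps / eps.norm(dim=-1, keepdim=True)
    h_tilde = h + noise_level * eps
    # Renormalize so the pseudo text feature also lies on the hypersphere (an assumption here).
    return h_tilde / h_tilde.norm(dim=-1, keepdim=True)

# Example: 4 stand-in 512-dim embeddings; in practice these would come from
# a CLIP image encoder, e.g. clip_model.encode_image(images).
feats = torch.randn(4, 512)
print(pseudo_text_feature(feats).shape)  # torch.Size([4, 512])
```

The pseudo text features produced this way can condition the generator in place of real caption embeddings, which is what removes the need for paired text during training.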

Task                     | Dataset               | Model              | Metric          | Value | Global Rank
-------------------------|-----------------------|--------------------|-----------------|-------|------------
Text-to-Image Generation | CUB                   | Lafite             | FID             | 10.48 | #6
Text-to-Image Generation | CUB                   | Lafite             | Inception score | 5.97  | #2
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID             | 26.94 | #53
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | Inception score | 26.02 | #14
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-1           | 22.97 | #2
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-2           | 18.70 | #2
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-4           | 15.72 | #1
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-8           | 14.79 | #1
Text-to-Image Generation | MS COCO               | Lafite             | FID             | 8.12  | #22
Text-to-Image Generation | MS COCO               | Lafite             | Inception score | 32.34 | #6
Text-to-Image Generation | MS COCO               | Lafite             | SOA-C           | 61.09 | #1
Text-to-Image Generation | Multi-Modal-CelebA-HQ | Lafite             | FID             | 12.54 | #2