LAFITE: Towards Language-Free Training for Text-to-Image Generation

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is both time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tune pre-trained models, saving both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, with only around 1% of the model size and training data size of the recently proposed large DALL-E model.
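As a rough illustration of this idea, the sketch below follows the paper's fixed-perturbation scheme for language-free training: a CLIP image feature is normalized onto the unit hypersphere and perturbed with scaled Gaussian noise to stand in for a text feature. The function name, the noise_level value, and the final renormalization step are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def pseudo_text_feature(image_features: torch.Tensor, noise_level: float = 0.1) -> torch.Tensor:
    """Synthesize pseudo text features from CLIP image features
    (fixed-perturbation variant; noise_level is an illustrative choice)."""
    # Normalize image features onto the unit hypersphere, as CLIP features are compared by cosine similarity.
    h = image_features / image_features.norm(dim=-1, keepdim=True)
    # Add unit-norm Gaussian noise scaled by noise_level.
    eps = torch.randn_like(h)
    eps = eps / eps.norm(dim=-1, keepdim=True)
    h_tilde = h + noise_level * eps
    # Renormalize so the pseudo text feature also lies on the hypersphere (an assumption here).
    return h_tilde / h_tilde.norm(dim=-1, keepdim=True)

# Example: 4 stand-in 512-dim embeddings; in practice these would come from
# a CLIP image encoder, e.g. clip_model.encode_image(images).
feats = torch.randn(4, 512)
print(pseudo_text_feature(feats).shape)  # torch.Size([4, 512])
```

The pseudo text features produced this way can condition the generator in place of real caption embeddings, which is what removes the need for paired text during training.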

Task                     | Dataset               | Model              | Metric          | Value | Global Rank
-------------------------|-----------------------|--------------------|-----------------|-------|------------
Text-to-Image Generation | CUB                   | Lafite             | FID             | 10.48 | #6
Text-to-Image Generation | CUB                   | Lafite             | Inception score | 5.97  | #2
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID             | 26.94 | #53
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | Inception score | 26.02 | #14
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-1           | 22.97 | #2
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-2           | 18.70 | #2
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-4           | 15.72 | #1
Text-to-Image Generation | MS COCO               | Lafite (zero-shot) | FID-8           | 14.79 | #1
Text-to-Image Generation | MS COCO               | Lafite             | FID             | 8.12  | #22
Text-to-Image Generation | MS COCO               | Lafite             | Inception score | 32.34 | #6
Text-to-Image Generation | MS COCO               | Lafite             | SOA-C           | 61.09 | #1
Text-to-Image Generation | Multi-Modal-CelebA-HQ | Lafite             | FID             | 12.54 | #2