Generative Adversarial Transformers

1 Mar 2021  ·  Drew A. Hudson, C. Lawrence Zitnick

We introduce the GANsformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at
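The bipartite attention and multiplicative integration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes identity query/key/value projections, a single attention head, and hypothetical choices of scale (`1 + attended`) and bias (`attended`), and it shows only the latents-to-image direction. The function name `bipartite_modulate` and its helpers are invented for illustration. The point it tries to convey is that each of the n image positions attends to only k latents, so the cost grows as n × k rather than n × n.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(vec, eps=1e-8):
    """Zero-mean, unit-variance normalization of one feature vector."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def bipartite_modulate(features, latents):
    """Update n image features using attention over k latents.

    features: list of n vectors of dimension d (image positions)
    latents:  list of k vectors of dimension d (latent variables)
    Returns (updated_features, attention_weights).
    """
    d = len(latents[0])
    updated, all_weights = [], []
    for x in features:  # n positions, each doing k dot products: O(n * k)
        scores = [
            sum(a * b for a, b in zip(x, y)) / math.sqrt(d) for y in latents
        ]
        weights = softmax(scores)  # soft assignment of this position to latents
        attended = [
            sum(w * y[c] for w, y in zip(weights, latents)) for c in range(d)
        ]
        x_norm = normalize(x)
        # Multiplicative integration (assumed form): the attended latent
        # information scales and shifts the normalized feature instead of
        # being added to it as in a standard transformer residual update.
        updated.append(
            [(1.0 + a) * xn + a for a, xn in zip(attended, x_norm)]
        )
        all_weights.append(weights)
    return updated, all_weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    lats = [[1.0, 0.0], [0.0, 1.0]]
    new_feats, attn = bipartite_modulate(feats, lats)
    for row in attn:
        print([round(w, 3) for w in row])
```

Because each position receives its own attention weights, different image regions can be modulated by different latents, whereas a single global style vector would modulate every position identically.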


Results from the Paper

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Generation | Cityscapes | StyleGAN2 | FID-10k-training-steps | 8.35 | #2 |
| Image Generation | Cityscapes | VQGAN | FID-10k-training-steps | 173.7971 | #5 |
| Image Generation | Cityscapes | SAGAN | FID-10k-training-steps | 12.8077 | #4 |
| Image Generation | Cityscapes | GAN | FID-10k-training-steps | 11.5652 | #3 |
| Image Generation | Cityscapes | GANsformer | FID-10k-training-steps | 5.7589 | #1 |
| Image Generation | CLEVR | StyleGAN2 | FID-5k-training-steps | 16.0534 | #2 |
| Image Generation | CLEVR | VQGAN | FID-5k-training-steps | 32.6031 | #5 |
| Image Generation | CLEVR | SAGAN | FID-5k-training-steps | 26.0433 | #4 |
| Image Generation | CLEVR | GAN | FID-5k-training-steps | 25.0244 | #3 |
| Image Generation | CLEVR | GANsformer | FID-5k-training-steps | 9.1679 | #1 |
| Image Generation | FFHQ | GANsformer | FID-10k-training-steps | 12.8478 | #2 |
| Image Generation | FFHQ | VQGAN | FID-10k-training-steps | 63.1165 | #5 |
| Image Generation | FFHQ | SAGAN | FID-10k-training-steps | 16.2069 | #4 |
| Image Generation | FFHQ | GAN | FID-10k-training-steps | 13.1844 | #3 |
| Image Generation | FFHQ | StyleGAN2 | FID-10k-training-steps | 10.8309 | #1 |
| Image Generation | LSUN Bedroom 256 x 256 | GANsformer | FID-10k-training-steps | 6.5085 | #1 |
| Image Generation | LSUN Bedroom 256 x 256 | SAGAN | FID-10k-training-steps | 14.0595 | #4 |
| Image Generation | LSUN Bedroom 256 x 256 | StyleGAN2 | FID-10k-training-steps | 11.5255 | #2 |
| Image Generation | LSUN Bedroom 256 x 256 | VQGAN | FID-10k-training-steps | 59.6333 | #5 |
| Image Generation | LSUN Bedroom 256 x 256 | GAN | FID-10k-training-steps | 12.1567 | #3 |