Photographic Image Synthesis with Cascaded Refinement Networks

ICCV 2017 · Qifeng Chen, Vladlen Koltun

We present an approach to synthesizing photographic images conditioned on semantic layouts. Given a semantic label map, our approach produces an image with photographic appearance that conforms to the input layout. The approach thus functions as a rendering engine that takes a two-dimensional semantic specification of the scene and produces a corresponding photographic image. Unlike recent and contemporaneous work, our approach does not rely on adversarial training. We show that photographic images can be synthesized from semantic layouts by a single feedforward network with appropriate structure, trained end-to-end with a direct regression objective. The presented approach scales seamlessly to high resolutions; we demonstrate this by synthesizing photographic images at 2-megapixel resolution, the full resolution of our training data. Extensive perceptual experiments on datasets of outdoor and indoor scenes demonstrate that images synthesized by the presented approach are considerably more realistic than those produced by alternative approaches. The results are shown in the supplementary video at https://youtu.be/0fhUJT21-bs
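The coarse-to-fine structure described above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration of the cascade only: each module receives the semantic layout downsampled to its resolution together with the upsampled features of the previous module, and the last module's output is read as an image. The function names, channel widths, the 1x1 per-pixel linear map (standing in for the paper's convolutions), and the random weights are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample_labels(labels, factor):
    # Nearest-neighbor downsampling of a one-hot (K, H, W) label map.
    return labels[:, ::factor, ::factor]

def refine(feats, labels, weight):
    # One refinement module: concatenate the previous features with the
    # layout at this resolution, then apply a per-pixel linear map
    # followed by a ReLU (a stand-in for the paper's conv layers).
    x = np.concatenate([feats, labels], axis=0)        # (C_in + K, H, W)
    c, h, w = x.shape
    out = weight @ x.reshape(c, -1)                    # (C_out, H*W)
    return np.maximum(out.reshape(-1, h, w), 0.0)

def crn_forward(labels, widths, rng):
    """Cascade of refinement modules from coarse to full resolution.

    labels: one-hot semantic layout, shape (K, H, W), H and W powers of 2.
    widths: feature channels per module, e.g. [32, 16, 3]; the last
            module's 3-channel output is read as an RGB image.
    """
    k, h, w = labels.shape
    n = len(widths)
    h0, w0 = h // 2 ** (n - 1), w // 2 ** (n - 1)
    feats = np.zeros((0, h0, w0))   # the first module sees only the layout
    for i, c_out in enumerate(widths):
        factor = 2 ** (n - 1 - i)
        lab = downsample_labels(labels, factor)
        c_in = feats.shape[0] + k
        weight = rng.standard_normal((c_out, c_in)) * 0.1  # random: untrained sketch
        feats = refine(feats, lab, weight)
        if i < n - 1:
            feats = upsample2x(feats)  # hand features to the next, finer module
    return feats                       # (3, H, W) synthesized "image"

rng = np.random.default_rng(0)
layout = np.zeros((5, 16, 16))
layout[0] = 1.0                        # toy 5-class layout, all one class
img = crn_forward(layout, [32, 16, 3], rng)
print(img.shape)  # (3, 16, 16)
```

Because every module is an ordinary feedforward stage, the whole cascade can be trained end-to-end with a regression loss on the final output, which is what lets the approach avoid adversarial training.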

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Image-to-Image Translation | ADE20K-Outdoor Labels-to-Photos | CRN | mIoU | 16.5 | #5 |
| Image-to-Image Translation | ADE20K-Outdoor Labels-to-Photos | CRN | Accuracy | 68.6% | #4 |
| Image-to-Image Translation | ADE20K-Outdoor Labels-to-Photos | CRN | FID | 99.0 | #7 |
| Image-to-Image Translation | Cityscapes Labels-to-Photo | CRN | Per-pixel Accuracy | 77.1% | #6 |
| Image-to-Image Translation | Cityscapes Labels-to-Photo | CRN | mIoU | 52.4 | #11 |
| Image-to-Image Translation | Cityscapes Labels-to-Photo | CRN | FID | 104.7 | #15 |

Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Image-to-Image Translation | ADE20K Labels-to-Photos | CRN | mIoU | 22.4 | #8 |
| Image-to-Image Translation | ADE20K Labels-to-Photos | CRN | Accuracy | 68.8% | #7 |
| Image-to-Image Translation | ADE20K Labels-to-Photos | CRN | FID | 73.3 | #13 |
| Image-to-Image Translation | COCO-Stuff Labels-to-Photos | CRN | mIoU | 23.7 | #5 |
| Image-to-Image Translation | COCO-Stuff Labels-to-Photos | CRN | Accuracy | 40.4% | #6 |
| Image-to-Image Translation | COCO-Stuff Labels-to-Photos | CRN | FID | 70.4 | #13 |
