Generating a Temporally Coherent Visual Story by Multimodal Recurrent Transformers

ACL ARR January 2022  ·  Anonymous

Story visualization is a challenging text-to-image generation task because of the difficulty of rendering visual details from abstract text descriptions. Beyond image generation itself, the generator must also conform to the narrative of a multi-sentence story input. While prior work in this domain has focused on improving the semantic relevance between generated images and the input text, keeping the generated images temporally consistent remains a challenge. Moreover, existing generators are trained on single text-image pairs and fail to account for the many natural language captions that can describe a given image, leading to poor model generalization. To address these problems, we leverage a cyclic training methodology that uses pseudo-text descriptions as an intermediate step, decoupling an image's visual appearance from variations in its natural language descriptions. Additionally, to generate a semantically coherent image sequence, we introduce an explicit memory controller that augments the temporal coherence of images in the multimodal autoregressive transformer. Combining these components, we call our model Cyclic Story visualization by MultimodAl Recurrent Transformers, or C-SMART for short. Our method generates high-resolution, high-quality images and outperforms prior work by a significant margin across multiple evaluation metrics on the Pororo-SV dataset.
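To make the memory-controller idea concrete, here is a minimal, dependency-free sketch of one way a running memory state could carry context across image-generation steps so that successive frames stay temporally consistent. This is an illustrative toy, not the paper's implementation: the class name `MemoryController`, the gate weight `alpha`, and the simple linear blending rule are all assumptions for exposition; the actual model operates on transformer hidden states.

```python
# Illustrative sketch (NOT the paper's implementation): a toy memory
# controller that blends each frame's features with a running memory
# state, so that each generated frame is conditioned on the frames
# that came before it. `MemoryController` and `alpha` are hypothetical.

class MemoryController:
    def __init__(self, dim, alpha=0.5):
        self.memory = [0.0] * dim   # running memory state, starts empty
        self.alpha = alpha          # gate: how much past context to keep

    def step(self, frame_features):
        # Blend the incoming frame features with the stored memory,
        # then overwrite the memory with the blended result so the
        # next step sees the accumulated context.
        blended = [
            self.alpha * m + (1.0 - self.alpha) * f
            for m, f in zip(self.memory, frame_features)
        ]
        self.memory = blended
        return blended


ctrl = MemoryController(dim=2, alpha=0.5)
out1 = ctrl.step([1.0, 0.0])   # first frame: memory is still zero
out2 = ctrl.step([1.0, 0.0])   # second frame: pulled toward the memory
```

Feeding the same frame features twice yields different outputs (`out1` vs. `out2`) because the second step is conditioned on the memory written by the first, which is the basic property a temporal-coherence mechanism needs.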
