Generating a Temporally Coherent Image Sequence for a Story by Multimodal Recurrent Transformers

ACL ARR November 2021 · Anonymous

Story visualization is a challenging text-to-image generation task because of the difficulty of rendering visual details from abstract text descriptions. Beyond the difficulty of image generation itself, the generator also needs to conform to the narrative of a multi-sentence story input. While prior work in this domain has focused on improving the semantic relevance between generated images and input text, controlling the generated images to be temporally consistent remains a challenge. To generate a semantically coherent image sequence, we propose an explicit memory controller that augments the temporal coherence of images in a multimodal autoregressive transformer. We call the resulting model Story visualization by MultimodAl Recurrent Transformers, or SMART for short. Our method generates high-resolution, high-quality images, outperforming prior work by a significant margin across multiple evaluation metrics on the PororoSV dataset.
