no code implementations • 24 Nov 2023 • Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Tingting Gao, Jinzhi Wang
The goal of conditional image-to-video (cI2V) generation is to create a believable new video starting from a given condition, i.e., one image and text. Previous cI2V generation methods conventionally operate in RGB pixel space, which limits how well they model motion consistency and visual continuity.
no code implementations • 31 Aug 2022 • Zilun Zhang, Cuifeng Shen, Yuan Shen, Huixin Xiong, Xinyu Zhou
Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, the limited image input size of CLIP-like models (e.g., 224) means that subtle details are lost in the feature representation when high-resolution images (e.g., 2240) are used as input.