Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting video data at the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than training its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

ICCV 2023

Results from the Paper


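| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Generation | UCF-101 | PYoCo (Zero-shot, 64x64, unconditional) | Inception Score | 60.01 | #9 |
| Video Generation | UCF-101 | PYoCo (Zero-shot, 64x64, unconditional) | FVD16 | 310 | #13 |
| Video Generation | UCF-101 | PYoCo (Zero-shot, 64x64, text-conditional) | Inception Score | 47.76 | #15 |
| Video Generation | UCF-101 | PYoCo (Zero-shot, 64x64, text-conditional) | FVD16 | 355.19 | #19 |
| Text-to-Video Generation | UCF-101 | PYoCo (Zero-shot, 64x64) | FVD16 | 355.19 | #8 |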

Methods