Learning to Generate Videos Using Neural Uncertainty Priors
Predicting the future frames of a video is a challenging task, in part due to the underlying stochasticity of real-world phenomena. Previous approaches attempt to tackle this issue by estimating a latent prior that characterizes this stochasticity. However, the cost function used to train such systems derives its training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training when the predictive uncertainty is high. To this end, we introduce Neural Uncertainty Priors (NUP), which produce a stochastic quantification of the uncertainty predicted by the latent prior and use it to weight the MSE loss. We propose a variational framework that derives the NUP in a principled manner from a hierarchical Bayesian deep graphical model. We further propose a sequence discriminator that ensures the generated frames remain realistically plausible, even when the uncertainty is consistently high. Our experiments show that NUP enables more effective training than state-of-the-art models, especially when training sets are small, while achieving better video generation quality and diversity.
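The abstract does not specify the exact form of the uncertainty-weighted loss. As a rough illustration only, the sketch below shows one standard way to let a predicted uncertainty weight a per-pixel MSE, via the heteroscedastic Gaussian negative log-likelihood; the function name, the shape of `sigma`, and the NLL form are assumptions, not the paper's stated formulation.

```python
import torch

def uncertainty_weighted_mse(pred: torch.Tensor,
                             target: torch.Tensor,
                             sigma: torch.Tensor) -> torch.Tensor:
    """Illustrative uncertainty-weighted reconstruction loss (assumption,
    not the paper's exact objective).

    pred, target: predicted and ground-truth frames, same shape.
    sigma: per-pixel (or broadcastable per-frame) uncertainty estimate,
           here assumed to come from the latent prior; must be positive.
    """
    se = (pred - target) ** 2
    # Gaussian NLL with variance sigma^2: a large sigma down-weights the
    # squared-error term, but the log(sigma) penalty discourages the model
    # from declaring everything uncertain.
    return (se / (2.0 * sigma ** 2) + torch.log(sigma)).mean()

# Minimal usage example with dummy tensors standing in for video frames.
pred = torch.randn(4, 3, 64, 64)
target = torch.randn(4, 3, 64, 64)
sigma = torch.rand(4, 1, 1, 1) + 0.1  # positive, broadcast per frame
loss = uncertainty_weighted_mse(pred, target, sigma)
```

Under this form, frames the prior flags as highly uncertain contribute a smaller squared-error signal, which matches the abstract's motivation that raw MSE is sub-optimal when predictive uncertainty is high.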