TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Text-to-Video Generation	EvalCrafter Text-to-Video (ECTV) Dataset	Lavie	Visual Quality	52.83	# 4
Text-to-Video Generation	EvalCrafter Text-to-Video (ECTV) Dataset	Lavie	Total Score	234	# 2
Text-to-Video Generation	EvalCrafter Text-to-Video (ECTV) Dataset	Lavie	Text-to-Video Alignment	68.49	# 1
Text-to-Video Generation	EvalCrafter Text-to-Video (ECTV) Dataset	Lavie	Temporal Consistency	54.23	# 5
Text-to-Video Generation	EvalCrafter Text-to-Video (ECTV) Dataset	Lavie	Motion Quality	57.99	# 3
Text-to-Video Generation	UCF-101	LAVIE (Zero-shot, 320x512)	FVD16	526.30	# 9
Video Generation	UCF-101	LAVIE (320x512, text-conditional)	FVD16	526.30	# 26

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lavie-high-quality-video-generation-with/text-to-video-generation-on-evalcrafter-text)](https://paperswithcode.com/sota/text-to-video-generation-on-evalcrafter-text?p=lavie-high-quality-video-generation-with)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lavie-high-quality-video-generation-with/text-to-video-generation-on-ucf-101)](https://paperswithcode.com/sota/text-to-video-generation-on-ucf-101?p=lavie-high-quality-video-generation-with)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lavie-high-quality-video-generation-with/video-generation-on-ucf-101)](https://paperswithcode.com/sota/video-generation-on-ucf-101?p=lavie-high-quality-video-generation-with)`
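
The three badge snippets above share one URL pattern, differing only in the benchmark slug. As a minimal sketch, the helper below rebuilds a snippet from the two slugs that appear in those URLs. The function name `pwc_badge` and its parameter names are illustrative assumptions, not part of any official tooling.

```python
def pwc_badge(paper_slug: str, benchmark_slug: str) -> str:
    """Build badge markdown following the URL pattern of the snippets above.

    Both arguments are the slugs as they appear in the existing URLs;
    this helper itself is hypothetical.
    """
    # Image URL: a shields.io endpoint badge fed by the per-benchmark badge URL.
    badge = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the leaderboard page, with the paper passed as `p`.
    link = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge})]({link})"


if __name__ == "__main__":
    print(pwc_badge("lavie-high-quality-video-generation-with",
                    "video-generation-on-ucf-101"))
```

Called with the paper slug and `video-generation-on-ucf-101`, this reproduces the third snippet in the table character for character (without the surrounding backticks).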

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

26 Sep 2023 · Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) synthesize visually realistic and temporally coherent videos and b) preserve the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
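
The first insight in the abstract, temporal self-attention combined with rotary positional encoding, can be illustrated with a toy sketch. This is not the LaVie implementation: it uses plain Python lists instead of latent tensors, identity projections instead of learned query/key/value weights, and attends over the frames of a single spatial location. All function names here are illustrative assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by angles that scale with `pos`."""
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def temporal_self_attention(frames):
    """Attend across time for one spatial location.

    `frames` is a list of T feature vectors (one per frame). Queries and keys
    are the rotary-encoded features; values are the raw features. Learned
    projections are omitted (assumption made to keep the sketch short).
    """
    T, d = len(frames), len(frames[0])
    qk = [rope(f, t) for t, f in enumerate(frames)]
    out = []
    for t in range(T):
        w = softmax([dot(qk[t], qk[s]) / math.sqrt(d) for s in range(T)])
        out.append([sum(w[s] * frames[s][j] for s in range(T)) for j in range(d)])
    return out


if __name__ == "__main__":
    frames = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [0.5, 0.5, 0.0, 0.0]]
    for row in temporal_self_attention(frames):
        print([round(v, 3) for v in row])
```

The property that makes rotary encoding suit temporal modeling is that the query-key score depends only on the frame offset: `dot(rope(q, m), rope(k, n))` is unchanged when `m` and `n` are shifted by the same amount, so the attention pattern is driven by relative rather than absolute frame position.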

Code

Vchitect/LaVie official

arthur-qiu/freenoise-lavie

Tasks

Super-Resolution

Text-to-Video Generation

Video Generation

Video Super-Resolution

Datasets

UCF101

WebVid

LAION-5B

EvalCrafter Text-to-Video (ECTV) Dataset

Results from the Paper

Ranked #4 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)

Methods

BASE • Diffusion
