Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

Significant advancements have been achieved in large-scale pre-trained text-to-video diffusion models (VDMs). However, previous methods rely either solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (15 GB vs. 72 GB of GPU memory during inference). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
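Concretely, inference is a two-stage cascade: a pixel-space VDM handles the semantics-critical low-resolution stage, and a latent-space VDM handles the memory-heavy super-resolution stage. The sketch below illustrates this flow only; the `pixel_vdm` and `latent_sr_vdm` callables and the example resolutions are hypothetical stand-ins, not the released API (see the repository above for the actual pipelines).

```python
# Minimal sketch of Show-1's cascaded text-to-video inference flow.
# `pixel_vdm` and `latent_sr_vdm` are hypothetical callables standing in
# for the released pipelines; tensor shapes are illustrative only.
import torch

def show1_generate(prompt: str, pixel_vdm, latent_sr_vdm) -> torch.Tensor:
    # Stage 1: the pixel-space VDM generates a low-resolution clip whose
    # content is strongly aligned with the text prompt.
    low_res = pixel_vdm(prompt)  # e.g. (num_frames, 3, H_low, W_low)

    # Stage 2: the latent-space VDM, adapted via the paper's expert
    # translation, upsamples the clip to high resolution at a fraction of
    # the GPU memory a pixel-space super-resolver would need.
    high_res = latent_sr_vdm(prompt, video=low_res)  # (num_frames, 3, H, W)
    return high_res
```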


Results from the Paper


Text-to-Video Generation on the EvalCrafter Text-to-Video (ECTV) Dataset (model: Show-1):

  Visual Quality           53.74    (global rank #2)
  Total Score              229      (global rank #4)
  Text-to-Video Alignment  62.07    (global rank #3)
  Temporal Consistency     60.83    (global rank #2)
  Motion Quality           52.19    (global rank #5)

Text-to-Video Generation on MSR-VTT (model: Show-1):

  FID      13.08    (global rank #4)
  CLIPSIM  0.3072   (global rank #3)
  FVD      538      (global rank #7)
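For reference, CLIPSIM on MSR-VTT is conventionally computed as the average CLIP text-image cosine similarity between the prompt and the frames of each generated video. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint choice is an assumption, and the benchmark's exact protocol (prompt set, frame sampling) is not captured by this snippet.

```python
# Sketch of the standard CLIPSIM metric: average CLIP text-image cosine
# similarity between the prompt and every frame of a generated video.
# The checkpoint is an assumption; benchmarks may use a different variant.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipsim(prompt: str, frames) -> float:
    """`frames` is a list of PIL images (the decoded video frames)."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # Normalize the projected embeddings, then average cosine similarity.
    text = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    imgs = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (imgs @ text.T).mean().item()
```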
