TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Text-to-Video Generation	MSR-VTT	Snap Video (512x288)	CLIPSIM	0.2793	# 10
Text-to-Video Generation	MSR-VTT	Snap Video (512x288)	FVD	104.0	# 1
Text-to-Video Generation	MSR-VTT	Snap Video (512x288)	CLIP-FID	9.35	# 2
Text-to-Video Generation	MSR-VTT	Snap Video (288×288)	CLIPSIM	0.2793	# 10
Text-to-Video Generation	MSR-VTT	Snap Video (288×288)	FVD	110.4	# 2
Text-to-Video Generation	MSR-VTT	Snap Video (288×288)	CLIP-FID	8.48	# 1
Text-to-Video Generation	UCF-101	Snap Video (Zero-shot, 512x288)	FVD16	200.2	# 2
Text-to-Video Generation	UCF-101	Snap Video (Zero-shot, 288×288)	FVD16	260.1	# 5
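
The metrics in the table above do not all point the same way: CLIPSIM is a similarity score where higher is better, while FVD, FVD16 and CLIP-FID are distances where lower is better. The following is a minimal sketch of how the two listed Snap Video variants could be compared per metric with that in mind. The names `ROWS`, `HIGHER_IS_BETTER` and `best_per_metric` are illustrative and not part of any published code, and the comparison covers only the rows listed here, not the global leaderboard.

```python
# Rows copied from the results table above: (dataset, model, metric, value).
ROWS = [
    ("MSR-VTT", "Snap Video (512x288)", "CLIPSIM", 0.2793),
    ("MSR-VTT", "Snap Video (512x288)", "FVD", 104.0),
    ("MSR-VTT", "Snap Video (512x288)", "CLIP-FID", 9.35),
    ("MSR-VTT", "Snap Video (288×288)", "CLIPSIM", 0.2793),
    ("MSR-VTT", "Snap Video (288×288)", "FVD", 110.4),
    ("MSR-VTT", "Snap Video (288×288)", "CLIP-FID", 8.48),
    ("UCF-101", "Snap Video (Zero-shot, 512x288)", "FVD16", 200.2),
    ("UCF-101", "Snap Video (Zero-shot, 288×288)", "FVD16", 260.1),
]

# Direction of each metric: True means a higher value is better.
HIGHER_IS_BETTER = {
    "CLIPSIM": True,
    "FVD": False,
    "FVD16": False,
    "CLIP-FID": False,
}


def best_per_metric(rows):
    """Return {(dataset, metric): (model, value)} for the best listed row.

    Ties keep the row that appears first in `rows`.
    """
    best = {}
    for dataset, model, metric, value in rows:
        key = (dataset, metric)
        # Flip the sign for lower-is-better metrics so that "larger wins".
        sign = 1 if HIGHER_IS_BETTER[metric] else -1
        if key not in best or sign * value > sign * best[key][1]:
            best[key] = (model, value)
    return best


if __name__ == "__main__":
    for (dataset, metric), (model, value) in best_per_metric(ROWS).items():
        print(f"{dataset:8s} {metric:9s} {model} ({value})")
```

On these rows, the 512x288 variant has the better FVD on MSR-VTT and the better FVD16 on UCF-101, while the 288×288 variant has the better CLIP-FID; the CLIPSIM values are identical.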

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/snap-video-scaled-spatiotemporal-transformers/text-to-video-generation-on-msr-vtt)](https://paperswithcode.com/sota/text-to-video-generation-on-msr-vtt?p=snap-video-scaled-spatiotemporal-transformers)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/snap-video-scaled-spatiotemporal-transformers/text-to-video-generation-on-ucf-101)](https://paperswithcode.com/sota/text-to-video-generation-on-ucf-101?p=snap-video-scaled-spatiotemporal-transformers)`

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

22 Feb 2024 · Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. User studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.
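
As a rough reading of the speed figures in the abstract, assume that "k times faster" means the same workload finishes in 1/k of the time. This is an interpretation, not a statement from the paper, and the symbols T_train and T_inf are introduced here only for illustration. Under that assumption, the cost relative to the U-Net baseline works out to:

```latex
\[
\frac{T_{\text{train}}^{\text{Snap Video}}}{T_{\text{train}}^{\text{U-Net}}}
  \approx \frac{1}{3.31} \approx 0.30,
\qquad
\frac{T_{\text{inf}}^{\text{Snap Video}}}{T_{\text{inf}}^{\text{U-Net}}}
  \approx \frac{1}{4.5} \approx 0.22
\]
```

That is, roughly 30% of the baseline training time and 22% of the baseline inference time.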

Code

No code implementations yet.

Tasks

Image Generation

Text-to-Video Generation

Video Generation

Datasets

UCF101

MSR-VTT

Results from the Paper

Ranked #1 on Text-to-Video Generation on MSR-VTT

Methods

Concatenated Skip Connection • Convolution • Max Pooling • ReLU • U-Net
