TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Generation	ImageNet 256x256	MAGVIT-v2 (w/o guidance)	FID	3.65	# 21
Image Generation	ImageNet 256x256	MAGVIT-v2	FID	1.78	# 5
Image Generation	ImageNet 512x512	MAGVIT-v2 (w/o guidance)	FID	3.07	# 15
Image Generation	ImageNet 512x512	MAGVIT-v2 (w/o guidance)	Inception score	213.1	# 9
Image Generation	ImageNet 512x512	MAGVIT-v2	FID	1.91	# 4
Image Generation	ImageNet 512x512	MAGVIT-v2	Inception score	324.3	# 3
Video Prediction	Kinetics-600 12 frames, 64x64	MAGVIT-v2	FVD	4.3±0.1	# 2
Video Generation	Kinetics-600 12 frames, 64x64	MAGVIT-v2	FVD	4.3±0.1	# 2
Video Generation	UCF-101	MAGVIT-v2	FVD16	58±3	# 2

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

9 Oct 2023 · Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
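To make the idea of a tokenizer that "maps pixel-space inputs to discrete tokens" concrete, here is a minimal toy sketch of generic nearest-neighbor vector quantization. It is not MAGVIT-v2's tokenizer, and the function names (`tokenize`, `detokenize`), the 2-D codebook, and the feature values are all hypothetical illustrations. In a real system the feature vectors would come from a learned encoder and the codebook would be learned as well.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def detokenize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token (a lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 4-entry codebook in 2-D and three patch features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = tokenize(features, codebook)
reconstruction = detokenize(tokens, codebook)
print(tokens)  # integer token ids, one per feature vector
```

The resulting integer sequence is the kind of input a language model can be trained on with a fixed vocabulary, here of size 4. The reconstruction step shows why tokenizer quality matters: whatever the codebook cannot represent is lost before generation begins.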

Code

No code implementations yet.

Tasks

Action Recognition

Image Generation

Language Modelling

Video Compression

Video Generation

Video Prediction

Datasets

ImageNet

UCF101

Kinetics

Kinetics-600

Results from the Paper

Ranked #2 on Video Prediction on Kinetics-600 12 frames, 64x64

Methods

Diffusion
