ModelScope Text-to-Video Technical Report

12 Aug 2023  ·  Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang ·

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.
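To illustrate the idea of a spatio-temporal block of the kind the abstract describes, the sketch below interleaves a per-frame spatial layer with a temporal layer that mixes information across frames; because the temporal layer is convolutional, the same module accepts any number of frames (including a single frame for image-text data). The class name, layer choices, and shapes are illustrative assumptions and are not taken from the ModelScopeT2V source.

```python
# Minimal PyTorch sketch of a spatio-temporal block (assumed structure, not the
# actual ModelScopeT2V implementation).
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial 2D convolution applied independently to every frame.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal 1D convolution applied independently at every spatial
        # location, mixing features along the frame axis. Being convolutional,
        # it works for any frame count (e.g. 1 frame for image-text batches).
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Spatial pass: fold frames into the batch dimension.
        y = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # Temporal pass: fold spatial positions into the batch dimension.
        y = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        y = self.temporal(y)
        y = y.reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        # Residual connection keeps the pretrained image backbone intact.
        return x + y


if __name__ == "__main__":
    block = SpatioTemporalBlock(channels=64)
    video = torch.randn(2, 16, 64, 32, 32)   # 16-frame clip
    single = torch.randn(2, 1, 64, 32, 32)   # degenerates to a single image
    print(block(video).shape, block(single).shape)
```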


Datasets

MSR-VTT

Task: Text-to-Video Generation
Dataset: MSR-VTT
Model: ModelScopeT2V

Metric     Value     Global Rank
FID        11.09     #3
CLIPSIM    0.2930    #8
FVD        550       #8
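For context on the CLIPSIM metric above, the sketch below computes a CLIPSIM-style score: the average CLIP cosine similarity between the text prompt and each generated frame. The CLIP checkpoint name and the averaging over frames are assumptions for illustration; the paper's exact evaluation protocol (CLIP variant, frame sampling) may differ.

```python
# Rough sketch of a CLIPSIM-style score using Hugging Face CLIP (assumed
# checkpoint; not necessarily the paper's evaluation setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clipsim(prompt: str, frames: list[Image.Image]) -> float:
    """Average cosine similarity between the prompt and every video frame."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalize embeddings so the dot product is a cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```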
