TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video-based Generative Performance Benchmarking	VideoInstruct	LITA-13B	Correctness of Information	2.94	# 8
Video-based Generative Performance Benchmarking	VideoInstruct	LITA-13B	Detail Orientation	2.98	# 7
Video-based Generative Performance Benchmarking	VideoInstruct	LITA-13B	Contextual Understanding	3.43	# 9
Video-based Generative Performance Benchmarking	VideoInstruct	LITA-13B	Temporal Understanding	2.68	# 4
Video-based Generative Performance Benchmarking	VideoInstruct	LITA-13B	Consistency	3.19	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	LITA-13B	mean	3.04	# 5
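
As a quick sanity check on the table above, the `mean` row appears to be the unweighted average of the five per-criterion scores. The sketch below assumes that reading; the dictionary name and layout are illustrative and not part of the benchmark.

```python
# Per-criterion scores for LITA-13B on VideoInstruct, copied from the table above.
lita_13b_scores = {
    "Correctness of Information": 2.94,
    "Detail Orientation": 2.98,
    "Contextual Understanding": 3.43,
    "Temporal Understanding": 2.68,
    "Consistency": 3.19,
}

# Assumption: "mean" is the plain arithmetic mean of the five criteria.
mean_score = sum(lita_13b_scores.values()) / len(lita_13b_scores)
print(f"{mean_score:.2f}")  # prints 3.04
```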

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lita-language-instructed-temporal/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=lita-language-instructed-temporal)`

LITA: Language Instructed Temporal-Localization Assistant

27 Mar 2024 · De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction-following capabilities. However, an important missing piece is temporal localization: these models cannot accurately answer "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and the temporal localization capabilities of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA
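
To make two ideas from the abstract concrete, here is a minimal sketch of (a) time tokens that encode a timestamp relative to the video length and (b) the temporal IoU that underlies the mIoU metric. It is an illustration under stated assumptions, not the authors' implementation: the uniform quantization, the default of 100 tokens, and all function names are hypothetical choices that the abstract does not specify.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def timestamp_to_token(t_sec: float, video_len_sec: float, num_tokens: int = 100) -> int:
    """Map an absolute timestamp to a 1-based relative time-token index.

    Assumption: timestamps are quantized uniformly over the video length,
    so the same token index means the same *fraction* of any video.
    """
    if video_len_sec <= 0:
        raise ValueError("video_len_sec must be positive")
    frac = min(max(t_sec / video_len_sec, 0.0), 1.0)  # clamp to [0, 1]
    return round(frac * (num_tokens - 1)) + 1


def token_to_timestamp(token: int, video_len_sec: float, num_tokens: int = 100) -> float:
    """Invert timestamp_to_token (up to quantization error)."""
    return video_len_sec * (token - 1) / (num_tokens - 1)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals on the same video."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and the same length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Second 30 of a 120-second video sits at 25% of the length.
tok = timestamp_to_token(30.0, 120.0)
print(tok)                                       # 26
print(round(token_to_timestamp(tok, 120.0), 2))  # 30.3
print(round(temporal_iou((10.0, 20.0), (15.0, 25.0)), 3))  # 0.333
```

Because the token index depends only on the fraction of the video elapsed, a fixed vocabulary of time tokens can cover videos of any duration, at the cost of a quantization error that grows with video length.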


Code


nvlabs/lita (official), 105 GitHub stars

Tasks


Instruction Following

Temporal Localization

Text Generation

Video-based Generative Performance Benchmarking

Datasets

ActivityNet

ActivityNet Captions

VideoInstruct

Results from the Paper


Ranked #5 on Video-based Generative Performance Benchmarking on VideoInstruct


