TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Dense Video Captioning	ActivityNet Captions	VTimeLLM	CIDEr	27.6	# 5
Dense Video Captioning	ActivityNet Captions	VTimeLLM	SODA	5.8	# 4
Video-based Generative Performance Benchmarking	VideoInstruct	VTimeLLM	Correctness of Information	2.78	# 10
Video-based Generative Performance Benchmarking	VideoInstruct	VTimeLLM	Detail Orientation	3.10	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	VTimeLLM	Contextual Understanding	3.40	# 10
Video-based Generative Performance Benchmarking	VideoInstruct	VTimeLLM	Temporal Understanding	2.49	# 8
Video-based Generative Performance Benchmarking	VideoInstruct	VTimeLLM	Consistency	2.47	# 10
Video-based Generative Performance Benchmarking	VideoInstruct	VTimeLLM	mean	2.85	# 10
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	VTimeLLM	gpt-score	3.10	# 2
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	VTimeLLM	gpt-score	3.40	# 6
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	VTimeLLM	gpt-score	2.78	# 6
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	VTimeLLM	gpt-score	2.49	# 5
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	VTimeLLM	gpt-score	2.47	# 6
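
The "mean" row for VideoInstruct appears to be the unweighted average of the five per-aspect scores. The page does not state this, so treat it as an assumption; the listed numbers are consistent with it. A minimal Python sketch of the arithmetic, with a hypothetical helper name:

```python
# Per-aspect VideoInstruct scores for VTimeLLM, copied from the table above.
scores = {
    "Correctness of Information": 2.78,
    "Detail Orientation": 3.10,
    "Contextual Understanding": 3.40,
    "Temporal Understanding": 2.49,
    "Consistency": 2.47,
}


def mean_score(aspect_scores):
    """Unweighted arithmetic mean of the per-aspect scores (assumed aggregation)."""
    return sum(aspect_scores.values()) / len(aspect_scores)


mean = mean_score(scores)
print(f"{mean:.2f}")  # prints 2.85, matching the "mean" row
```

The five scores sum to 14.24, and 14.24 / 5 = 2.848, which rounds to the reported 2.85.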

VTimeLLM: Empower LLM to Grasp Video Moments

30 Nov 2023 · Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundaries of specific events. In this paper, we solve this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. In addition, its fine-grained temporal understanding of videos enables VTimeLLM to outperform existing Video LLMs on a video dialogue benchmark, showing its superior cross-modal understanding and reasoning abilities.
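
The abstract does not say how the accuracy of a predicted start and end time is scored. A measure commonly used in temporal grounding work is the temporal intersection-over-union (IoU) between a predicted segment and a reference segment. The Python sketch below illustrates that general measure only; the function name and the example timestamps are hypothetical and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds.

    Illustrative helper, not the authors' evaluation code.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical segments: the prediction covers 10-30 s, the reference 20-40 s.
# They overlap for 10 s out of a 30 s union.
print(round(temporal_iou((10.0, 30.0), (20.0, 40.0)), 3))  # prints 0.333
```

A coarse whole-video description corresponds to a segment spanning the entire clip, which gives a low IoU for short events; this is the kind of gap the boundary-aware training is meant to close.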

Code

huangb23/vtimellm (official, 125 stars)

Tasks

Dense Video Captioning

Video-based Generative Performance Benchmarking

Video-based Generative Performance Benchmarking (Consistency)

Video-based Generative Performance Benchmarking (Contextual Understanding)

Video-based Generative Performance Benchmarking (Correctness of Information)

Video-based Generative Performance Benchmarking (Detail Orientation)

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video Captioning

Video Grounding

Datasets

ActivityNet Captions

Charades-STA

DiDeMo

WebVid

VideoInstruct

InternVid

Results from the Paper

Ranked #2 on Video-based Generative Performance Benchmarking (Detail Orientation) on VideoInstruct

Methods

ALIGN