TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	ActivityNet-QA	ST-LLM	Confidence Score	3.3	# 5
Zero-Shot Video Question Answer	ActivityNet-QA	ST-LLM	Accuracy	50.9	# 3
Zero-Shot Video Question Answer	MSRVTT-QA	ST-LLM	Accuracy	63.2	# 4
Zero-Shot Video Question Answer	MSRVTT-QA	ST-LLM	Confidence Score	3.4	# 5
Zero-Shot Video Question Answer	MSVD-QA	ST-LLM	Accuracy	74.6	# 5
Zero-Shot Video Question Answer	MSVD-QA	ST-LLM	Confidence Score	3.9	# 3
Video Question Answering	MVBench	ST-LLM	Avg.	54.9	# 2
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	ST-LLM	gpt-score	3.74	# 2
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	ST-LLM	gpt-score	2.93	# 1
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	ST-LLM	gpt-score	3.05	# 3
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	ST-LLM	gpt-score	3.23	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	ST-LLM	Correctness of Information	3.23	# 3
Video-based Generative Performance Benchmarking	VideoInstruct	ST-LLM	Detail Orientation	3.05	# 4
Video-based Generative Performance Benchmarking	VideoInstruct	ST-LLM	Contextual Understanding	3.74	# 2
Video-based Generative Performance Benchmarking	VideoInstruct	ST-LLM	Temporal Understanding	2.93	# 1
Video-based Generative Performance Benchmarking	VideoInstruct	ST-LLM	Consistency	2.81	# 5
Video-based Generative Performance Benchmarking	VideoInstruct	ST-LLM	mean	3.15	# 3
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	ST-LLM	gpt-score	2.81	# 2
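
As a quick arithmetic check on the VideoInstruct rows above, the reported `mean` of 3.15 matches the unweighted average of the five per-aspect scores. The sketch below assumes a plain arithmetic mean; the table itself does not say how the mean was computed.

```python
# Per-aspect VideoInstruct scores for ST-LLM, copied from the table above.
scores = {
    "Correctness of Information": 3.23,
    "Detail Orientation": 3.05,
    "Contextual Understanding": 3.74,
    "Temporal Understanding": 2.93,
    "Consistency": 2.81,
}

# Assumption: the "mean" row is an unweighted arithmetic mean of the five aspects.
mean_score = sum(scores.values()) / len(scores)
print(f"{mean_score:.2f}")  # prints 3.15
```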

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-question-answering-on-mvbench)](https://paperswithcode.com/sota/video-question-answering-on-mvbench?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=st-llm-large-language-models-are-effective-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/st-llm-large-language-models-are-effective-1/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=st-llm-large-language-models-are-effective-1)`
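
The badge snippets above all follow one URL pattern, so they can be generated from a paper slug and a task slug. The helper below is a hypothetical convenience function inferred only from the snippets listed here, not an official Papers with Code API.

```python
def pwc_badge(paper_slug: str, task_slug: str) -> str:
    """Build a badge markdown snippet in the same shape as the ones listed above.

    Hypothetical helper: the URL layout is inferred from the existing snippets.
    """
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{task_slug}"
    )
    link_url = f"https://paperswithcode.com/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


# Reproduces the MVBench badge from the list above.
mvbench_badge = pwc_badge(
    "st-llm-large-language-models-are-effective-1",
    "video-question-answering-on-mvbench",
)
print(mvbench_badge)
```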

ST-LLM: Large Language Models Are Effective Temporal Learners

30 Mar 2024 · Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem. In this paper, we investigate a straightforward yet unexplored question: can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLM? Surprisingly, this simple approach yields significant improvements in video understanding. Based upon this, we propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside the LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we have also designed a global-local input module to balance efficiency and effectiveness. Consequently, we harness the LLM for proficient spatial-temporal modeling while upholding efficiency and stability. Extensive experimental results attest to the effectiveness of our method. Through a more concise model and training pipeline, ST-LLM establishes a new state-of-the-art result on VideoChatGPT-Bench and MVBench. Code is available at https://github.com/TencentARC/ST-LLM.
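
To make the idea of feeding all spatial-temporal tokens to the LLM with a dynamic mask more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract gives no details, so the frame-major flattening order, the mask-rate range (`min_rate`, `max_rate`), and the choice to drop masked tokens from the input are all assumptions made for illustration.

```python
import random


def flatten_video_tokens(frames):
    """Flatten T frames of N tokens each into one frame-major token sequence."""
    return [tok for frame in frames for tok in frame]


def dynamic_mask(tokens, min_rate=0.3, max_rate=0.7, rng=None):
    """Randomly mask a fraction of tokens, with the fraction resampled per call.

    Assumptions (not from the paper): the rate range and the choice to drop
    masked tokens rather than replace them with a placeholder.
    Returns the kept tokens and the sorted indices of the masked ones.
    """
    rng = rng or random.Random()
    rate = rng.uniform(min_rate, max_rate)  # "dynamic": a new rate each call
    num_masked = int(len(tokens) * rate)
    masked_idx = set(rng.sample(range(len(tokens)), num_masked))
    kept = [tok for i, tok in enumerate(tokens) if i not in masked_idx]
    return kept, sorted(masked_idx)


# Toy video: 3 frames x 4 spatial tokens, each token labelled (frame, position).
frames = [[(t, n) for n in range(4)] for t in range(3)]
tokens = flatten_video_tokens(frames)
kept, masked = dynamic_mask(tokens, rng=random.Random(0))
print(len(tokens), len(kept), len(masked))
```

Dropping tokens shortens the sequence the LLM sees during training, which is one plausible way such a mask could reduce overhead. The paper's training objectives for the masked positions are not reproduced here.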

Code

TencentARC/ST-LLM official

Tasks

Reading Comprehension

Video Understanding

Datasets

ActivityNet

ActivityNet-QA MSRVTT-QA MSVD-QA VideoInstruct MVBench

Results from the Paper

Ranked #1 on Video-based Generative Performance Benchmarking (Temporal Understanding) on VideoInstruct

