A Simple LLM Framework for Long-Range Video Question-Answering

28 Dec 2023 · Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling designs (e.g., memory queues, state-space layers), our approach couples a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) with a Large Language Model (e.g., GPT-3.5, GPT-4), yielding a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First, a short-term visual captioner generates textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Then, an LLM aggregates the densely extracted short-term captions to perform the long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate its components. Our empirical analysis reveals that the choice of visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM to first summarize the noisy short-term visual captions and then answer the given question leads to a significant LVQA performance boost. On EgoSchema, a benchmark for very long-form video question-answering, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). Our approach also outperforms the previous state-of-the-art by 4.1% and 3.1% on NExT-QA and IntentQA, respectively. Finally, we extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NExT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.
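The two-stage design is simple enough to sketch end to end. The snippet below is a minimal illustration of the caption-then-reason flow, not the released implementation: caption_clip and query_llm are hypothetical placeholders for a short-term captioner (e.g., LaViLa) and an LLM API, and the prompts only approximate the summarize-then-answer strategy described above.

```python
# Minimal sketch of the two-stage LLoVi pipeline (illustrative only).
# `caption_clip` and `query_llm` are hypothetical callables supplied by
# the caller: a short-term clip captioner and a text-in/text-out LLM.

def llovi_answer(video_clips, question, caption_clip, query_llm):
    # Stage 1: densely caption each short clip (0.5-8s) sampled from the
    # long input video, keeping the captions in chronological order.
    captions = [f"{i + 1}. {caption_clip(clip)}"
                for i, clip in enumerate(video_clips)]
    caption_text = "\n".join(captions)

    # Stage 2a: have the LLM summarize the noisy short-term captions
    # into a coherent description of the whole video.
    summary = query_llm(
        "The following are chronological captions of short clips from a "
        "single long video:\n"
        f"{caption_text}\n"
        "Summarize what happens in the video."
    )

    # Stage 2b: answer the question from the summary (the specialized
    # summarize-then-answer prompt the paper reports as a large boost).
    return query_llm(
        f"Video summary: {summary}\n"
        f"Question: {question}\n"
        "Answer the question based on the summary."
    )
```

Because the captioner and LLM enter only through these two callables, swapping them corresponds directly to the component ablations the paper identifies as the main drivers of LVQA accuracy.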

Task                             Dataset              Model            Metric    Value  Global Rank
Zero-Shot Video Question Answer  EgoSchema (fullset)  LLoVi (7B)       Accuracy   33.5  #6
Zero-Shot Video Question Answer  EgoSchema (fullset)  LLoVi (GPT-3.5)  Accuracy   50.3  #1
Zero-Shot Video Question Answer  EgoSchema (subset)   LLoVi (7B)       Accuracy   50.8  #3
Zero-Shot Video Question Answer  EgoSchema (subset)   LLoVi (GPT-3.5)  Accuracy   57.6  #2
Zero-Shot Video Question Answer  IntentQA             LLoVi (GPT-4)    Accuracy   64.0  #2
Zero-Shot Video Question Answer  IntentQA             LLoVi (7B)       Accuracy   53.6  #5
Zero-Shot Video Question Answer  NExT-GQA             LLoVi (GPT-4)    Acc@GQA    24.3  #1
Zero-Shot Video Question Answer  NExT-GQA             LLoVi (7B)       Acc@GQA    11.2  #3
Zero-Shot Video Question Answer  NExT-QA              LLoVi (7B)       Accuracy   54.3  #11
Zero-Shot Video Question Answer  NExT-QA              LLoVi (GPT-4)    Accuracy   67.7  #4
