TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK	REMOVE
Zero-Shot Video Question Answer	MSRVTT-QA	Vista-LLaMA-7B	Accuracy	60.5	# 6
Zero-Shot Video Question Answer	MSRVTT-QA	Vista-LLaMA-7B	Confidence Score	3.3	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vista-llama-reliable-video-narrator-via-equal/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=vista-llama-reliable-video-narrator-via-equal)`

Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens

12 Dec 2023 · Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang ·

Recent advances in large video-language models have displayed promising outcomes in video comprehension. Current approaches straightforwardly convert video into language tokens and employ large language models for multi-modal tasks. However, this method often leads to the generation of irrelevant content, commonly known as "hallucination", as the length of the text increases and the impact of the video diminishes. To address this problem, we propose Vista-LLaMA, a novel framework that maintains the consistent distance between all visual tokens and any language tokens, irrespective of the generated text length. Vista-LLaMA omits relative position encoding when determining attention weights between visual and text tokens, retaining the position encoding for text and text tokens. This amplifies the effect of visual tokens on text generation, especially when the relative distance is longer between visual and text tokens. The proposed attention mechanism significantly reduces the chance of producing irrelevant text related to the video content. Furthermore, we present a sequential visual projector that projects the current video frame into tokens of language space with the assistance of the previous frame. This approach not only captures the temporal relationship within the video, but also allows less visual tokens to encompass the entire video. Our approach significantly outperforms various previous methods (e.g., Video-ChatGPT, MovieChat) on four challenging open-ended video question answering benchmarks. We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot MSRVTT-QA, setting a new state-of-the-art performance. This project is available at https://jinxxian.github.io/Vista-LLaMA.

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Hallucination

Position

Question Answering

Text Generation

Video Question Answering

Zero-Shot Video Question Answer

Datasets

ActivityNet

MSR-VTT

NExT-QA MSRVTT-QA

Results from the Paper

Add Remove

Ranked #6 on Zero-Shot Video Question Answer on MSRVTT-QA

Get a GitHub badge

Task	Dataset	Model	Metric Name	Metric Value	Global Rank	Benchmark
Zero-Shot Video Question Answer	MSRVTT-QA	Vista-LLaMA-7B	Accuracy	60.5	# 6	Compare
Zero-Shot Video Question Answer	MSRVTT-QA	Vista-LLaMA-7B	Confidence Score	3.3	# 6	Compare

Methods

Add Remove

No methods listed for this paper. Add relevant methods here

Edit Social Preview

Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit Add Remove

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Add Remove

Methods

Add Remove