Video-based Generative Performance Benchmarking

13 papers with code • 1 benchmark • 1 dataset

The benchmark evaluates generative video conversational models on five key aspects:

  • Correctness of Information
  • Detail Orientation
  • Contextual Understanding
  • Temporal Understanding
  • Consistency

We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and human-annotated question-answer pairs. We develop an evaluation pipeline using GPT-3.5 that assigns each generated prediction a relative score from 1 to 5 for each aspect.
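For illustration, below is a minimal sketch of what one scoring call in such a GPT-3.5-based pipeline might look like, assuming the openai Python client. The prompt wording, the JSON response format, and the score_prediction helper are illustrative assumptions, not the benchmark's exact implementation.

```python
# A minimal sketch of one scoring call in a GPT-3.5-based evaluation pipeline,
# assuming the openai Python client (>= 1.0). The prompt wording, the JSON
# response format, and the helper name are illustrative, not the benchmark's
# exact implementation.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_prediction(question: str, answer: str, prediction: str) -> int:
    """Ask GPT-3.5 to rate a predicted answer against the ground truth
    on a 1-5 scale for a single evaluation aspect (here: correctness)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate the correctness of video question-answer predictions. "
                    "Compare the predicted answer with the ground-truth answer and reply "
                    'with a JSON object such as {"score": 3}, where the integer score '
                    "ranges from 1 (poor) to 5 (excellent)."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Ground-truth answer: {answer}\n"
                    f"Predicted answer: {prediction}"
                ),
            },
        ],
    )
    # The system prompt asks for JSON, so parse the score from the reply.
    return int(json.loads(response.choices[0].message.content)["score"])
```

In practice, a call like this would be issued per question-answer pair and per aspect, with the returned scores averaged over the test set to produce the benchmark numbers.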

Most implemented papers

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

pku-yuangroup/chat-univi 14 Nov 2023

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

dvlab-research/llama-vid 28 Nov 2023

Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to excessive visual tokens.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

opengvlab/ask-anything 28 Nov 2023

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

VideoChat: Chat-Centric Video Understanding

opengvlab/ask-anything 10 May 2023

In this paper, we initiate an attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

damo-nlp-sg/video-llama 5 Jun 2023

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

mbzuai-oryx/video-chatgpt 8 Jun 2023

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

rese1f/MovieChat 31 Jul 2023

Integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks.

One For All: Video Conversation is Feasible Without Video Instruction Tuning

farewellthree/BT-Adapter 27 Sep 2023

Without bells and whistles, BT-Adapter achieves state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours.

VTimeLLM: Empower LLM to Grasp Video Moments

huangb23/vtimellm 30 Nov 2023

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended to Video LLMs that handle video data and comprehend visual details.