Video-based Generative Performance Benchmarking

13 papers with code • 1 benchmark • 1 dataset

The benchmark evaluates generative video conversational models on five key aspects:

  • Correctness of Information
  • Detail Orientation
  • Contextual Understanding
  • Temporal Understanding
  • Consistency

We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and human-annotated question-answer pairs. We develop an evaluation pipeline using GPT-3.5 that assigns each generated prediction a relative score from 1 to 5 for each aspect.
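For illustration, below is a minimal sketch of what one scoring call in such a GPT-3.5-based pipeline might look like, assuming the openai Python client. The prompt wording, the JSON response format, and the score_prediction helper are illustrative assumptions, not the benchmark's exact implementation.

```python
# A minimal sketch of one scoring call in a GPT-3.5-based evaluation pipeline,
# assuming the openai Python client (>= 1.0). The prompt wording, the JSON
# response format, and the helper name are illustrative, not the benchmark's
# exact implementation.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_prediction(question: str, answer: str, prediction: str) -> int:
    """Ask GPT-3.5 to rate a predicted answer against the ground truth
    on a 1-5 scale for a single evaluation aspect (here: correctness)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate the correctness of video question-answer predictions. "
                    "Compare the predicted answer with the ground-truth answer and reply "
                    'with a JSON object such as {"score": 3}, where the integer score '
                    "ranges from 1 (poor) to 5 (excellent)."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Ground-truth answer: {answer}\n"
                    f"Predicted answer: {prediction}"
                ),
            },
        ],
    )
    # The system prompt asks for JSON, so parse the score from the reply.
    return int(json.loads(response.choices[0].message.content)["score"])
```

In practice, a call like this would be issued per question-answer pair and per aspect, with the returned scores averaged over the test set to produce the benchmark numbers.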

Most implemented papers

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

pku-yuangroup/chat-univi 14 Nov 2023

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

dvlab-research/llama-vid 28 Nov 2023

Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to excessive visual tokens.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

opengvlab/ask-anything 28 Nov 2023

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

VideoChat: Chat-Centric Video Understanding

opengvlab/ask-anything 10 May 2023

In this paper, we initiate an attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

damo-nlp-sg/video-llama 5 Jun 2023

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

mbzuai-oryx/video-chatgpt 8 Jun 2023

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

rese1f/MovieChat 31 Jul 2023

Integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks.

One For All: Video Conversation is Feasible Without Video Instruction Tuning

farewellthree/BT-Adapter 27 Sep 2023

Without bells and whistles, BT-Adapter achieves state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours.

VTimeLLM: Empower LLM to Grasp Video Moments

huangb23/vtimellm 30 Nov 2023

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended to Video LLMs that handle video data and comprehend visual details.