Video-based Generative Performance Benchmarking (Contextual Understanding)

11 papers with code • 1 benchmark • 1 dataset

The benchmark evaluates generative video conversational models with respect to Contextual Understanding.

We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. We develop an evaluation pipeline using the GPT-3.5 model that assigns a relative score to the generated predictions on a scale of 1-5.
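As an illustration, below is a minimal sketch of what one GPT-3.5-graded scoring step could look like. The prompt wording, JSON output format, and function name are assumptions for illustration, not the benchmark's released evaluation scripts; it assumes the OpenAI Python client and an `OPENAI_API_KEY` in the environment.

```python
# Hedged sketch of a GPT-3.5-based contextual-understanding scorer (1-5 scale).
# The prompt text and JSON parsing are illustrative assumptions, not the
# benchmark's official evaluation pipeline.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_contextual_understanding(question: str, answer: str, prediction: str) -> dict:
    """Ask GPT-3.5 to rate how well a predicted answer matches the video context."""
    system = (
        "You evaluate the contextual understanding of video-based conversational models. "
        "Compare the predicted answer with the correct answer and rate how well the "
        "prediction aligns with the overall context of the video content."
    )
    user = (
        f"Question: {question}\n"
        f"Correct Answer: {answer}\n"
        f"Predicted Answer: {prediction}\n"
        "Return a JSON object with keys 'score' (an integer from 1 to 5) and 'reason'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0,
    )
    # Assumes the model follows the instruction to return JSON.
    return json.loads(response.choices[0].message.content)


# Example usage: returns something like {"score": 4, "reason": "..."}
# result = score_contextual_understanding(q, gt_answer, model_prediction)
```

Per-question scores from such a scorer would then be averaged over the test set to produce the benchmark's relative score for each model.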

Most implemented papers

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

PLLay: Efficient Topological Layer based on Persistence Landscapes

jisuk1/pllay NeurIPS 2020

We propose PLLay, a novel topological layer for general deep learning models based on persistence landscapes, in which we can efficiently exploit the underlying topological features of the input data structure.

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

pku-yuangroup/chat-univi 14 Nov 2023

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.

VideoChat: Chat-Centric Video Understanding

opengvlab/ask-anything 10 May 2023

In this paper, we make an initial attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

damo-nlp-sg/video-llama 5 Jun 2023

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) to understand both the visual and auditory content of a video.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

mbzuai-oryx/video-chatgpt 8 Jun 2023

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

rese1f/MovieChat 31 Jul 2023

Recently, integrating video foundation models with large language models to build video understanding systems has emerged as a way to overcome the limitations of specific pre-defined vision tasks.

One For All: Video Conversation is Feasible Without Video Instruction Tuning

farewellthree/BT-Adapter 27 Sep 2023

Without bells and whistles, BT-Adapter achieves state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

opengvlab/ask-anything 28 Nov 2023

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

VTimeLLM: Empower LLM to Grasp Video Moments

huangb23/vtimellm 30 Nov 2023

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.