Video Question Answering

151 papers with code • 20 benchmarks • 31 datasets

Video Question Answering (VideoQA) is the task of answering natural language questions about a given video: the model receives a video and a question and must produce an accurate answer grounded in the video's content.
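
As a minimal, illustrative sketch of the task interface (the `VideoQAModel` protocol and the helpers below are hypothetical, not any specific library's API):

```python
# Minimal sketch of the VideoQA task interface (illustrative only; the
# VideoQAModel protocol and helpers are hypothetical, not a real library API).
from typing import List, Protocol

import numpy as np


class VideoQAModel(Protocol):
    def answer(self, frames: List[np.ndarray], question: str) -> str: ...


def uniform_sample_frames(video: np.ndarray, num_frames: int = 8) -> List[np.ndarray]:
    """Uniformly sample num_frames frames from a (T, H, W, C) video array."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return [video[i] for i in idx]


def video_qa(model: VideoQAModel, video: np.ndarray, question: str) -> str:
    frames = uniform_sample_frames(video)  # most VideoQA models operate on sampled frames
    return model.answer(frames, question)  # free-form or multiple-choice answer
```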


MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.


LongVLM: Efficient Long Video Understanding via Large Language Models

ziplab/longvlm 4 Apr 2024

In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos.
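As a rough illustration of combining local and global video information (my simplification, not LongVLM's actual architecture), one can pool each temporal segment into a "local" token and append a whole-video "global" token before handing the sequence to the LLM:

```python
# Illustrative sketch only (not LongVLM's code): build a short token sequence
# that mixes per-segment (local) summaries with a whole-video (global) summary.
import torch


def build_video_tokens(frame_feats: torch.Tensor, num_segments: int = 8) -> torch.Tensor:
    """frame_feats: (T, D) per-frame features -> (<= num_segments + 1, D) tokens."""
    segments = torch.chunk(frame_feats, num_segments, dim=0)
    local_tokens = torch.stack([seg.mean(dim=0) for seg in segments])  # one token per segment
    global_token = frame_feats.mean(dim=0, keepdim=True)               # whole-video summary
    return torch.cat([local_tokens, global_token], dim=0)              # passed to the LLM as visual tokens
```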


Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

riflezhang/llava-hound-dpo 1 Apr 2024

Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs).
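
For reference, the standard DPO objective (as introduced in the original DPO paper) fits in a few lines; this is a generic sketch, not the llava-hound-dpo implementation:

```python
# Generic sketch of the DPO loss (not the repository's exact code).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are per-example summed log-probs of full responses under the policy / reference models."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```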


An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

imagegridworth/IG-VLM 27 Mar 2024

Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging.
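
The title's central idea, tiling sampled frames into one composite image so a single-image VLM can answer zero-shot, can be sketched as follows (my own simplification; `frames_to_grid` is a hypothetical helper, not the IG-VLM code):

```python
# Hypothetical sketch: tile uniformly sampled frames into a single grid image
# that an off-the-shelf image VLM can consume (not the IG-VLM implementation).
from typing import List

import numpy as np


def frames_to_grid(frames: List[np.ndarray], rows: int = 2, cols: int = 3) -> np.ndarray:
    """frames: list of equally sized (H, W, C) arrays -> one (rows*H, cols*W, C) grid image."""
    assert len(frames) >= rows * cols, "need at least rows * cols frames"
    grid_rows = [np.concatenate(frames[r * cols:(r + 1) * cols], axis=1) for r in range(rows)]
    return np.concatenate(grid_rows, axis=0)
```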


OmniVid: A Generative Framework for Universal Video Understanding

wangjk666/omnivid 26 Mar 2024

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.


Elysium: Exploring Object-level Perception in Videos via MLLM

hon-wong/elysium 25 Mar 2024

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied.


vid-TLDR: Training Free Token merging for Light-weight Video Transformer

mlvlab/vid-tldr 20 Mar 2024

To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training.
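
A heavily simplified illustration of the training-free idea (not vid-TLDR's actual saliency measure or merging schedule): score tokens by a saliency signal such as the attention they receive, keep the most salient ones, and collapse the remainder into a single averaged background token.

```python
# Simplified sketch of training-free background-token merging (illustrative only).
import torch


def merge_background_tokens(tokens: torch.Tensor, saliency: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D) patch tokens, saliency: (N,) scores, keep: number of tokens kept as-is."""
    if keep >= tokens.size(0):
        return tokens
    order = saliency.argsort(descending=True)
    foreground = tokens[order[:keep]]                            # salient tokens are preserved
    background = tokens[order[keep:]].mean(dim=0, keepdim=True)  # the rest merge into one token
    return torch.cat([foreground, background], dim=0)
```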


HawkEye: Training Video-Text LLMs for Grounding Text in Videos

yellow-binary-tree/hawkeye 15 Mar 2024

Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.


DAM: Dynamic Adapter Merging for Continual Video QA Learning

klauscc/dam 13 Mar 2024

Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
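
The title points at merging dataset-specific adapters at inference time; as a hypothetical illustration of parameter-space adapter merging (not DAM's actual router or merging procedure), adapter weights can be combined with per-adapter mixing probabilities:

```python
# Hypothetical sketch of weighted adapter merging (not DAM's implementation).
from typing import Dict, List

import torch


def merge_adapters(adapters: List[Dict[str, torch.Tensor]], probs: torch.Tensor) -> Dict[str, torch.Tensor]:
    """adapters: one state_dict per dataset-specific adapter; probs: (K,) mixing weights summing to 1."""
    merged = {}
    for name in adapters[0]:
        merged[name] = sum(p * a[name] for p, a in zip(probs, adapters))
    return merged
```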


LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

bigai-nlco/lstp-chat 25 Feb 2024

Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time.
