Dense Video Captioning
24 papers with code • 4 benchmarks • 7 datasets
Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing all such events in a video, localizing each one in time.
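Concretely, a dense video captioning model maps an untrimmed video to a set of temporally localized captions. A minimal sketch of that output structure (names and values are illustrative, not from any specific paper):

```python
from dataclasses import dataclass

@dataclass
class DenseCaption:
    start: float   # event start time, in seconds
    end: float     # event end time, in seconds
    caption: str   # natural-language description of the event

# Illustrative output for the piano video above; events may overlap in time.
predictions = [
    DenseCaption(start=0.0, end=45.0, caption="a man is playing a piano"),
    DenseCaption(start=12.0, end=30.0, caption="another man is dancing"),
    DenseCaption(start=38.0, end=45.0, caption="a crowd is clapping"),
]
```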
Latest papers
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has received significant research attention.
Streaming Dense Video Captioning
An ideal model for dense video captioning -- predicting temporally localized captions in a video -- should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video.
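A hedged sketch of this streaming setting: frames arrive in fixed-size chunks, the model keeps a bounded running state, and captions are emitted at intermediate decode points rather than only after the full video (`model` and its methods are hypothetical placeholders):

```python
def stream_captions(frame_chunks, model, decode_every=8):
    """Yield (start, end, caption) triples while the video is still arriving."""
    memory = model.init_memory()
    for i, chunk in enumerate(frame_chunks):
        memory = model.update_memory(memory, chunk)  # bounded-size running state
        if (i + 1) % decode_every == 0:              # intermediate decode point
            yield from model.decode(memory)
    yield from model.decode(memory)                  # final decode after the last chunk
```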
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, and have been extended into Video LLMs that handle video data and comprehend visual details.
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
In this work, we introduce Vid2Seq, a multi-modal, single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
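The single-stage formulation means localization and captioning share one output sequence: timestamps are quantized into special time tokens that are interleaved with caption text. A rough sketch of building such a target sequence (the binning scheme and token names are assumptions for illustration, not Vid2Seq's exact tokenizer):

```python
def to_time_token(t, duration, n_bins=100):
    """Quantize a timestamp in seconds into one of `n_bins` special time tokens."""
    bin_id = min(int(t / duration * n_bins), n_bins - 1)
    return f"<time_{bin_id}>"

def build_target_sequence(events, duration):
    """Interleave start/end time tokens with caption text for each event in order."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [to_time_token(start, duration), to_time_token(end, duration), caption]
    return " ".join(parts)

# build_target_sequence([(12.0, 30.0, "another man is dancing")], 45.0)
# -> "<time_26> <time_66> another man is dancing"
```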
Event and Entity Extraction from Generated Video Captions
Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions.
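As a generic illustration of this post-hoc extraction idea (not the paper's specific pipeline), entities and simple subject-verb-object relations can be pulled from generated caption text with an off-the-shelf NLP library such as spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with tagger, parser, and NER

def extract_from_caption(caption):
    """Return (entities, relations) extracted from one generated caption."""
    doc = nlp(caption)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    relations = []
    for token in doc:
        if token.pos_ == "VERB":  # crude SVO extraction from the dependency parse
            subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
            relations += [(s, token.lemma_, o) for s in subjects for o in objects]
    return entities, relations

print(extract_from_caption("a man plays a piano while the crowd claps"))
# e.g. ([], [('man', 'play', 'piano')])  -- common nouns are not named entities
```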
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Dense video captioning aims to generate text descriptions for a series of events in an untrimmed video; it can be divided into two sub-tasks, event detection and event captioning.
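The two sub-tasks named here form the classic “detect then describe” pipeline. A minimal sketch with hypothetical detector and captioner components:

```python
def dense_caption(video, event_detector, event_captioner, score_threshold=0.5):
    """Two-stage dense video captioning: propose event segments, then caption each.

    `event_detector` and `event_captioner` are hypothetical models; the detector
    returns (start_frame, end_frame, confidence) proposals over the untrimmed video.
    """
    results = []
    for start, end, score in event_detector(video):
        if score < score_threshold:
            continue                    # drop low-confidence proposals
        clip = video[start:end]         # assumes an indexable frame container
        results.append((start, end, event_captioner(clip)))
    return results
```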