Dense Video Captioning
24 papers with code • 4 benchmarks • 7 datasets
Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing all such events in a video, localizing each one in time.
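Concretely, a dense video captioning model maps an untrimmed video to a set of temporally localized captions. A minimal sketch of that output structure (names and values are illustrative, not from any specific paper):

```python
from dataclasses import dataclass

@dataclass
class DenseCaption:
    start: float   # event start time, in seconds
    end: float     # event end time, in seconds
    caption: str   # natural-language description of the event

# Illustrative output for the piano video above; events may overlap in time.
predictions = [
    DenseCaption(start=0.0, end=45.0, caption="a man is playing a piano"),
    DenseCaption(start=12.0, end=30.0, caption="another man is dancing"),
    DenseCaption(start=38.0, end=45.0, caption="a crowd is clapping"),
]
```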
Latest papers
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems.
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Dense video captioning, which aims to automatically localize and caption all events within an untrimmed video, has received significant research attention.
Streaming Dense Video Captioning
An ideal model for dense video captioning -- predicting temporally localized captions in a video -- should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video.
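A hedged sketch of this streaming setting: frames arrive in fixed-size chunks, the model keeps a bounded running state, and captions are emitted at intermediate decode points rather than only after the full video (`model` and its methods are hypothetical placeholders):

```python
def stream_captions(frame_chunks, model, decode_every=8):
    """Yield (start, end, caption) triples while the video is still arriving."""
    memory = model.init_memory()
    for i, chunk in enumerate(frame_chunks):
        memory = model.update_memory(memory, chunk)  # bounded-size running state
        if (i + 1) % decode_every == 0:              # intermediate decode point
            yield from model.decode(memory)
    yield from model.decode(memory)                  # final decode after the last chunk
```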
OmniVid: A Generative Framework for Universal Video Understanding
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, and have been extended into Video LLMs that handle video data and comprehend visual details.
SoccerNet 2023 Challenges Results
More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org.
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
In this work, we introduce Vid2Seq, a multi-modal, single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
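The single-stage formulation means localization and captioning share one output sequence: timestamps are quantized into special time tokens that are interleaved with caption text. A rough sketch of building such a target sequence (the binning scheme and token names are assumptions for illustration, not Vid2Seq's exact tokenizer):

```python
def to_time_token(t, duration, n_bins=100):
    """Quantize a timestamp in seconds into one of `n_bins` special time tokens."""
    bin_id = min(int(t / duration * n_bins), n_bins - 1)
    return f"<time_{bin_id}>"

def build_target_sequence(events, duration):
    """Interleave start/end time tokens with caption text for each event in order."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [to_time_token(start, duration), to_time_token(end, duration), caption]
    return " ".join(parts)

# build_target_sequence([(12.0, 30.0, "another man is dancing")], 45.0)
# -> "<time_26> <time_66> another man is dancing"
```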
Event and Entity Extraction from Generated Video Captions
Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions.
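As a generic illustration of this post-hoc extraction idea (not the paper's specific pipeline), entities and simple subject-verb-object relations can be pulled from generated caption text with an off-the-shelf NLP library such as spaCy:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with tagger, parser, and NER

def extract_from_caption(caption):
    """Return (entities, relations) extracted from one generated caption."""
    doc = nlp(caption)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    relations = []
    for token in doc:
        if token.pos_ == "VERB":  # crude SVO extraction from the dependency parse
            subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
            relations += [(s, token.lemma_, o) for s in subjects for o in objects]
    return entities, relations

print(extract_from_caption("a man plays a piano while the crowd claps"))
# e.g. ([], [('man', 'play', 'piano')])  -- common nouns are not named entities
```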
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Dense video captioning aims to generate text descriptions for a series of events in an untrimmed video; it can be divided into two sub-tasks, event detection and event captioning.
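The two sub-tasks named here form the classic “detect then describe” pipeline. A minimal sketch with hypothetical detector and captioner components:

```python
def dense_caption(video, event_detector, event_captioner, score_threshold=0.5):
    """Two-stage dense video captioning: propose event segments, then caption each.

    `event_detector` and `event_captioner` are hypothetical models; the detector
    returns (start_frame, end_frame, confidence) proposals over the untrimmed video.
    """
    results = []
    for start, end, score in event_detector(video):
        if score < score_threshold:
            continue                    # drop low-confidence proposals
        clip = video[start:end]         # assumes an indexable frame container
        results.append((start, end, event_captioner(clip)))
    return results
```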