Dense Video Captioning
25 papers with code • 4 benchmarks • 7 datasets
Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
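As a rough illustration of what a dense-captioning system outputs, the sketch below pairs temporal segments with captions for the piano example above. The data structure and values are hypothetical, not tied to any particular dataset or method.

```python
from dataclasses import dataclass

@dataclass
class DenseCaption:
    """One localized event: a temporal segment plus its description."""
    start_sec: float  # event start time in seconds (hypothetical values)
    end_sec: float    # event end time in seconds
    caption: str      # natural-language description of the event

# A dense-captioning model returns several overlapping or adjacent events
# for a single untrimmed video, not just one sentence for the whole clip.
predictions = [
    DenseCaption(0.0, 45.0, "a man is playing a piano"),
    DenseCaption(12.0, 30.0, "another man starts dancing next to him"),
    DenseCaption(40.0, 45.0, "a crowd claps at the end of the performance"),
]

for event in predictions:
    print(f"[{event.start_sec:6.1f}s - {event.end_sec:6.1f}s] {event.caption}")
```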
Most implemented papers
Streamlined Dense Video Captioning
Dense video captioning is an extremely challenging task, since accurate and coherent description of events in a video requires a holistic understanding of the video content as well as contextual reasoning about individual events.
Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Multimodal Pretraining for Dense Video Captioning
First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning.
Global Object Proposals for Improving Multi-Sentence Video Descriptions
Recently, many works have been proposed on the generation of multi-sentence video descriptions.
Dense Video Captioning Using Unsupervised Semantic Information
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events (e.g., minutes) can be decomposed into simpler events (e.g., a few seconds), and that these simple events are shared across several complex events.
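One plausible reading of that premise, sketched below with placeholder data, is to split long videos into short clips and cluster their features so that clips from different complex events map onto a shared vocabulary of simple events. This is an illustrative assumption, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical setup: each short clip (a few seconds) is represented by a
# feature vector; clustering the clips yields shared "simple events"
# without any event labels.
rng = np.random.default_rng(0)
clip_features = rng.normal(size=(500, 128))  # 500 short clips, 128-d features

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(clip_features)
simple_event_ids = kmeans.labels_  # each clip is assigned a shared simple event

# A complex event (minutes long) can then be summarized as the sequence of
# simple-event ids of its constituent clips.
complex_event = simple_event_ids[:40]
print(complex_event)
```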
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Dense video captioning aims to generate corresponding text descriptions for a series of events in the untrimmed video, which can be divided into two sub-tasks, event detection and event captioning.
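To make the two sub-tasks concrete, here is a minimal sketch of the conventional two-stage decomposition the abstract refers to: event detection proposes temporal segments, and event captioning describes each proposed segment. The function names and the fixed-chunk proposal logic are placeholders, not the paper's interface.

```python
from typing import List, Tuple

def detect_events(video_features: List[List[float]]) -> List[Tuple[int, int]]:
    """Event detection: propose (start_frame, end_frame) segments.

    Placeholder logic: split the video into fixed-length chunks; real
    detectors score and rank learned temporal proposals instead.
    """
    chunk = 64
    n = len(video_features)
    return [(s, min(s + chunk, n)) for s in range(0, n, chunk)]

def caption_event(video_features: List[List[float]], segment: Tuple[int, int]) -> str:
    """Event captioning: generate a sentence for one proposed segment.

    Placeholder logic: real models decode text conditioned on the
    segment's features (and often on surrounding context).
    """
    start, end = segment
    return f"<caption for frames {start}-{end}>"

def dense_video_captioning(video_features: List[List[float]]) -> List[Tuple[Tuple[int, int], str]]:
    """Run detection first, then caption each detected event."""
    segments = detect_events(video_features)
    return [(seg, caption_event(video_features, seg)) for seg in segments]
```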
Event and Entity Extraction from Generated Video Captions
Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions.
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.