Dense Video Captioning

25 papers with code • 4 benchmarks • 7 datasets

Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
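
To make the task output concrete, here is a minimal sketch of how dense-captioning predictions are commonly represented: a list of temporal segments, each paired with a description. The record type, field names, and timestamps below are illustrative assumptions, not taken from any particular dataset or paper.

```python
from dataclasses import dataclass

@dataclass
class DenseCaption:
    """One detected event: a temporal segment plus its description."""
    start_sec: float  # event start time, in seconds
    end_sec: float    # event end time, in seconds
    caption: str      # natural-language description of the event

# A dense-captioning model maps one video to many such records,
# e.g. for the piano example above (times are made up for illustration):
predictions = [
    DenseCaption(0.0, 45.0, "a man is playing a piano"),
    DenseCaption(12.0, 30.0, "another man is dancing"),
    DenseCaption(38.0, 45.0, "a crowd is clapping"),
]
```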

Most implemented papers

Streamlined Dense Video Captioning

ttengwang/ESGN CVPR 2019

Dense video captioning is an extremely challenging task, since an accurate and coherent description of the events in a video requires a holistic understanding of the video content as well as contextual reasoning about individual events.

Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020

ttengwang/dense-video-captioning-pytorch 21 Jun 2020

This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.

Multimodal Pretraining for Dense Video Captioning

google-research-datasets/Video-Timeline-Tags-ViTT Asian Chapter of the Association for Computational Linguistics 2020

First, we construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT), featuring a variety of instructional videos together with time-stamped annotations.

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

HumamAlwassel/TSP 23 Nov 2020

Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning.

Dense Video Captioning Using Unsupervised Semantic Information

valterlej/dvcusi 15 Dec 2021

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events (e.g., minutes long) can be decomposed into simpler events (e.g., a few seconds long), and that these simple events are shared across several complex events.
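
The premise lends itself to a simple illustration: pool short-clip features from many videos and cluster them, so that each cluster id acts as a reusable "simple event" shared across complex events. The sketch below uses random features and k-means purely to illustrate this decomposition; the feature dimension, cluster count, and clustering choice are assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical setup: each long video is already encoded as a sequence of
# short-clip features (here, one 512-d vector per ~2-second clip).
rng = np.random.default_rng(0)
clip_features = rng.normal(size=(3000, 512))  # clips pooled from many videos

# Cluster the short clips; each cluster id acts as a "simple event" token
# that can recur across different complex events.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
simple_event_ids = kmeans.fit_predict(clip_features)

# A complex (minutes-long) event is then representable as the sequence of
# simple-event ids of its clips, e.g. the first ten clips of one video:
print(simple_event_ids[:10])
```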

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

qiqang/uedvc 18 Jul 2022

Dense video captioning aims to generate corresponding text descriptions for a series of events in an untrimmed video, and it can be divided into two sub-tasks: event detection and event captioning.
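
A toy detect-then-describe pipeline makes this two-sub-task split explicit. Everything below is a placeholder sketch: the thresholding "detector" and string-formatting "captioner" stand in for the learned models a real system would use.

```python
from typing import List, Tuple
import numpy as np

def detect_events(features: np.ndarray, fps: float = 1.0) -> List[Tuple[float, float]]:
    # Toy detector: treat maximal runs of above-average activity as events.
    score = features.mean(axis=1)
    active = score > score.mean()
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start / fps, t / fps))
            start = None
    if start is not None:
        events.append((start / fps, len(active) / fps))
    return events

def caption_event(features: np.ndarray, seg: Tuple[float, float]) -> str:
    # Placeholder captioner: a real model would decode text from the segment.
    return f"event from {seg[0]:.1f}s to {seg[1]:.1f}s"

def dense_caption(features: np.ndarray) -> List[Tuple[Tuple[float, float], str]]:
    return [(seg, caption_event(features, seg)) for seg in detect_events(features)]

# Demo on random per-second features:
feats = np.random.default_rng(1).normal(size=(120, 256))
print(dense_caption(feats)[:3])
```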

Event and Entity Extraction from Generated Video Captions

josch14/semantic-metadata-extraction-from-videos 5 Nov 2022

Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions.
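
As a rough illustration of this extraction step, an off-the-shelf NLP pipeline can already pull entities and their properties out of a generated caption. The sketch below uses spaCy as an assumed stand-in; the caption is made up, and the paper's actual extraction pipeline may differ.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# A caption such as a dense-captioning model might generate (illustrative):
caption = "A man in a red shirt plays the piano while a crowd claps in a bar."
doc = nlp(caption)

# Named entities plus noun chunks approximate the "entities and properties"
# mined from generated captions.
print([(ent.text, ent.label_) for ent in doc.ents])
print([chunk.text for chunk in doc.noun_chunks])
```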

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

zjr2000/gvl 11 Mar 2023

Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.

VTimeLLM: Empower LLM to Grasp Video Moments

huangb23/vtimellm 30 Nov 2023

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended to Video LLMs that handle video data and comprehend visual details.