Dense Video Captioning

24 papers with code • 4 benchmarks • 7 datasets

Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing the events in a video.
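Concretely, a dense-captioning system outputs a set of temporally localized captions. The sketch below (with made-up timestamps for the piano example above) shows this output structure, plus the temporal IoU metric commonly used to match predicted events against ground-truth segments during evaluation:

```python
# Hypothetical dense-captioning output for the "man playing a piano" video:
# each event pairs a temporal segment (seconds) with a caption.
events = [
    {"start": 0.0, "end": 45.2, "caption": "a man plays a piano"},
    {"start": 12.5, "end": 30.0, "caption": "another man dances"},
    {"start": 40.0, "end": 45.2, "caption": "a crowd claps"},
]

def temporal_iou(a, b):
    """Temporal intersection-over-union between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0
```

A predicted event is typically counted as matching a ground-truth event when their temporal IoU exceeds a threshold (e.g. 0.5), after which the caption quality of matched pairs is scored.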

Latest papers with no code

The 8th AI City Challenge

no code yet • 15 Apr 2024

The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Transportation Systems (ITS), presenting significant research opportunities.

Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

no code yet • 12 Apr 2024

Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the framework of parallel decoding (PDVC) to model visual-language sequences and generate dense captions by chapter for each video.

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

no code yet • 3 Apr 2024

We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that improves the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos.

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

no code yet • 28 Nov 2023

We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view.

Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols

no code yet • 5 Nov 2023

Dense Video Captioning (DVC) aims at detecting and describing different events in a given video.

Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges

no code yet • 25 Sep 2023

Furthermore, we benchmark SOTA models for four multimodal tasks on this newly created dataset; these results serve as new baselines for surveillance video-and-language understanding.

VidChapters-7M: Video Chapters at Scale

no code yet • NeurIPS 2023

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment

no code yet • 5 Jul 2023

This is accomplished by introducing a soft moment mask that represents a temporal segment in the video and jointly optimizing it with the prefix parameters of a language model.
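As a rough illustration of the soft-moment-mask idea, a differentiable "box" over normalized frame positions can be built from the product of two sigmoids. This is a minimal sketch assuming a center/width parameterization and a fixed sharpness constant, not necessarily the paper's exact formulation:

```python
import numpy as np

def soft_moment_mask(num_frames, center, width, sharpness=50.0):
    """Differentiable soft mask over frame positions in [0, 1]: close to 1
    inside the segment [center - width/2, center + width/2], falling
    smoothly to 0 outside. Because it is smooth in (center, width), the
    segment can be optimized by gradient descent jointly with other
    parameters (e.g. a language model's prefix)."""
    t = np.linspace(0.0, 1.0, num_frames)
    left = 1.0 / (1.0 + np.exp(-sharpness * (t - (center - width / 2))))
    right = 1.0 / (1.0 + np.exp(-sharpness * ((center + width / 2) - t)))
    return left * right
```

Multiplying per-frame features by such a mask restricts the caption generator's attention to one temporal segment while keeping the whole pipeline differentiable.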

Visual Transformation Telling

no code yet • 3 May 2023

In this paper, we propose a new visual reasoning task, called Visual Transformation Telling (VTT).

A Review of Deep Learning for Video Captioning

no code yet • 22 Apr 2023

Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human-computer interaction.