Video Description
26 papers with code • 0 benchmarks • 7 datasets
The goal of automatic Video Description is to tell a story about events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, more recently dense video captioning has been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired, or detecting and describing important events in surveillance footage.
Source: Joint Event Detection and Description in Continuous Video Streams
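To make the dense-captioning setting concrete, here is a minimal Python sketch of the two-stage pipeline described above: temporally segment events, then caption each segment. The `propose_events` and `caption_clip` callables are hypothetical placeholders standing in for whatever models a given paper uses.

```python
# A minimal sketch of the dense video captioning interface: segment distinct
# events in time, then caption each one. The proposal and captioning models
# are hypothetical placeholders, not any specific paper's method.
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np

@dataclass
class DescribedEvent:
    start_sec: float   # event start time in the video
    end_sec: float     # event end time
    caption: str       # natural language description of the event

def dense_video_captioning(
    frames: np.ndarray,                                          # (T, H, W, 3) video frames
    fps: float,
    propose_events: Callable[[np.ndarray], List[Tuple[int, int]]],  # -> [(start_frame, end_frame), ...]
    caption_clip: Callable[[np.ndarray], str],                   # clip frames -> sentence
) -> List[DescribedEvent]:
    """Two-stage pipeline: temporal event proposals, then per-event captioning."""
    events = []
    for start_f, end_f in propose_events(frames):
        sentence = caption_clip(frames[start_f:end_f])
        events.append(DescribedEvent(start_f / fps, end_f / fps, sentence))
    # Sort by start time so the captions read as a coherent story.
    return sorted(events, key=lambda e: e.start_sec)
```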
Latest papers
Delving Deeper into the Decoder for Video Captioning
Video captioning is a multi-modal task that aims to describe a video clip with a natural language sentence.
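As a point of reference for what such a decoder does at test time, below is a generic greedy decoding loop. The `decoder_step` interface and the BOS/EOS token ids are assumptions for illustration, not this paper's API.

```python
# Generic greedy decoding for a captioning decoder: repeatedly pick the most
# probable next word until the end-of-sentence token. `decoder_step` is a
# hypothetical interface: (video encoding, tokens so far, state) -> (logits, state).
import numpy as np

BOS, EOS = 1, 2  # assumed special token ids

def greedy_decode(video_encoding, decoder_step, max_len=20):
    tokens, state = [BOS], None
    for _ in range(max_len):
        logits, state = decoder_step(video_encoding, tokens, state)
        next_token = int(np.argmax(logits))   # most probable word id
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop BOS; map ids back to words with your vocabulary
```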
VizSeq: A Visual Analysis Toolkit for Text Generation Tasks
Automatic evaluation of text generation tasks (e.g. machine translation, text summarization, image captioning and video description) usually relies heavily on task-specific metrics, such as BLEU and ROUGE.
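To make this metric-based evaluation concrete, here is a small sentence-level BLEU computation using NLTK; the captions are invented for the example.

```python
# Sentence-level BLEU between a generated caption and reference captions,
# using NLTK (pip install nltk). Toy captions, made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is slicing vegetables in a kitchen".split(),
    "someone chops vegetables on a cutting board".split(),
]
hypothesis = "a man is chopping vegetables in the kitchen".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```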
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.
Grounded Video Description
Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase.
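For a sense of what grounding a noun phrase means in practice, here is an illustrative annotation record; the field names and layout are hypothetical and do not reflect the dataset's actual JSON schema.

```python
# An illustrative (hypothetical) grounded-caption record: each noun phrase in
# the caption is tied to a bounding box in a specific frame of the segment.
annotation = {
    "video_id": "v_example",
    "segment": [12.4, 18.9],                 # caption's start/end time (seconds)
    "caption": "a woman throws a ball to a dog",
    "groundings": [
        {"phrase": "a woman", "frame_sec": 13.0, "bbox": [45, 30, 210, 400]},
        {"phrase": "a dog",   "frame_sec": 13.0, "bbox": [320, 250, 460, 390]},
    ],  # bbox as [x1, y1, x2, y2] in pixels
}
```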
Adversarial Inference for Multi-Sentence Video Description
Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.
End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features
We introduce a new dataset of dialogs about videos of human behaviors.
Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7
Scene-aware dialog systems will be able to have conversations with users about the objects and events around them.
Predicting Visual Features from Text for Image and Video Caption Retrieval
This paper strives to find, among a set of sentences, the one that best describes the content of a given image or video.
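A minimal sketch of this retrieval setup, assuming a `text_to_visual` encoder (a hypothetical stand-in for the paper's learned mapping) that projects sentences into the visual feature space:

```python
# Rank candidate sentences by cosine similarity between the video's feature
# vector and the visual features predicted from each sentence.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_best(video_feat, sentences, text_to_visual):
    # text_to_visual: sentence -> predicted visual feature vector
    scored = [(cosine(video_feat, text_to_visual(s)), s) for s in sentences]
    return max(scored)[1]  # sentence with the highest similarity wins
```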
Egocentric Video Description based on Temporally-Linked Sequences
We propose a novel methodology that exploits information from temporally neighboring events, which matches the sequential nature of egocentric video.
Memory-augmented Attention Modelling for Videos
We present a method to improve video description generation by modeling higher-order interactions between video frames and described concepts.
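For context, below is plain dot-product attention over per-frame features, the generic mechanism that memory-augmented variants build on; this is the standard building block, not the paper's specific model.

```python
# Dot-product attention over frame features: score each frame against the
# decoder state, softmax the scores, and return the weighted frame average.
import numpy as np

def attend(query, frame_feats):
    """query: (d,) decoder state; frame_feats: (T, d) per-frame features."""
    scores = frame_feats @ query                      # (T,) relevance per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over frames
    return weights @ frame_feats                      # (d,) attended context
```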