Video Description

26 papers with code • 0 benchmarks • 7 datasets

The goal of automatic Video Description is to tell a story about events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, dense video captioning has more recently been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired or detecting and describing important events in surveillance footage.

Source: Joint Event Detection and Description in Continuous Video Streams
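
Concretely, a dense video captioning system maps a raw video to a set of temporally localized events, each with its own sentence. A minimal sketch of that output interface in Python (the class, field, and function names here are illustrative assumptions, not taken from the source paper):

```python
from dataclasses import dataclass

@dataclass
class CaptionedEvent:
    """One detected event: a time span plus its natural-language description."""
    start_s: float  # event start, in seconds
    end_s: float    # event end, in seconds
    sentence: str   # generated description of the event

def describe(video_path: str) -> list[CaptionedEvent]:
    """Placeholder for a dense captioning pipeline: segment, then describe."""
    # A real system would run temporal event proposals followed by a
    # captioning decoder; the hard-coded output only shows the interface.
    return [
        CaptionedEvent(0.0, 4.2, "a man walks into the kitchen"),
        CaptionedEvent(4.2, 9.8, "he slices a potato on a cutting board"),
    ]
```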

Latest papers with no code

Relational Graph Learning for Grounded Video Description Generation

no code yet • 2 Dec 2021

Such a setting can help explain the decisions of captioning models and prevent the model from hallucinating object words in its descriptions.
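
The grounding idea can be illustrated with a simple post-hoc check: every object word the model emits should be supported by some detected region label. A hypothetical sketch (the function and its inputs are invented for illustration; this is not the paper's method):

```python
def check_grounding(caption_tokens, detected_labels, object_vocab):
    """Flag object words in a caption that lack a supporting detection."""
    detected = set(detected_labels)
    return [(w, w in detected) for w in caption_tokens if w in object_vocab]

# "frisbee" is an object word with no detected region -> likely hallucinated.
print(check_grounding(
    "a dog chases a frisbee".split(),
    detected_labels=["dog", "grass"],
    object_vocab={"dog", "frisbee", "cat"},
))  # [('dog', True), ('frisbee', False)]
```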

NarrationBot and InfoBot: A Hybrid System for Automated Video Description

no code yet • 7 Nov 2021

To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video.
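
The hybrid design reduces to a simple dispatch: passive playback receives the automatic narration stream, while an explicit user query is routed to the question-answering tool. A toy sketch of that routing (class and method names are invented for illustration):

```python
class HybridDescriber:
    """Route between automatic narration and on-demand Q&A (toy sketch)."""

    def __init__(self, narration_tool, qa_tool):
        self.narration_tool = narration_tool  # a NarrationBot-like describer
        self.qa_tool = qa_tool                # an InfoBot-like query answerer

    def handle(self, video, query=None):
        if query is None:
            return self.narration_tool.describe(video)  # default: narrate
        return self.qa_tool.answer(video, query)        # user asked something
```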

Visual-aware Attention Dual-stream Decoder for Video Captioning

no code yet • 16 Oct 2021

Video captioning is the challenging task of capturing different visual parts of a video and describing them in sentences, which requires both visual and linguistic coherence.

Boosting Video Captioning with Dynamic Loss Network

no code yet • 25 Jul 2021

A significant drawback of existing video captioning methods is that they are optimized with a cross-entropy loss, which is uncorrelated with the de facto evaluation metrics (BLEU, METEOR, CIDEr, ROUGE).
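
The mismatch is easy to see in miniature: cross-entropy is computed token by token against the reference during training, while BLEU-style metrics score n-gram overlap of the decoded sentence. A self-contained sketch (clipped unigram precision stands in for full BLEU; the example numbers are illustrative):

```python
import math
from collections import Counter

def cross_entropy(token_probs):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the core of BLEU-1 (brevity penalty omitted)."""
    ref_counts = Counter(reference)
    hits = sum(min(n, ref_counts[w]) for w, n in Counter(candidate).items())
    return hits / len(candidate)

reference = "a man is slicing a potato".split()
paraphrase = "a person is cutting a potato".split()

# The model can assign high likelihood to the reference (low cross-entropy)...
print(cross_entropy([0.9] * len(reference)))     # ~0.105
# ...yet decode a paraphrase that n-gram metrics penalize heavily.
print(unigram_precision(paraphrase, reference))  # ~0.667
```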

Efficient data-driven encoding of scene motion using Eccentricity

no code yet • 3 Mar 2021

This paper presents a novel approach to representing dynamic visual scenes with static maps generated from video/image streams.
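
Eccentricity here is a recursively updated per-pixel statistic: pixels whose current intensity deviates from their running mean receive high values, so motion accumulates into a static map. A minimal sketch assuming the common TEDA-style recursive formulation (the paper's exact update rules may differ):

```python
import numpy as np

class EccentricityMap:
    """Per-pixel eccentricity over a stream of grayscale frames (sketch)."""

    def __init__(self):
        self.k = 0        # number of frames seen
        self.mean = None  # running per-pixel mean
        self.var = None   # running per-pixel variance

    def update(self, frame):
        frame = frame.astype(np.float64)
        self.k += 1
        if self.k == 1:
            self.mean = frame.copy()
            self.var = np.zeros_like(frame)
            return np.zeros_like(frame)
        # Recursive mean and variance, one scalar per pixel.
        self.mean = ((self.k - 1) * self.mean + frame) / self.k
        dev2 = (frame - self.mean) ** 2
        self.var = (self.k - 1) / self.k * self.var + dev2 / (self.k - 1)
        # Eccentricity is large where a pixel strays from its own history.
        return 1.0 / self.k + dev2 / (self.k * np.maximum(self.var, 1e-12))

# A static motion map can then be the per-pixel maximum of
# ecc.update(frame) over all frames in the stream.
```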

The Role of the Input in Natural Language Video Description

no code yet • 9 Feb 2021

Natural Language Video Description (NLVD) has recently received strong interest in the Computer Vision, Natural Language Processing (NLP), Multimedia, and Autonomous Robotics communities.

Unbox the Blackbox: Predict and Interpret YouTube Viewership Using Deep Learning

no code yet • 21 Dec 2020

Although deep learning excels at viewership prediction, it lacks interpretability, which is fundamental to increasing the adoption of predictive models and to prescribing measures that improve viewership.

MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish

no code yet • 13 Dec 2020

We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphologically rich, agglutinative languages.

A Comprehensive Review on Recent Methods and Challenges of Video Description

no code yet • 30 Nov 2020

In this work, we present a comprehensive survey of the phases of video description approaches, datasets for video description, evaluation metrics, open competitions that motivate research on video description, open challenges in the field, and future research directions.

Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

no code yet • 27 Jul 2020

Automatic video captioning aims to train models to generate text descriptions for all segments in a video; however, the most effective approaches require large amounts of manual annotation, which is slow and expensive.
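
The loop such active learning methods suggest: score each unlabeled clip by how much an ensemble of captioning models disagrees on it, then spread the labeling budget across clusters so the selected batch stays diverse. A hypothetical selection rule (uncertainty scores and cluster assignments are assumed precomputed; this is not the paper's exact criterion):

```python
from collections import defaultdict

def select_batch(scores, clusters, budget):
    """Pick the most uncertain clip per cluster first, for diversity.

    scores:   {clip_id: ensemble-disagreement score}
    clusters: {clip_id: cluster_id}
    """
    by_cluster = defaultdict(list)
    for clip, score in scores.items():
        by_cluster[clusters[clip]].append((score, clip))
    # One head per cluster: its highest-uncertainty clip.
    heads = sorted((max(group) for group in by_cluster.values()), reverse=True)
    picked = [clip for _, clip in heads[:budget]]
    # Spend any leftover budget on a global ranking of the remainder.
    if len(picked) < budget:
        chosen = set(picked)
        rest = sorted(((s, c) for c, s in scores.items() if c not in chosen),
                      reverse=True)
        picked += [c for _, c in rest[:budget - len(picked)]]
    return picked
```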