Video Description
26 papers with code • 0 benchmarks • 7 datasets
The goal of automatic Video Description is to tell a story about the events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, more recently, dense video captioning has been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired, or detecting and describing important events in surveillance footage.
Source: Joint Event Detection and Description in Continuous Video Streams
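The two-stage structure described above (segment events in time, then describe each one) can be sketched in a few lines. This is a toy illustration only: `propose_events` and `caption_event` are hypothetical placeholders standing in for the learned temporal-proposal and captioning models a real system would use.

```python
# Toy sketch of a dense video captioning pipeline.
# propose_events and caption_event are hypothetical stand-ins for
# learned models; a real system would use trained networks here.

def propose_events(num_frames, window=8):
    """Placeholder temporal proposal: fixed-size sliding windows."""
    return [(start, min(start + window, num_frames))
            for start in range(0, num_frames, window)]

def caption_event(frames, start, end):
    """Placeholder captioner: describes a segment of frames."""
    segment = frames[start:end]
    return f"Event from frame {start} to {end} ({len(segment)} frames)"

def dense_caption(frames):
    """Segment distinct events in time, then describe each one."""
    captions = []
    for start, end in propose_events(len(frames)):
        captions.append((start, end, caption_event(frames, start, end)))
    return captions

video = list(range(20))  # stand-in for 20 decoded frames
for start, end, sentence in dense_caption(video):
    print(sentence)
```

The key design point the sketch captures is that, unlike single-clip captioning, the caption generator runs once per proposed event, producing a series of sentences rather than one.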
Benchmarks
These leaderboards are used to track progress in Video Description
Datasets
Latest papers with no code
Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering
To solve the issue for the intermediate layers, we propose an efficient Quaternion Block Network (QBN) to learn interaction not only for the last layer but also for all intermediate layers simultaneously.
Prediction and Description of Near-Future Activities in Video
Most of the existing works on human activity analysis focus on recognition or early recognition of the activity labels from complete or partial observations.
End-to-End Video Captioning
The decoder is then optimised on such static features to generate the video's description.
A Dataset for Telling the Stories of Social Media Videos
Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories.
Incorporating Background Knowledge into Video Description Generation
We develop an approach that uses video meta-data to retrieve topically related news documents for a video and extracts the events and named entities from these documents.
Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions
We validate the effectiveness of our ASST on two large-scale datasets.
Bridge Video and Text with Cascade Syntactic Structure
We present a video captioning approach that encodes features by progressively completing syntactic structure (LSTM-CSS).
Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data
In this paper, we investigate the effectiveness of training a multimodal neural machine translation (MNMT) system with image features for a low-resource language pair, Hindi and English, using synthetic data.
Video Description: A Survey of Methods, Datasets and Evaluation Metrics
Video description is the automatic generation of natural language sentences that describe the contents of a given video.
Interpretable Video Captioning via Trajectory Structured Localization
Automatically describing open-domain videos with natural language is attracting increasing interest in the field of artificial intelligence.