Video Description
26 papers with code • 0 benchmarks • 7 datasets
The goal of automatic Video Description is to tell a story about events happening in a video. While early video description methods produced captions for short clips that had been manually segmented to contain a single event of interest, dense video captioning has more recently been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired, or detecting and describing important events in surveillance footage.
Source: Joint Event Detection and Description in Continuous Video Streams
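To make the two-stage formulation concrete, here is a minimal Python sketch of a segment-then-describe pipeline. The `proposer` and `captioner` callables are hypothetical stand-ins for trained models and are not taken from the cited paper:

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float        # segment start, in seconds
    end: float          # segment end, in seconds
    sentence: str = ""  # natural-language description

def dense_caption(video_path: str, proposer, captioner) -> list[Event]:
    """Two-stage dense video captioning: (1) temporally segment the
    video into candidate events, (2) describe each segment in natural
    language. `proposer` and `captioner` are stand-ins for any trained
    models exposing the interfaces used below.
    """
    events = [Event(start=s, end=e) for s, e in proposer(video_path)]
    for ev in events:
        # Caption each proposed segment independently; models aiming for
        # coherent multi-sentence stories would also condition on context.
        ev.sentence = captioner(video_path, ev.start, ev.end)
    return events
```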
Latest papers with no code
X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model
The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making.
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Next, we finetune a retrieval model on a small subset in which the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation.
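A minimal sketch of that selection step, assuming a dual-encoder (CLIP-style) retrieval model that embeds the video and each candidate caption into a shared space; the embedding model itself and the names below are illustrative, not Panda-70M's actual API:

```python
import numpy as np

def select_best_caption(video_emb: np.ndarray,
                        caption_embs: np.ndarray,
                        captions: list[str]) -> str:
    """Pick the candidate caption whose embedding is most similar to
    the video embedding.

    `video_emb` (d,) and `caption_embs` (n, d) are assumed to come from
    a finetuned dual-encoder retrieval model; producing them is not
    shown here.
    """
    # Cosine similarity between the video and each candidate caption.
    v = video_emb / np.linalg.norm(video_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ v
    return captions[int(np.argmax(scores))]
```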
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)
Towards designing this ability into algorithms, we present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, the Reuters Video-Language News (ReutersViLNews) dataset, which focuses on high-level video-language understanding with an emphasis on long-form news.
ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition
Building on the proposed ActionHub dataset, we further introduce a novel Cross-modality and Cross-action Modeling (CoCo) framework for zero-shot action recognition (ZSAR), which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module.
Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)
Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain.
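For reference, a compact PyTorch sketch of a generic attention-based encoder-decoder captioner of the kind the title describes; the dimensions, layer choices, and teacher forcing below are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AttnCaptioner(nn.Module):
    """Minimal attention-based encoder-decoder for video captioning:
    a GRU encodes per-frame CNN features, and a GRU decoder attends
    over the encoder states while emitting caption tokens.
    """
    def __init__(self, feat_dim=2048, hid=512, vocab=8000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.attn = nn.MultiheadAttention(hid, num_heads=1, batch_first=True)
        self.decoder = nn.GRUCell(hid * 2, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, frames, tokens):
        # frames: (B, T, feat_dim) precomputed CNN frame features
        # tokens: (B, L) ground-truth caption tokens (teacher forcing)
        enc, h = self.encoder(frames)        # enc: (B, T, hid)
        h = h.squeeze(0)                     # (B, hid) decoder state
        logits = []
        for t in range(tokens.size(1)):
            q = h.unsqueeze(1)               # query = current state
            ctx, _ = self.attn(q, enc, enc)  # attend over frame features
            inp = torch.cat([self.embed(tokens[:, t]), ctx.squeeze(1)], -1)
            h = self.decoder(inp, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)    # (B, L, vocab)
```

At inference time the same decoder loop would feed back its own argmax (or beam-search) predictions instead of ground-truth tokens.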
Multi Sentence Description of Complex Manipulation Action Videos
Automatic video description requires the generation of natural language statements about the actions, events, and objects in the video.
CLearViD: Curriculum Learning for Video Description
We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning during training.
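Curriculum learning itself is a general training strategy: present easier samples first and gradually admit harder ones. The sketch below uses a per-sample difficulty score (e.g. caption length) and a linear pool-growth schedule; both choices are assumptions for illustration, not CLearViD's actual curriculum:

```python
from torch.utils.data import DataLoader, Subset

def curriculum_loader(dataset, difficulty, epoch, total_epochs, batch_size=32):
    """Easy-to-hard curriculum: train on the easiest samples first and
    linearly grow the pool each epoch until the full dataset is used.

    `difficulty` holds one score per sample; the proxy and the linear
    schedule are illustrative assumptions.
    """
    # Indices sorted from easiest to hardest.
    order = sorted(range(len(dataset)), key=lambda i: difficulty[i])
    # Fraction of the data admitted at this epoch: 20% up to 100%.
    frac = min(1.0, 0.2 + 0.8 * epoch / max(1, total_epochs - 1))
    pool = order[: max(batch_size, int(frac * len(order)))]
    return DataLoader(Subset(dataset, pool), batch_size=batch_size, shuffle=True)
```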
Analyzing Political Figures in Real-Time: Leveraging YouTube Metadata for Sentiment Analysis
Sentiment analysis of large-scale YouTube video metadata can be used to analyze public opinion on political figures representing different political parties.
Edit As You Wish: Video Description Editing with Multi-grained Commands
In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests.
Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation
Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video.
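One plausible reading of fractional positional encoding is a standard sinusoidal encoding evaluated at real-valued positions derived from each token's timestamp, so that audio and visual frames sampled at different rates share one time axis. The sketch below follows that reading; the normalization constant and function name are assumptions, not necessarily the paper's formulation:

```python
import numpy as np

def fractional_positional_encoding(timestamps, duration, d_model=512, scale=100.0):
    """Sinusoidal positional encoding evaluated at fractional positions.

    Each token (video frame or audio frame) gets the real-valued
    position timestamp / duration * scale, so streams sampled at
    different rates land on a shared time axis. `scale` and the
    normalization are illustrative; `d_model` must be even.
    """
    pos = np.asarray(timestamps, dtype=np.float64) / duration * scale  # (T,)
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / d_model))                       # (d/2,)
    ang = pos[:, None] * freqs[None, :]                                # (T, d/2)
    pe = np.empty((len(pos), d_model))
    pe[:, 0::2] = np.sin(ang)
    pe[:, 1::2] = np.cos(ang)
    return pe
```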