Video Captioning

51 papers with code · Computer Vision

Video Captioning is the task of automatically generating captions for a video by understanding the actions and events in it, which can also enable efficient retrieval of the video through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Greatest papers with code

NMT-Keras: a Very Flexible Toolkit with a Focus on Interactive NMT and Online Learning

9 Jul 2018 lvapeab/nmt-keras

We present NMT-Keras, a flexible toolkit for training deep learning models, which puts a particular emphasis on the development of advanced applications of neural machine translation systems, such as interactive-predictive translation protocols and long-term adaptation of the translation system via continuous learning.

MACHINE TRANSLATION QUESTION ANSWERING SENTENCE CLASSIFICATION VIDEO CAPTIONING VISUAL QUESTION ANSWERING

OmniNet: A unified architecture for multi-modal multi-task learning

17 Jul 2019 subho406/OmniNet

We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.

IMAGE CAPTIONING MULTI-TASK LEARNING PART-OF-SPEECH TAGGING QUESTION ANSWERING VIDEO CAPTIONING VIDEO QUESTION ANSWERING VISUAL QUESTION ANSWERING

ECO: Efficient Convolutional Network for Online Video Understanding

ECCV 2018 mzolfaghari/ECO-efficient-video-understanding

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

Ranked #22 on Action Recognition on Something-Something V1 (using extra training data)

ACTION CLASSIFICATION ACTION RECOGNITION VIDEO CAPTIONING VIDEO RETRIEVAL VIDEO UNDERSTANDING
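
A minimal PyTorch sketch of this kind of design (module names and layer sizes are illustrative, not the authors' architecture): a shared 2D CNN encodes each sampled frame, and a small 3D CNN mixes the stacked frame features over time, so long-term content is covered in a single pass per video.

    import torch
    import torch.nn as nn

    class ECOStyleNet(nn.Module):
        """Sketch of a 2D-then-3D hybrid; all layer sizes are illustrative."""
        def __init__(self, num_classes=174):  # e.g. Something-Something V1
            super().__init__()
            # Shared 2D CNN applied to every sampled frame independently.
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d((28, 28)),
            )
            # Small 3D CNN that mixes the stacked frame features over time.
            self.temporal_mixer = nn.Sequential(
                nn.Conv3d(64, 128, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),
            )
            self.classifier = nn.Linear(128, num_classes)

        def forward(self, clips):                 # clips: (B, T, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.frame_encoder(clips.flatten(0, 1))  # (B*T, 64, 28, 28)
            feats = feats.view(b, t, *feats.shape[1:])       # (B, T, 64, 28, 28)
            feats = feats.permute(0, 2, 1, 3, 4)             # (B, 64, T, 28, 28)
            pooled = self.temporal_mixer(feats).flatten(1)   # (B, 128)
            return self.classifier(pooled)

    logits = ECOStyleNet()(torch.randn(2, 16, 3, 112, 112))  # one pass per video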

Delving Deeper into Convolutional Networks for Learning Video Representations

19 Nov 2015 yaoli/arctic-capgen-vid

We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts", using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.

ACTION RECOGNITION VIDEO CAPTIONING
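
A rough PyTorch sketch of this idea, with hypothetical module names and sizes: pool "percepts" from two depths of a pretrained ImageNet CNN for every frame, then run a GRU over the per-frame percept vectors; the final hidden state is a video code that a caption decoder could consume.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class PerceptGRUEncoder(nn.Module):
        """Sketch: multi-level CNN 'percepts' per frame, GRU over time."""
        def __init__(self, hidden=512):
            super().__init__()
            cnn = resnet18(weights="IMAGENET1K_V1")  # ImageNet-trained backbone
            # Tap two intermediate levels of the pretrained network.
            self.stem = nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu,
                                      cnn.maxpool, cnn.layer1, cnn.layer2)  # 128 ch
            self.deep = cnn.layer3                                          # 256 ch
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.gru = nn.GRU(128 + 256, hidden, batch_first=True)

        def forward(self, frames):               # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            x = frames.flatten(0, 1)
            mid = self.stem(x)                   # mid-level percepts
            deep = self.deep(mid)                # deeper percepts
            percepts = torch.cat([self.pool(mid).flatten(1),
                                  self.pool(deep).flatten(1)], dim=1)
            percepts = percepts.view(b, t, -1)   # (B, T, 384)
            _, h_n = self.gru(percepts)
            return h_n[-1]                       # video code for a caption decoder

    video_code = PerceptGRUEncoder()(torch.randn(2, 8, 3, 224, 224))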

Oracle performance for visual captioning

14 Nov 2015 yaoli/arctic-capgen-vid

The task of associating images and videos with a natural language description has attracted a great amount of attention recently.

IMAGE CAPTIONING LANGUAGE MODELLING VIDEO CAPTIONING

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

CVPR 2018 JaywongWang/DenseVideoCaptioning

We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.

DENSE VIDEO CAPTIONING
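
The proposal mechanism can be sketched in a few lines of PyTorch (all names hypothetical, not the authors' code): one GRU scans the clip features forward for past context, a second scans them backward for future context, and a learned gate fuses the two views before each time step is scored as an event proposal.

    import torch
    import torch.nn as nn

    class BidirProposalScorer(nn.Module):
        """Sketch: fuse past (forward GRU) and future (backward GRU) context
        with a learned gate, then score each time step as an event proposal."""
        def __init__(self, feat_dim=500, hidden=256):
            super().__init__()
            self.fwd = nn.GRU(feat_dim, hidden, batch_first=True)
            self.bwd = nn.GRU(feat_dim, hidden, batch_first=True)
            self.gate = nn.Linear(2 * hidden, hidden)   # context gating
            self.score = nn.Linear(hidden, 1)

        def forward(self, feats):                       # feats: (B, T, D)
            past, _ = self.fwd(feats)                   # past context
            future, _ = self.bwd(feats.flip(1))         # future context
            future = future.flip(1)
            g = torch.sigmoid(self.gate(torch.cat([past, future], -1)))
            fused = g * past + (1 - g) * future         # gated fusion
            return torch.sigmoid(self.score(fused)).squeeze(-1)  # (B, T)

    proposal_probs = BidirProposalScorer()(torch.randn(2, 40, 500))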

End-to-End Dense Video Captioning with Masked Transformer

CVPR 2018 salesforce/densecap

To address this problem, we propose an end-to-end transformer model for dense video captioning.

DENSE VIDEO CAPTIONING
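
The end-to-end coupling can be illustrated with a small, hypothetical PyTorch snippet: a differentiable soft mask derived from a proposal's center and length suppresses video features outside the proposed segment, so the caption decoder attends only within it and gradients flow back into the proposal parameters.

    import torch

    def proposal_mask(center, length, T, sharpness=10.0):
        """Differentiable soft mask over T time steps for a proposal given by
        a fractional center and length; center, length have shape (B,)."""
        t = torch.linspace(0, 1, T)                    # (T,)
        lo = (center - length / 2).unsqueeze(-1)       # (B, 1)
        hi = (center + length / 2).unsqueeze(-1)
        # Product of two sigmoids: ~1 inside [lo, hi], ~0 outside.
        return torch.sigmoid(sharpness * (t - lo)) * torch.sigmoid(sharpness * (hi - t))

    feats = torch.randn(2, 100, 512)                   # (B, T, D) video features
    mask = proposal_mask(torch.tensor([0.3, 0.7]),
                         torch.tensor([0.2, 0.1]), T=100)   # (B, T)
    masked = feats * mask.unsqueeze(-1)  # decoder attends only inside the proposal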

Learning to Generate Grounded Visual Captions without Localization Supervision

1 Jun 2019facebookresearch/ActivityNet-Entities

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions to output particular words, or whether it is hallucinating based on priors in the dataset and/or the language model.

IMAGE CAPTIONING LANGUAGE MODELLING VIDEO CAPTIONING

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

CVPR 2016 LisaAnne/DCC

Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet.

IMAGE CAPTIONING OBJECT RECOGNITION VIDEO CAPTIONING

Temporal Tessellation: A Unified Approach for Video Analysis

ICCV 2017 dot27/temporal-tessellation

A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video.

ACTION DETECTION VIDEO CAPTIONING VIDEO SUMMARIZATION VIDEO UNDERSTANDING
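
A minimal sketch of the transfer step, assuming clips are already embedded in a shared space (the embeddings below are random stand-ins): match each test clip to its nearest reference clip by cosine similarity and copy the reference annotation over.

    import torch

    def transfer_semantics(test_clips, ref_clips, ref_labels):
        """Sketch: nearest-neighbour correspondence in a shared embedding space.
        test_clips: (N, D), ref_clips: (M, D), ref_labels: list of M annotations."""
        test = torch.nn.functional.normalize(test_clips, dim=1)
        ref = torch.nn.functional.normalize(ref_clips, dim=1)
        nearest = (test @ ref.T).argmax(dim=1)    # most cosine-similar reference
        return [ref_labels[i] for i in nearest]   # transferred semantics

    captions = transfer_semantics(torch.randn(4, 256), torch.randn(50, 256),
                                  [f"ref caption {i}" for i in range(50)])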