Video captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which in turn enables efficient text-based retrieval of videos.
We present NMT-Keras, a flexible toolkit for training deep learning models, which puts a particular emphasis on the development of advanced applications of neural machine translation systems, such as interactive-predictive translation protocols and long-term adaptation of the translation system via continuous learning.
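To make the kind of model such a toolkit trains concrete, here is a minimal encoder-decoder sketch in plain Keras. This is not the NMT-Keras API; vocabulary sizes and layer dimensions are placeholder assumptions.

```python
from tensorflow.keras import layers, Model

SRC_VOCAB, TGT_VOCAB, EMB, HID = 8000, 8000, 256, 512  # assumed sizes

# Encoder: embed source tokens and summarize them with an LSTM.
src = layers.Input(shape=(None,), dtype="int32")
enc = layers.Embedding(SRC_VOCAB, EMB, mask_zero=True)(src)
_, h, c = layers.LSTM(HID, return_state=True)(enc)

# Decoder: conditioned on the encoder's final state, predicts target tokens.
tgt = layers.Input(shape=(None,), dtype="int32")
dec = layers.Embedding(TGT_VOCAB, EMB, mask_zero=True)(tgt)
dec = layers.LSTM(HID, return_sequences=True)(dec, initial_state=[h, c])
out = layers.Dense(TGT_VOCAB, activation="softmax")(dec)

model = Model([src, tgt], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```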
We also show that using this neural network pre-trained on some modalities assists in learning unseen tasks such as video captioning and video question answering.
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Ranked #22 on Action Recognition on Something-Something V1 (using extra training data)
We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.
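A hedged sketch of the percepts-plus-GRU idea: per-frame features are taken from several levels of an ImageNet-pretrained CNN, spatially pooled, concatenated, and fed to a GRU that models temporal structure. The layer choices and dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
# Percepts from two intermediate blocks plus the last block (assumed choice).
extractor = create_feature_extractor(cnn, return_nodes=["layer2", "layer3", "layer4"])
pool = nn.AdaptiveAvgPool2d(1)

def frame_percepts(frames):            # frames: (T, 3, 224, 224)
    with torch.no_grad():
        feats = extractor(frames)      # dict of (T, C, H, W) feature maps
    # Spatially pool each level and concatenate into one percept per frame.
    return torch.cat([pool(f).flatten(1) for f in feats.values()], dim=1)

gru = nn.GRU(input_size=512 + 1024 + 2048, hidden_size=512, batch_first=True)

frames = torch.randn(16, 3, 224, 224)      # 16 video frames (dummy data)
percepts = frame_percepts(frames)           # (16, 3584)
_, video_code = gru(percepts.unsqueeze(0))  # final state summarizes the clip
```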
We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions.
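One way to realize this, sketched below under the assumption that clip-level features are precomputed: a bidirectional GRU gives each time step a state that fuses past (forward pass) and future (backward pass) context, and a linear head scores several anchor lengths per step. The dimensions and anchor count are illustrative.

```python
import torch
import torch.nn as nn

class BiProposalScorer(nn.Module):
    def __init__(self, feat_dim=500, hidden=256, num_anchors=8):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, num_anchors)

    def forward(self, clips):                  # clips: (B, T, feat_dim)
        states, _ = self.rnn(clips)            # (B, T, 2*hidden): past + future
        return torch.sigmoid(self.score(states))  # (B, T, K) proposal confidences

scorer = BiProposalScorer()
conf = scorer(torch.randn(2, 120, 500))  # scores for 8 anchors at 120 steps
```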
We propose an end-to-end transformer model for dense video captioning.
Ranked #2 on Video Captioning on YouCook2
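A minimal encoder-decoder transformer sketch for captioning from precomputed frame features follows. It omits the event-proposal module a full dense-captioning system would include, as well as positional encodings; all sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, vocab=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # frame features -> model dim
        self.embed = nn.Embedding(vocab, d_model)  # caption token embeddings
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, tokens):              # (B, T, feat_dim), (B, L)
        # Causal mask so each caption position attends only to earlier tokens.
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.transformer(self.proj(feats), self.embed(tokens), tgt_mask=causal)
        return self.out(h)                         # (B, L, vocab) logits

model = VideoCaptioner()
logits = model(torch.randn(2, 48, 1024), torch.randint(0, 10000, (2, 12)))
```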
When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions to output particular words, or whether it is hallucinating based on priors in the dataset and/or the language model.
Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet.
A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, after which the reference semantics can be transferred to the test video.
Ranked #5 on Video Retrieval on MSR-VTT
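A hedged sketch of such semantics transfer by clip correspondence: each test clip is matched to its most similar reference clip in an embedding space, and the matched references vote for the test video's caption. Embeddings are assumed precomputed, and the matching rule here is a simple cosine nearest-neighbor, chosen for illustration.

```python
import torch
import torch.nn.functional as F

def transfer_semantics(test_clips, ref_clips, ref_captions):
    """test_clips: (T, D); ref_clips: (R, D); ref_captions: list of R strings."""
    # Cosine similarity between every test clip and every reference clip.
    sim = F.normalize(test_clips, dim=1) @ F.normalize(ref_clips, dim=1).T  # (T, R)
    best = sim.argmax(dim=1)               # best reference clip per test clip
    # Majority vote over matched clips picks the caption for the whole video.
    counts = torch.bincount(best, minlength=ref_clips.size(0))
    return ref_captions[counts.argmax().item()]

caption = transfer_semantics(
    torch.randn(10, 512), torch.randn(200, 512),
    [f"reference caption {i}" for i in range(200)])
```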