87 papers with code • 0 benchmarks • 25 datasets
A crucial task of Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.
In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, localizing actions in space and time and yielding 1.58M action labels, with multiple labels per person occurring frequently.
Ranked #2 on Temporal Action Localization on J-HMDB-21
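A notable property of AVA-style annotation is that a single person box at a keyframe can carry several atomic-action labels at once. The sketch below illustrates that multi-label structure; the field names and label strings are hypothetical, not the official AVA CSV schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an AVA-style annotation record: each person box at a
# keyframe can carry several atomic-action labels simultaneously
# (e.g. "stand" together with "talk to person").
@dataclass
class PersonAnnotation:
    video_id: str
    timestamp_s: float   # keyframe time within the 15-minute clip
    box: tuple           # (x1, y1, x2, y2), normalized to [0, 1]
    action_labels: list = field(default_factory=list)  # multiple actions per person

ann = PersonAnnotation("clip_0001", 902.0, (0.1, 0.2, 0.4, 0.9))
ann.action_labels += ["stand", "talk to person"]  # multi-label: both apply at once
print(len(ann.action_labels))  # → 2
```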
Learning to represent videos is a very challenging task both algorithmically and computationally.
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Ranked #1 on Video Classification on Kinetics
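The idea behind such a grid schedule is to train on coarse spatiotemporal resolutions early (short clips, small crops, large batches) and progressively move to the fine baseline setting, keeping the per-iteration cost roughly constant. The sketch below illustrates that trade-off; the specific grid values and constant-cost rule are assumptions for illustration, not the published multigrid schedule.

```python
# Illustrative grid schedule for video training: cycle coarse-to-fine
# (clip length T, crop size S) settings, scaling the batch size so that
# T * S^2 * batch stays roughly constant per iteration.
# The grids below are assumptions, not the published multigrid schedule.

BASE_T, BASE_S, BASE_BATCH = 16, 224, 8  # baseline clip length, crop size, batch

def grid_schedule(num_epochs, grids=((4, 112), (8, 160), (16, 224))):
    """Yield (epoch, T, S, batch), with batch scaled to keep cost constant."""
    base_cost = BASE_T * BASE_S ** 2 * BASE_BATCH
    for epoch in range(num_epochs):
        t, s = grids[epoch * len(grids) // num_epochs]  # coarse early, fine late
        batch = base_cost // (t * s ** 2)               # constant-cost batch size
        yield epoch, t, s, batch

for epoch, t, s, b in grid_schedule(6):
    print(epoch, t, s, b)
# early epochs use 4-frame 112px clips with a 16x larger batch;
# the final epochs return to the 16-frame 224px baseline at batch 8
```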
The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.
Ranked #10 on Video Object Detection on ImageNet VID
We demonstrate that both RNNs (using LSTMs) and Temporal ConvNets applied to spatiotemporal feature matrices can exploit spatiotemporal dynamics to improve overall performance.
Ranked #46 on Action Recognition on HMDB-51
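In this family of models, a per-frame CNN produces a T x D "spatiotemporal feature matrix" (one D-dimensional feature per frame), and a temporal model, recurrent (LSTM) or convolutional, then aggregates dynamics across frames. The sketch below shows only the temporal-convolution side as a minimal hand-rolled 1-D filter; it is an illustrative stand-in, not the paper's architecture.

```python
import numpy as np

# Minimal sketch (not the paper's architecture): apply a 1-D temporal
# convolution across the frame axis of a (T, D) feature matrix.
# An LSTM run over the same matrix is the recurrent alternative.

def temporal_conv1d(features, kernel):
    """features: (T, D) per-frame features; kernel: (K,) temporal filter.
    Returns (T-K+1, D): each output row mixes K consecutive frames."""
    T, _ = features.shape
    K = len(kernel)
    return np.stack([kernel @ features[t:t + K] for t in range(T - K + 1)])

feats = np.arange(12, dtype=float).reshape(6, 2)          # 6 frames, 2-dim features
smoothed = temporal_conv1d(feats, np.array([0.25, 0.5, 0.25]))
print(smoothed.shape)  # → (4, 2)
```

Stacking such filters (with nonlinearities in between) gives a Temporal ConvNet whose receptive field grows with depth, which is the key design contrast with an LSTM's unbounded recurrence.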