Video Classification
172 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels but also best describes the entire video given the features and annotations of its individual frames. For example, a video might contain a tree in some frames, but the label central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task: typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame.
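The distinction between frame-level and video-level prediction above can be sketched with a minimal aggregation scheme: average per-frame class scores over time, then take the argmax. This is a hedged illustration only (the function name, class set, and mean pooling are assumptions, not a method from any listed paper; real systems may use attention, recurrent models, or 3D convolutions instead):

```python
import numpy as np

def classify_video(frame_logits: np.ndarray) -> int:
    """Aggregate per-frame logits of shape (T, C) into one video-level label.

    Mean pooling over the T frames is the simplest aggregation strategy;
    it lets frames that merely contain incidental objects (e.g., a tree)
    be outvoted by the context shared across the whole clip.
    """
    video_logits = frame_logits.mean(axis=0)  # average scores over time
    return int(video_logits.argmax())         # index of the predicted class

# Hypothetical example: 4 frames, 3 classes ("tree", "hiking", "dog")
logits = np.array([
    [2.0, 1.0, 0.1],   # one frame dominated by a tree
    [0.5, 3.0, 0.2],   # remaining frames suggest hiking
    [0.4, 2.5, 0.3],
    [0.6, 2.8, 0.1],
])
print(classify_video(logits))  # -> 1, the "hiking" class
```

Even though the first frame's strongest score is the "tree" class, the pooled video-level prediction is "hiking", matching the example in the description above.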
Libraries
Use these libraries to find Video Classification models and implementations.
Most implemented papers
Billion-scale semi-supervised learning for image classification
This paper presents a study of semi-supervised learning with large convolutional networks.
Reversible Vision Transformers
Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes.
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101.
Fine-grained Activity Recognition in Baseball Videos
In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.
Timeception for Complex Action Recognition
This paper focuses on the temporal aspect for recognizing human activities in videos; an important visual cue that has long been undervalued.
Gated Channel Transformation for Visual Recognition
This lightweight layer incorporates a simple l2 normalization, making our transformation unit applicable at the operator level without much increase in additional parameters.
A Multigrid Method for Efficiently Training Video Models
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Non-Local Neural Networks With Grouped Bilinear Attentional Transforms
The core of our method is a learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are threefold: first, BA-Transform is versatile enough to model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions.
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
Revisiting ResNets: Improved Training and Scaling Strategies
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x to 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.