Video Classification
172 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to the video given its frames. A good video level classifier is one that not only provides accurate frame labels, but also best describes the entire video given the features and the annotations of the various frames in the video. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., “hiking”). The granularity of the labels that are needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels for each frame inside the video.
Libraries
Use these libraries to find Video Classification models and implementationsDatasets
Most implemented papers
Video Classification with Channel-Separated Convolutional Networks
It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing to tackle both redundancy and dependency for efficient and effective representation learning.
YouTube-8M: A Large-Scale Video Classification Benchmark
Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow.
ECO: Efficient Convolutional Network for Online Video Understanding
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Learnable pooling with Context Gating for video classification
In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets.
Representation Flow for Action Recognition
Our representation flow layer is a fully-differentiable layer designed to capture the `flow' of any representation channel within a convolutional neural network for action recognition.
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
In this study, we focus on transferring knowledge for video classification tasks.
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
We demonstrate that using both RNNs (using LSTMs) and Temporal-ConvNets on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.