Video Classification

172 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of producing one or more labels that describe a video given its frames. A good video-level classifier not only predicts accurate frame labels but also best describes the entire video given the features and annotations of its individual frames. For example, a video might contain a tree in some frames, but the label central to the video might be something else (e.g., “hiking”). The granularity of the labels needed depends on the task: typical settings include assigning one or more global labels to the whole video, or assigning one or more labels to each frame inside the video.
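
The distinction between frame-level and video-level labels can be made concrete with a minimal sketch. The pooling-then-scoring scheme below is a common baseline, not any specific paper's method; the function names, the averaging aggregator, and the toy weights are all illustrative assumptions.

```python
import numpy as np

def classify_video(frame_features, weights, labels):
    """Hypothetical video-level classifier: average-pool per-frame features
    into one video descriptor, then score each candidate label linearly."""
    video_feature = frame_features.mean(axis=0)   # (dim,) pooled over frames
    scores = weights @ video_feature              # one score per label
    return labels[int(np.argmax(scores))]

# Toy example: 4 frames with 3-dim features, two candidate video labels.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 3))
weights = np.eye(2, 3)  # stand-in for learned label weights
video_label = classify_video(frames, weights, ["hiking", "tree"])
```

A single frame showing a tree can still be outvoted by the pooled evidence of the other frames, which is the behavior the task description asks of a good video-level classifier.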

Source: Efficient Large Scale Video Classification

Most implemented papers

Token Shift Transformer for Video Classification

VideoNetworks/TokShift-Transformer 5 Aug 2021

It is worth noticing that our TokShift transformer is a pure, convolution-free video transformer with computational efficiency for video understanding.
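
The core token-shift idea can be sketched without any learned parameters: move a fraction of each token's channels one step forward or backward in time, leaving the rest untouched. This is a simplified, fixed sketch of the shift operation only (the `fold_div` split and tensor layout are assumptions), not the full TokShift transformer.

```python
import numpy as np

def token_shift(x, fold_div=4):
    """Hypothetical temporal token shift. x has shape (T, N, C):
    frames, tokens, channels. One quarter of the channels is shifted
    forward in time, one quarter backward, and the rest is unchanged."""
    t, n, c = x.shape
    fold = c // fold_div
    out = x.copy()
    out[1:, :, :fold] = x[:-1, :, :fold]                  # shift forward
    out[:-1, :, fold:2 * fold] = x[1:, :, fold:2 * fold]  # shift backward
    return out

x = np.arange(2 * 1 * 4, dtype=float).reshape(2, 1, 4)
y = token_shift(x)
```

The operation adds zero parameters and zero FLOPs beyond a memory copy, which is what makes shift-based temporal modeling cheap.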

Deep Temporal Linear Encoding Networks

bryanyzhu/two-stream-pytorch CVPR 2017

Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification; and (c) they model feature interactions in a more expressive way and without loss of information.
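
Point (a), encoding the whole video into one compact descriptor, can be sketched as follows. This is a loose illustration assuming element-wise multiplication as the aggregation function and an outer-product bilinear encoding with the usual signed-sqrt/L2 normalization; the paper's actual TLE layer is learned end-to-end.

```python
import numpy as np

def temporal_linear_encoding(segment_features):
    """Hypothetical TLE-style sketch: aggregate per-segment features with
    element-wise multiplication, then bilinear-encode (outer product) into
    one compact video-level descriptor."""
    aggregated = segment_features[0]
    for f in segment_features[1:]:
        aggregated = aggregated * f                 # element-wise aggregation
    encoded = np.outer(aggregated, aggregated).ravel()  # bilinear encoding
    # signed square-root + L2 normalization, common for bilinear features
    encoded = np.sign(encoded) * np.sqrt(np.abs(encoded))
    return encoded / (np.linalg.norm(encoded) + 1e-12)

feats = [np.array([1.0, 2.0]), np.array([0.5, 1.0])]
desc = temporal_linear_encoding(feats)
```

Because the aggregation is defined on feature maps, the same recipe applies whether the segment features come from 2D or 3D CNN backbones, which is point (b) above.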

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

ZhaofanQiu/pseudo-3d-residual-networks ICCV 2017

In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating $3\times3\times3$ convolutions with $1\times3\times3$ convolutional filters on spatial domain (equivalent to 2D CNN) plus $3\times1\times1$ convolutions to construct temporal connections on adjacent feature maps in time.
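
The efficiency argument behind this factorization is easy to verify by counting weights per input/output channel pair, as in the short sketch below.

```python
# Parameter count of a full 3x3x3 kernel vs. the P3D-style factorization:
# a 1x3x3 spatial kernel (2D-CNN-equivalent) followed by a 3x1x1 temporal
# kernel, per input/output channel pair.
full_3d = 3 * 3 * 3            # 27 weights
spatial = 1 * 3 * 3            # 9 weights on the spatial domain
temporal = 3 * 1 * 1           # 3 weights connecting adjacent frames
factored = spatial + temporal  # 12 weights, ~2.25x fewer than full 3D
```

The factorization is an approximation, not an exact decomposition of the full 3D convolution, but it lets the spatial part inherit 2D CNN structure while the temporal part stays cheap.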

Compact Generalized Non-local Network

KaiyuYue/cgnl-network.pytorch NeurIPS 2018

The non-local module is designed for capturing long-range spatio-temporal dependencies in images and videos.
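
The basic non-local operation (before the paper's compact generalization) can be sketched as self-attention over flattened space-time positions. The dot-product affinity and the omission of learned projections are simplifying assumptions here.

```python
import numpy as np

def nonlocal_block(x):
    """Hypothetical sketch of a dot-product non-local operation: every
    space-time position attends to every other one. x has shape (P, C)
    with P = T*H*W flattened positions; real modules add learned
    embedding/projection layers around this core."""
    attn = x @ x.T                                   # (P, P) pairwise affinities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)    # softmax over positions
    return x + attn @ x                              # residual connection

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))   # 6 space-time positions, 4 channels
y = nonlocal_block(x)
```

The P×P affinity matrix is exactly the cost that motivates compact variants: it grows quadratically with the number of space-time positions.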

AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

tensorflow/models ICLR 2020

Learning to represent videos is a very challenging task both algorithmically and computationally.

MotionSqueeze: Neural Motion Feature Learning for Video Understanding

arunos728/MotionSqueeze ECCV 2020

As the frame-by-frame optical flows require heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding.

VideoMix: Rethinking Data Augmentation for Video Classification

jayChung0302/videomix 7 Dec 2020

Recent data augmentation strategies have been reported to address the overfitting problems in static image classifiers.
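
A VideoMix-style augmentation extends CutMix to video: paste a region from one clip into another and mix the labels by area. The sketch below assumes a spatial box applied across all frames and single-channel clips; the paper studies several cuboid variants.

```python
import numpy as np

def videomix(video_a, video_b, label_a, label_b, box):
    """Hypothetical VideoMix-style sketch: paste a spatial cuboid from
    video_b into video_a across all frames, and mix the (one-hot) labels
    by the area ratio. Videos have shape (T, H, W); box = (y0, y1, x0, x1)."""
    y0, y1, x0, x1 = box
    mixed = video_a.copy()
    mixed[:, y0:y1, x0:x1] = video_b[:, y0:y1, x0:x1]
    t, h, w = video_a.shape
    lam = 1.0 - (y1 - y0) * (x1 - x0) / (h * w)
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

a = np.zeros((2, 4, 4))
b = np.ones((2, 4, 4))
mixed, lab = videomix(a, b, np.array([1.0, 0.0]), np.array([0.0, 1.0]), (0, 2, 0, 2))
```

The soft label keeps the classifier's targets consistent with how much of each clip is actually visible, which is the mechanism that combats overfitting.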

Reinforcement Learning with Latent Flow

WendyShang/flare NeurIPS 2021

Temporal information is essential to learning effective policies with Reinforcement Learning (RL).

Busy-Quiet Video Disentangling for Video Classification

guoxih/Busy-Quiet-Video-Disentangling-for-Video-Classification 29 Mar 2021

We design a trainable Motion Band-Pass Module (MBPM) for separating busy information from quiet information in raw video data.
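
The busy/quiet split can be illustrated with fixed temporal filters: a frame difference acts as a high-pass that keeps fast-changing "busy" content, while a temporal mean acts as a low-pass that keeps static "quiet" content. This is only an analogy; the paper's MBPM is a trainable band-pass module, not these fixed filters.

```python
import numpy as np

def busy_quiet_split(video):
    """Hypothetical fixed-filter sketch of the busy/quiet idea:
    temporal difference ~ busy (motion) stream,
    temporal mean ~ quiet (static appearance) stream."""
    busy = np.diff(video, axis=0)   # (T-1, H, W) frame-to-frame changes
    quiet = video.mean(axis=0)      # (H, W) temporally averaged content
    return busy, quiet

# A clip whose brightness rises linearly frame to frame.
video = np.stack([np.full((2, 2), t, dtype=float) for t in range(4)])
busy, quiet = busy_quiet_split(video)
```

On this toy clip the busy stream is a constant rate of change and the quiet stream is the mid-level appearance, showing how the two carry complementary information.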

Out-of-Distribution Detection Using Union of 1-Dimensional Subspaces

zaeemzadeh/OOD CVPR 2021

In this paper, we argue that OOD samples can be detected more easily if the training data is embedded into a low-dimensional space, such that the embedded training samples lie on a union of 1-dimensional subspaces.
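
If each class occupies a 1-dimensional subspace (a line through the origin), a natural OOD score is the angle between an input and the nearest class line: near zero for in-distribution samples, large otherwise. The sketch below assumes unit class directions are given; the paper learns such an embedding rather than assuming it.

```python
import numpy as np

def ood_score(x, class_directions):
    """Hypothetical sketch: OOD score = smallest angle (radians) between
    input x and any class's 1-D subspace. class_directions holds unit
    vectors, one row per class; 0 means x lies on some class line."""
    x = x / np.linalg.norm(x)
    cos = np.abs(class_directions @ x)   # |cosine| handles both line directions
    return float(np.arccos(np.clip(cos.max(), -1.0, 1.0)))

dirs = np.array([[1.0, 0.0], [0.0, 1.0]])          # two class lines
in_dist = ood_score(np.array([2.0, 0.0]), dirs)    # lies on class 0's line
out_dist = ood_score(np.array([1.0, 1.0]), dirs)   # 45 degrees from both lines
```

Scale-invariance comes for free: any multiple of an in-distribution sample scores the same, since only the direction matters on a union of lines.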