Video Classification
172 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels but also best describes the entire video given the features and annotations of its individual frames. For example, a video might contain a tree in some frames, but the label central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task: typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame.
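The distinction between frame-level and video-level prediction above can be sketched with a minimal aggregation scheme: average per-frame class scores over time, then take the argmax. This is a hedged illustration only (the function name, class set, and mean pooling are assumptions, not a method from any listed paper; real systems may use attention, recurrent models, or 3D convolutions instead):

```python
import numpy as np

def classify_video(frame_logits: np.ndarray) -> int:
    """Aggregate per-frame logits of shape (T, C) into one video-level label.

    Mean pooling over the T frames is the simplest aggregation strategy;
    it lets frames that merely contain incidental objects (e.g., a tree)
    be outvoted by the context shared across the whole clip.
    """
    video_logits = frame_logits.mean(axis=0)  # average scores over time
    return int(video_logits.argmax())         # index of the predicted class

# Hypothetical example: 4 frames, 3 classes ("tree", "hiking", "dog")
logits = np.array([
    [2.0, 1.0, 0.1],   # one frame dominated by a tree
    [0.5, 3.0, 0.2],   # remaining frames suggest hiking
    [0.4, 2.5, 0.3],
    [0.6, 2.8, 0.1],
])
print(classify_video(logits))  # -> 1, the "hiking" class
```

Even though the first frame's strongest score is the "tree" class, the pooled video-level prediction is "hiking", matching the example in the description above.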
Libraries
Use these libraries to find Video Classification models and implementations.
Most implemented papers
Billion-scale semi-supervised learning for image classification
This paper presents a study of semi-supervised learning with large convolutional networks.
Reversible Vision Transformers
Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes.
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101.
Fine-grained Activity Recognition in Baseball Videos
In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.
Timeception for Complex Action Recognition
This paper focuses on the temporal aspect for recognizing human activities in videos; an important visual cue that has long been undervalued.
Gated Channel Transformation for Visual Recognition
This lightweight layer incorporates a simple l2 normalization, making our transformation unit applicable at the operator level without much increase in additional parameters.
A Multigrid Method for Efficiently Training Video Models
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Non-Local Neural Networks With Grouped Bilinear Attentional Transforms
The core of our method is a learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are threefold: first, BA-Transform is versatile enough to model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions.
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
Revisiting ResNets: Improved Training and Scaling Strategies
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x to 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.