Video Classification

172 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels but also best describes the entire video, given the features and annotations of its frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame in the video.

Source: Efficient Large Scale Video Classification
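
The difference between frame-level and video-level labels can be illustrated with a minimal sketch. It assumes per-frame class scores are already available and uses simple average pooling to get a video-level label; the function names, the score values, and the choice of average pooling are illustrative assumptions, not the method of the cited source.

```python
from typing import Dict, List

# Hypothetical per-frame classifier output: label -> score.
FrameScores = Dict[str, float]


def frame_level_labels(frames: List[FrameScores]) -> List[str]:
    """Pick the highest-scoring label independently for each frame."""
    return [max(scores, key=scores.get) for scores in frames]


def video_level_label(frames: List[FrameScores]) -> str:
    """Average each label's score over all frames and return the best label."""
    if not frames:
        raise ValueError("need at least one frame")
    totals: Dict[str, float] = {}
    for scores in frames:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    means = {label: total / len(frames) for label, total in totals.items()}
    return max(means, key=means.get)


# Made-up scores: a tree dominates the first frame, but the video is about hiking.
frames = [
    {"tree": 0.7, "hiking": 0.3},
    {"tree": 0.2, "hiking": 0.8},
    {"tree": 0.1, "hiking": 0.9},
]

print(frame_level_labels(frames))  # ['tree', 'hiking', 'hiking']
print(video_level_label(frames))   # hiking
```

Here the first frame is labeled "tree" on its own, while the pooled scores select "hiking" as the label that best describes the whole video.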

Benchmarks

These leaderboards track progress in Video Classification.

Dataset	Best Model
Breakfast	MA-LMM
COIN	MA-LMM
YouTube-8M	DCGN (self-attention graph pooling)
MoB	VTN
Hockey Fight Detection Dataset	CNN+LSTM
Kinetics	Multigrid
Charades	Multigrid
Something-Something V1	MSNet-R50En (ours)
Something-Something V2	MSNet-R50En (ours)
Multimodal PISA	MMDL
Home Action Genome	Cooperative Ours (3rd-person)

Libraries

Use these libraries to find Video Classification models and implementations

Library	Papers	Stars
open-mmlab/mmaction2	6	3,876
rwightman/pytorch-image-models	3	29,680
facebookresearch/detectron	2	26,140
open-mmlab/mmclassification	2	3,140

Showing 4 of 6 libraries.

Most implemented papers

Video Classification with Channel-Separated Convolutional Networks

facebookresearch/VMZ • ICCV 2019

It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.
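
To make the cost question concrete, the following sketch counts the weights of a single 3D convolution layer as the number of groups varies. It uses the standard grouped-convolution weight shape (c_out, c_in / groups, kt, kh, kw) and ignores biases; the helper name and the 64-channel, 3x3x3 configuration are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import Tuple


def conv3d_weight_count(
    c_in: int,
    c_out: int,
    kernel: Tuple[int, int, int] = (3, 3, 3),
    groups: int = 1,
) -> int:
    """Number of weights in a (grouped) 3D convolution, excluding bias.

    Each output channel only sees c_in / groups input channels, so the
    weight tensor has shape (c_out, c_in // groups, kt, kh, kw).
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("channels must be divisible by groups")
    kt, kh, kw = kernel
    return c_out * (c_in // groups) * kt * kh * kw


channels = 64  # illustrative layer width
dense = conv3d_weight_count(channels, channels, groups=1)
grouped = conv3d_weight_count(channels, channels, groups=4)
depthwise = conv3d_weight_count(channels, channels, groups=channels)

print(dense)      # 110592
print(grouped)    # 27648
print(depthwise)  # 1728
```

The multiply-accumulate count per output position scales the same way, so a layer with groups equal to the channel count (a depthwise convolution) is 64 times cheaper here than the dense layer, at the price of no channel interaction within that layer.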

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

facebookresearch/detectron2 • CVPR 2022

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.

UniFormer: Unifying Convolution and Self-attention for Visual Recognition

sense-x/uniformer • 24 Jan 2022

Different from typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity in shallow and deep layers respectively, allowing it to tackle both redundancy and dependency for efficient and effective representation learning.

YouTube-8M: A Large-Scale Video Classification Benchmark

google/youtube-8m • 27 Sep 2016

Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow.

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding • ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

Learnable pooling with Context Gating for video classification

antoine77340/Youtube-8M-WILLOW • 21 Jun 2017

In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.

Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

longxiang92/Flash-MNIST • CVPR 2018

In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets.

Representation Flow for Action Recognition

piergiaj/representation-flow-cvpr19 • CVPR 2019

Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

whwu95/text4vis • 4 Jul 2022

In this study, we focus on transferring knowledge for video classification tasks.

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

chihyaoma/Activity-Recognition-with-CNN-and-RNN • 30 Mar 2017

We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.
