Video Classification

172 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of producing a label that is relevant to the video given its frames. A good video-level classifier not only provides accurate frame labels but also best describes the entire video, given the features and annotations of its frames. For example, a video might contain a tree in some frame, but the label central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame of the video.

Source: Efficient Large Scale Video Classification
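
The distinction between frame-level and video-level labels in the description above can be illustrated with a minimal sketch. It assumes that a per-frame classifier has already produced a score for each label, and it uses simple mean pooling over frames to pick the video label. Mean pooling, the function names `frame_labels` and `video_label`, and the example scores are illustrative assumptions only, not the method of the cited source.

```python
from typing import Dict, List

# Hypothetical per-frame scores: one dict of label -> score per frame.
FrameScores = List[Dict[str, float]]


def frame_labels(frame_scores: FrameScores) -> List[str]:
    """Pick the highest-scoring label for each frame independently."""
    return [max(scores, key=scores.get) for scores in frame_scores]


def video_label(frame_scores: FrameScores) -> str:
    """Average each label's score over all frames and return the best label.

    Mean pooling is an assumed aggregation; a label missing from a frame
    counts as a score of 0 for that frame.
    """
    if not frame_scores:
        raise ValueError("need at least one frame")
    totals: Dict[str, float] = {}
    for scores in frame_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    num_frames = len(frame_scores)
    means = {label: total / num_frames for label, total in totals.items()}
    return max(means, key=means.get)


# Made-up scores for a three-frame clip: a tree dominates the second frame,
# but "hiking" describes the clip as a whole.
scores = [
    {"hiking": 0.6, "tree": 0.3, "river": 0.1},
    {"hiking": 0.3, "tree": 0.6, "river": 0.1},
    {"hiking": 0.7, "tree": 0.1, "river": 0.2},
]

print(frame_labels(scores))  # per-frame labels
print(video_label(scores))   # single global label
```

In this toy clip the second frame is labeled "tree" on its own, while the pooled video label is "hiking", mirroring the example in the description.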

Benchmarks

These leaderboards are used to track progress in Video Classification.

Dataset	Best Model
Breakfast	MA-LMM
COIN	MA-LMM
YouTube-8M	DCGN (self-attention graph pooling)
MoB	VTN
Hockey Fight Detection Dataset	CNN+LSTM
Kinetics	Multigrid
Charades	Multigrid
Something-Something V1	MSNet-R50En (ours)
Something-Something V2	MSNet-R50En (ours)
Multimodal PISA	MMDL
Home Action Genome	Cooperative Ours (3rd-person)

Libraries

Use these libraries to find Video Classification models and implementations.

open-mmlab/mmaction2 • 6 papers • 3,887

rwightman/pytorch-image-models • 3 papers • 29,713

facebookresearch/detectron • 2 papers • 26,137

open-mmlab/mmclassification • 2 papers • 3,153

See all 6 libraries.

Most implemented papers

Non-local Neural Networks

facebookresearch/video-nonlocal-net • CVPR 2018

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.

Group Normalization

ppwwyyxx/GroupNorm-reproduce • ECCV 2018

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

Video Swin Transformer

SwinTransformer/Video-Swin-Transformer • CVPR 2022

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.

Is Space-Time Attention All You Need for Video Understanding?

facebookresearch/TimeSformer • 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.

Learning Representations from EEG with Deep Recurrent-Convolutional Neural Networks

pbashivan/EEGLearn • 19 Nov 2015

One of the challenges in modeling cognitive events from electroencephalogram (EEG) data is finding representations that are invariant to inter- and intra-subject differences, as well as to inherent noise associated with such data.

Temporal Segment Networks for Action Recognition in Videos

yjxiong/temporal-segment-networks • 8 May 2017

Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?

kenshohara/3D-ResNets-PyTorch • 10 Apr 2020

Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.

X3D: Expanding Architectures for Efficient Video Recognition

facebookresearch/SlowFast • CVPR 2020

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.

ViViT: A Video Vision Transformer

google-research/scenic • ICCV 2021

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Two-Stream Convolutional Networks for Action Recognition in Videos

feichtenhofer/twostreamfusion • NeurIPS 2014

Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art.
