Video Classification

54 papers with code · Computer Vision
Subtask of Video

Video Classification is the task of producing a label that is relevant to the video given its frames. A good video-level classifier is one that not only provides accurate frame labels, but also best describes the entire video given the features and annotations of its frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels to each frame in the video.

Source: Efficient Large Scale Video Classification
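
The difference between frame-level labels and a video-level label described above can be illustrated with a small sketch. It assumes per-frame class scores are already available and uses simple mean pooling over frames to get a video-level label. This is one common baseline, not necessarily the method of the cited source. The class names, the scores, and the function names frame_labels and video_label are hypothetical, chosen to mirror the "tree" versus "hiking" example.

```python
def frame_labels(frame_scores, class_names):
    """Return the highest-scoring class name for each frame."""
    labels = []
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best])
    return labels


def video_label(frame_scores, class_names):
    """Mean-pool per-class scores over frames and return the top class.

    Assumption: averaging frame scores is used as the aggregation rule;
    other poolings (max, learned pooling) are equally possible.
    """
    num_frames = len(frame_scores)
    mean_scores = [
        sum(scores[c] for scores in frame_scores) / num_frames
        for c in range(len(class_names))
    ]
    best = max(range(len(mean_scores)), key=mean_scores.__getitem__)
    return class_names[best], mean_scores


# Hypothetical classes and made-up per-frame scores (one row per frame).
CLASS_NAMES = ["tree", "hiking"]
SCORES = [
    [0.9, 0.6],  # a frame dominated by a tree
    [0.2, 0.8],
    [0.1, 0.7],
]

per_frame = frame_labels(SCORES, CLASS_NAMES)
label, mean_scores = video_label(SCORES, CLASS_NAMES)
print(per_frame)
print(label, mean_scores)
```

Here one frame is labeled "tree" on its own, while the pooled scores favor "hiking" for the video as a whole, which is the situation the description refers to.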

Greatest papers with code

Group Normalization

ECCV 2018 facebookresearch/detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

OBJECT DETECTION VIDEO CLASSIFICATION

Non-local Neural Networks

CVPR 2018 facebookresearch/detectron

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.

Ranked #7 on Keypoint Detection on COCO (Validation AP metric)

INSTANCE SEGMENTATION KEYPOINT DETECTION OBJECT DETECTION VIDEO CLASSIFICATION

X3D: Expanding Architectures for Efficient Video Recognition

CVPR 2020 facebookresearch/SlowFast

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.

FEATURE SELECTION IMAGE CLASSIFICATION VIDEO CLASSIFICATION VIDEO RECOGNITION

A Multigrid Method for Efficiently Training Video Models

CVPR 2020 facebookresearch/SlowFast

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).

ACTION DETECTION ACTION RECOGNITION VIDEO UNDERSTANDING

Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?

10 Apr 2020 kenshohara/3D-ResNets-PyTorch

Therefore, in the present paper, we conduct an exploration study in order to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.

VIDEO CLASSIFICATION VIDEO RECOGNITION

YouTube-8M: A Large-Scale Video Classification Benchmark

27 Sep 2016 google/youtube-8m

Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow.

Ranked #1 on Action Recognition on ActivityNet (using extra training data)

ACTION RECOGNITION

Temporal Segment Networks for Action Recognition in Videos

8 May 2017 open-mmlab/mmaction

Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

Ranked #10 on Action Classification on Moments in Time (Top 5 Accuracy metric)

ACTION CLASSIFICATION ACTION RECOGNITION ACTION RECOGNITION IN VIDEOS

Video Classification with Channel-Separated Convolutional Networks

ICCV 2019 facebookresearch/R2Plus1D

It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks.

ACTION CLASSIFICATION ACTION RECOGNITION IMAGE CLASSIFICATION

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

30 Mar 2017 jeffreyhuang1/two-stream-action-recognition

We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets, applied to spatiotemporal feature matrices, are able to exploit spatiotemporal dynamics to improve the overall performance.

ACTION CLASSIFICATION ACTION RECOGNITION VIDEO UNDERSTANDING

Learnable pooling with Context Gating for video classification

21 Jun 2017 antoine77340/Youtube-8M-WILLOW

In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.

VIDEO CLASSIFICATION VIDEO UNDERSTANDING