Video Classification
178 papers with code • 11 benchmarks • 17 datasets
Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels, but also best describes the entire video given the features and annotations of the various frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., "hiking"). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video, and assigning one or more labels to each frame inside the video.
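As a minimal illustration of the frame-vs-video distinction above, one simple baseline is late fusion: average per-frame class probabilities over time and take the argmax as the video-level label. The function and class names below are hypothetical, and real systems use far richer temporal models; this is only a sketch of the aggregation idea.

```python
import numpy as np

def video_label(frame_probs, class_names):
    """Aggregate per-frame class probabilities into a single video-level
    label by averaging over frames (simple late fusion)."""
    avg = np.asarray(frame_probs).mean(axis=0)  # mean probability per class
    return class_names[int(np.argmax(avg))]

# Toy example: 3 frames scored over 2 classes ("tree", "hiking").
# One frame is dominated by a tree, but the video as a whole is hiking.
frames = [
    [0.6, 0.4],
    [0.2, 0.8],
    [0.1, 0.9],
]
print(video_label(frames, ["tree", "hiking"]))  # -> hiking
```

Averaging is order-invariant, which is exactly why stronger models (recurrent, attention-based, or 3D-convolutional) are used when short- and long-term temporal structure matters.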
Libraries
Use these libraries to find Video Classification models and implementations
Datasets
Subtasks
Most implemented papers
TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification
Recently, few-shot learning has received increasing interest.
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance.
A Dataset for Medical Instructional Video Classification and Question Answering
This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions.
Temporal and cross-modal attention for audio-visual zero-shot learning
We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
TAD: A Large-Scale Benchmark for Traffic Accidents Detection from Video Surveillance
After integration and annotation by various dimensions, a large-scale traffic accidents dataset named TAD is proposed in this work.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Large-Scale Video Classification with Convolutional Neural Networks
We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3%, up from 43.9%).
Beyond Short Snippets: Deep Networks for Video Classification
Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval.
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos.
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding
In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize.