Video Classification

172 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of assigning a label to a video based on its frames. A good video-level classifier not only predicts accurate frame labels but also best describes the entire video, given the features and annotations of its individual frames. For example, a video might contain a tree in some frame, but the label that is central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame in the video.

Source: Efficient Large Scale Video Classification
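The difference between frame-level and video-level labels described above can be illustrated with a minimal sketch. It assumes per-frame class scores are already available and pools them by simple averaging; the pooling rule, the function names, and the example labels and scores are all illustrative assumptions, not the method of the cited source.

```python
from typing import Dict, List

# Hypothetical representation: one {label: score} mapping per frame.
FrameScores = Dict[str, float]


def frame_labels(frames: List[FrameScores]) -> List[str]:
    """Return the highest-scoring label for each frame independently."""
    return [max(scores, key=scores.get) for scores in frames]


def video_label(frames: List[FrameScores]) -> str:
    """Average each label's score over all frames and return the best label.

    Mean pooling is an assumption made for this sketch; other aggregation
    schemes (max pooling, learned temporal models) are equally possible.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    totals: Dict[str, float] = {}
    for scores in frames:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    means = {label: total / len(frames) for label, total in totals.items()}
    return max(means, key=means.get)


# Made-up scores: the middle frame is dominated by a tree,
# but the video as a whole is about hiking.
frames = [
    {"hiking": 0.6, "tree": 0.3, "camping": 0.1},
    {"hiking": 0.3, "tree": 0.6, "camping": 0.1},
    {"hiking": 0.7, "tree": 0.2, "camping": 0.1},
]

print(frame_labels(frames))  # ['hiking', 'tree', 'hiking']
print(video_label(frames))   # hiking
```

In this toy input the second frame is labeled "tree" on its own, while the pooled video-level label remains "hiking", mirroring the example in the task description.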

Benchmarks

These leaderboards track progress in Video Classification.

Dataset	Best Model
Breakfast	MA-LMM
COIN	MA-LMM
YouTube-8M	DCGN (self-attention graph pooling)
MoB	VTN
Hockey Fight Detection Dataset	CNN+LSTM
Kinetics	Multigrid
Charades	Multigrid
Something-Something V1	MSNet-R50En (ours)
Something-Something V2	MSNet-R50En (ours)
Multimodal PISA	MMDL
Home Action Genome	Cooperative Ours (3rd-person)

Libraries

Use these libraries to find Video Classification models and implementations

open-mmlab/mmaction2 • 6 papers • 3,898 stars
rwightman/pytorch-image-models • 3 papers • 29,789 stars
facebookresearch/detectron • 2 papers • 26,145 stars
open-mmlab/mmclassification • 2 papers • 3,163 stars

See all 6 libraries.

Latest papers

Text-to-feature diffusion for audio-visual few-shot learning

explainableml/avdiff-gfsl • 7 Sep 2023

Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process.

Identifying Misinformation on YouTube through Transcript Contextual Analysis with Transformer Models

christoschr97/misinf-detection-llms • 22 Jul 2023

We apply the trained models to three datasets: (a) YouTube Vaccine-misinformation related videos, (b) YouTube Pseudoscience videos, and (c) Fake-News dataset (a collection of articles).

MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis

jeunyuli/muaf • 15 Jul 2023

In addition, we design an object-level temporal aggregation (OTA) module that can automatically filter low-quality features and efficiently integrate temporal information from multiple frames to improve the accuracy of tumor diagnosis.

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

pku-yuangroup/open-sora-plan • 12 Jul 2023

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged.

10,134 stars

Learning Unseen Modality Interaction

gerasmark/Reproducing-Unseen-Modality-Interaction • NeurIPS 2023

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.

22 Jun 2023

Inflated 3D Convolution-Transformer for Weakly-supervised Carotid Stenosis Grading with Ultrasound Videos

xinruizhou0106/csg-3dct_supp • 5 Jun 2023

First, to avoid the requirement of laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG.

Malicious or Benign? Towards Effective Content Moderation for Children's Videos

syedhammadahmed/mob • 24 May 2023

Online video platforms receive hundreds of hours of uploads every minute, making manual content moderation impossible.

HateMM: A Multi-Modal Dataset for Hate Video Classification

hate-alert/hatemm • 6 May 2023

Hate speech has become one of the most significant issues in modern society, having implications in both the online and the offline world.

Verbs in Action: Improving verb understanding in video-language models

google-research/scenic • ICCV 2023

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.

2,999 stars

13 Apr 2023

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

showlab/sparseformer • 7 Apr 2023

In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human's sparse visual recognition in an end-to-end manner.
