Action Recognition In Videos

64 papers with code • 17 benchmarks • 17 datasets

Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
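
As a rough illustration of mapping a clip to a predefined class set, the toy sketch below reduces a clip to one appearance cue and one temporal cue, then picks the nearest class prototype. It is not taken from any paper listed on this page. The function names, the two hand-made features, and the prototype values are all assumptions chosen for illustration; real systems learn these representations from data.

```python
from typing import Dict, List, Sequence, Tuple

# One grayscale frame: H rows x W pixel intensities in [0, 1].
Frame = List[List[float]]


def mean_intensity(clip: Sequence[Frame]) -> float:
    """Appearance cue: average pixel intensity over all frames."""
    pixels = [p for frame in clip for row in frame for p in row]
    return sum(pixels) / len(pixels)


def motion_energy(clip: Sequence[Frame]) -> float:
    """Temporal cue: mean absolute change between consecutive frames."""
    diffs = [
        abs(c - p)
        for prev, curr in zip(clip, clip[1:])
        for row_p, row_c in zip(prev, curr)
        for p, c in zip(row_p, row_c)
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def classify_clip(
    clip: Sequence[Frame], prototypes: Dict[str, Tuple[float, float]]
) -> str:
    """Return the class whose prototype is nearest to the clip's feature."""
    feature = (mean_intensity(clip), motion_energy(clip))

    def squared_distance(name: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(feature, prototypes[name]))

    return min(prototypes, key=squared_distance)


# Hypothetical (appearance, motion) prototypes; the numbers are made up.
PROTOTYPES = {
    "running": (0.5, 0.4),
    "jumping": (0.5, 0.1),
    "swimming": (0.2, 0.2),
}

# A 4-frame, 2x2 clip whose brightness alternates between 0.3 and 0.7.
flicker = [[[v, v], [v, v]] for v in (0.3, 0.7, 0.3, 0.7)]
print(classify_clip(flicker, PROTOTYPES))  # running
```

The hand-written features only stand in for the spatiotemporal representation; the methods tracked in the leaderboards on this page learn that representation with deep networks.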

Benchmarks

These leaderboards are used to track progress in Action Recognition In Videos

Dataset	Best Model
Jester (Gesture Recognition)	CPNet Res34, 5 CP
UCF101	STM (ImageNet+Kinetics pretrain)
Something-Something V2	CAST-B/16
Something-Something V1	STM (16 frames, ImageNet pretraining)
Kinetics-400	CAST-B/16
PKU-MMD	MMNet
Sports-1M	G-Blend
FS-Something-Something V2-Small	ITANet
FS-Something-Something V2-Full	ITANet
THUMOS’14	Single-stream R-C3D (two-way buffer)
AVA v2.2	YOWO+LFB*
HMDB-51	STM (ImageNet+Kinetics pretrain)
AVA v2.1	YOWO+LFB*
Kinetics-600	Florence
ActivityNet	LSTM + Pretrained on YT-8M
NTU RGB+D	2D-3D-Softargmax (RGB only)
miniSports	G-Blend

Libraries

Use these libraries to find Action Recognition In Videos models and implementations

Library	Papers	Stars
open-mmlab/mmaction2	4	3,908
yjxiong/caffe	3	550
towhee-io/towhee	2	3,001
MichiganCOG/M-PACT	2	106

See all 5 libraries.

Subtasks

Action Anticipation

Most implemented papers

Busy-Quiet Video Disentangling for Video Classification

guoxih/Busy-Quiet-Video-Disentangling-for-Video-Classification • 29 Mar 2021

We design a trainable Motion Band-Pass Module (MBPM) for separating busy information from quiet information in raw video data.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

google-research/google-research • NeurIPS 2021

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

ActionCLIP: A New Paradigm for Video Action Recognition

sallymmx/actionclip • 17 Sep 2021

To handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune".

Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer

leonlha/video-action-recognition-collaborative-learning-with-dynamics-via-pso-convnet-transformer • 17 Feb 2023

To extend our approach to video, we integrate ConvNets with state-of-the-art temporal methods such as Transformer and Recurrent Neural Networks.

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

facebookresearch/hiera • 1 Jun 2023

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

rana2149/actnetformer • 9 Apr 2024

Our framework leverages both labeled and unlabelled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples.

Convolutional Two-Stream Network Fusion for Video Action Recognition

feichtenhofer/twostreamfusion • CVPR 2016

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information.

Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

piergiaj/latent-subevents • 26 May 2016

In this paper, we newly introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos.
