Action Recognition In Videos
64 papers with code • 17 benchmarks • 17 datasets
Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
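The task described above can be sketched as a toy pipeline: pool per-frame features over time, then map the pooled clip feature to an action class. This is a minimal illustrative sketch, not any particular model; the feature dimension, random weights `W`, and the `classify` helper are all hypothetical.

```python
import numpy as np

# Toy video: 16 frames of 8-dim per-frame features. In practice these would
# come from a CNN or transformer backbone; shapes here are illustrative.
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(16, 8))   # (T, D)

ACTIONS = ["running", "jumping", "swimming"]

# Hypothetical classifier weights mapping pooled features to action logits.
W = rng.normal(size=(8, len(ACTIONS)))

def classify(frames: np.ndarray) -> str:
    """Temporal average pooling followed by a linear classifier."""
    clip_feature = frames.mean(axis=0)       # pool over time: (D,)
    logits = clip_feature @ W                # (num_classes,)
    return ACTIONS[int(np.argmax(logits))]

print(classify(frame_features))
```

Real systems replace the average pooling with learned spatiotemporal modeling (3D convolutions, space-time attention), which is exactly what the papers listed below vary.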
Latest papers
ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
Our framework leverages both labeled and unlabeled data, combining pseudo-labeling with contrastive learning to robustly learn action representations from both types of samples.
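The pseudo-labeling idea mentioned above can be sketched as confidence thresholding: unlabeled clips whose predicted class probability clears a threshold are treated as labeled. This is a generic sketch, not ActNetFormer's actual training loop; the `pseudo_labels` function and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np

def pseudo_labels(probs: np.ndarray, threshold: float = 0.9):
    """Keep unlabeled samples whose max class probability clears threshold.

    probs: (N, C) softmax outputs for N unlabeled clips over C action classes.
    Returns (indices, labels) for the confidently predicted clips.
    """
    confidence = probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

probs = np.array([
    [0.95, 0.03, 0.02],   # confident -> pseudo-labeled as class 0
    [0.40, 0.35, 0.25],   # uncertain -> discarded
    [0.05, 0.92, 0.03],   # confident -> pseudo-labeled as class 1
])
idx, labels = pseudo_labels(probs)
print(idx, labels)   # -> [0 2] [0 1]
```

The kept `(idx, labels)` pairs would then feed a supervised loss alongside the genuinely labeled data.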
HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition
Action recognition in videos poses a challenge due to its high computational cost, especially for Joint Space-Time video transformers (Joint VT).
CAST: Cross-Attention in Space and Time for Video Action Recognition
In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input.
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors.
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Dual-path Adaptation from Image to Video Transformers
In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer
To extend our approach to video, we integrate ConvNets with state-of-the-art temporal methods such as Transformer and Recurrent Neural Networks.
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
We present a simple approach that turns a ViT encoder into an efficient video model, working seamlessly with both image and video inputs.
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
We show that it is possible to use a multi-modal model to tackle a task that it was not designed for.