Action Recognition

878 papers with code • 49 benchmarks • 105 datasets

Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to assign the actions being performed in the video or image to a predefined set of action classes.
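As a rough illustration of classifying a clip into a predefined set of classes, the sketch below averages per-frame class scores and picks the highest one. The label set, the scores, and the function name `classify_clip` are made up for this example; the score averaging is one simple aggregation choice, not a method prescribed by any particular benchmark.

```python
from typing import List, Sequence

# Hypothetical label set; real benchmarks define their own classes.
ACTION_CLASSES: List[str] = ["walking", "running", "jumping"]


def classify_clip(frame_scores: Sequence[Sequence[float]],
                  classes: Sequence[str] = ACTION_CLASSES) -> str:
    """Average per-frame class scores and return the top-scoring class."""
    if not frame_scores:
        raise ValueError("frame_scores must contain at least one frame")
    num_classes = len(classes)
    totals = [0.0] * num_classes
    for scores in frame_scores:
        if len(scores) != num_classes:
            raise ValueError("each frame needs one score per class")
        for i, score in enumerate(scores):
            totals[i] += score
    means = [total / len(frame_scores) for total in totals]
    # Index of the class with the highest mean score.
    best = max(range(num_classes), key=means.__getitem__)
    return classes[best]


# Made-up scores for a three-frame clip (one row per frame).
example_scores = [
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
]
predicted = classify_clip(example_scores)
```

In practice the per-frame (or per-clip) scores would come from a trained network; only the final aggregation and class lookup are shown here.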

In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, on the order of 10k videos.

Please note that some benchmarks may be listed under the Action Classification or Video Classification tasks, e.g. Kinetics-400.

Benchmarks


These leaderboards are used to track progress in Action Recognition.

Dataset	Best Model	Compare
Something-Something V2	InternVideo2-6B	See all
UCF101	VideoMAE V2-g	See all
HMDB-51	VideoMAE V2-g	See all
Something-Something V1	InternVideo	See all
AVA v2.2	LART (Hiera-H, K700 PT+FT)	See all
EPIC-KITCHENS-100	Avion (ViT-L)	See all
NTU RGB+D	PoseC3D (RGB + Pose)	See all
NTU RGB+D 120	PoseC3D (RGB + Pose)	See all
Diving-48	Video-FocalNet-B	See all
ActivityNet	Text4Vis (w/ ViT-L)	See all
AVA v2.1	STAR/L	See all
THUMOS’14	BMN	See all
Sports-1M	ip-CSN-152 (RGB)	See all
HACS	InternVideo2-6B	See all
Charades-Ego	LaViLa (Finetuned, TimeSformer-L)	See all
HAA500	TSN	See all
BAR	DebiAN	See all
UAV-Human	PMI Sampler	See all
Volleyball	PoseC3D (Pose Only)	See all
Real Life Violence Situations Dataset	DeVTr	See all
RareAct	🦩 Flamingo	See all
Jester (Gesture Recognition)	DirecFormer	See all
miniSports	IF+MD+RGB-R (ResNet-18)	See all
IRD	OHA-GCN (Two stream; HP + OHP-hands + informative samples)	See all
ICVL-4	OHA-GCN (Two stream; HP + OHP-hands + informative samples)	See all
UCF-101	DMC-Net (ResNet-18)	See all
Mimetics	JMRN	See all
Drone-Action	FAR	See all
Okutama-Action	PLAR with bbox (Ours)	See all
Animal Kingdom	MSQNet	See all
Charades	MSQNet	See all
VIRAT Ground 2.0	DHCM	See all
ActionNet-VE	Baseline	See all
UTD-MHAD	Action Machine (RGB only)	See all
EgoGesture	TSM+W3	See all
EPIC-KITCHENS-55	TSM+W3 - full res	See all
HMDB51	MSQNet	See all
MECCANO	SlowFast	See all
Win-Fail Action Understanding	2DCNN+TRN	See all
MTL-AQA	C3D-AVG	See all
UCF 101	R2+1D-BERT	See all
Penn Action	STAR-Transformer (RGB + Pose)	See all
Skeleton-Mimetics	Structured Keypoint Pooling	See all
RoCoG-v2	AZTR (Ours)	See all
NEC Drone	FAR	See all
UAV Human	FAR	See all
THUMOS14	MSQNet	See all
Hockey	MSQNet	See all
N-UCLA	DVANet	See all


Libraries

Use these libraries to find Action Recognition models and implementations

Library	Papers	Stars
open-mmlab/mmaction2	20	3,887
towhee-io/towhee	10	2,986
yjxiong/caffe	4	550
rwightman/pytorch-image-models	3	29,713


Datasets

Subtasks

Few Shot Action Recognition

Fine-grained Action Recognition

Action Triplet Recognition

Open Set Action Recognition

Micro-Action Recognition

Weakly-Supervised Action Recognition

Atomic action recognition

Animal Action Recognition

Transportation Mode Detection

Open Vocabulary Action Recognition

Action Recognition In Still Images

Most implemented papers


Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

yjxiong/temporal-segment-networks • 2 Aug 2016

The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network.


SlowFast Networks for Video Recognition

facebookresearch/SlowFast • ICCV 2019

We present SlowFast networks for video recognition.


BMN: Boundary-Matching Network for Temporal Action Proposal Generation

PaddlePaddle/models • ICCV 2019

To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map.
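The idea of treating each proposal as a (start, end) boundary pair and collecting all pairs in one confidence map can be sketched as follows. This is a toy illustration under stated assumptions: the map is indexed by duration and start step, and the score is simply the product of made-up start and end boundary probabilities, which stands in for the learned confidence the paper evaluates. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def boundary_matching_map(start_prob: Sequence[float],
                          end_prob: Sequence[float],
                          max_duration: int) -> List[List[Optional[float]]]:
    """Build a (duration x start) map of proposal scores.

    Entry [d - 1][t] scores the proposal that starts at step t and ends at
    step t + d. Proposals that run past the last step are left as None.
    The product score is a placeholder, not a learned confidence.
    """
    if len(start_prob) != len(end_prob):
        raise ValueError("start_prob and end_prob must have the same length")
    num_steps = len(start_prob)
    bm_map: List[List[Optional[float]]] = [
        [None] * num_steps for _ in range(max_duration)
    ]
    for duration in range(1, max_duration + 1):
        for start in range(num_steps):
            end = start + duration
            if end < num_steps:
                bm_map[duration - 1][start] = start_prob[start] * end_prob[end]
    return bm_map


def best_proposal(bm_map: List[List[Optional[float]]]) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring valid proposal."""
    best: Optional[Tuple[int, int, float]] = None
    for d_index, row in enumerate(bm_map):
        for start, score in enumerate(row):
            if score is not None and (best is None or score > best[2]):
                best = (start, start + d_index + 1, score)
    if best is None:
        raise ValueError("map contains no valid proposal")
    return best


# Made-up boundary probabilities for a four-step sequence.
start_prob = [0.9, 0.1, 0.2, 0.1]
end_prob = [0.1, 0.2, 0.8, 0.3]
bm = boundary_matching_map(start_prob, end_prob, max_duration=3)
top = best_proposal(bm)
```

Laying proposals out on a dense grid means every start and duration combination gets a score in one pass, rather than scoring candidate proposals one at a time.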


Video Swin Transformer

SwinTransformer/Video-Swin-Transformer • CVPR 2022

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.


TSM: Temporal Shift Module for Efficient Video Understanding

MIT-HAN-LAB/temporal-shift-module • ICCV 2019

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.


Is Space-Time Attention All You Need for Video Understanding?

facebookresearch/TimeSformer • 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.


Temporal Segment Networks for Action Recognition in Videos

yjxiong/temporal-segment-networks • 8 May 2017

Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.


Unsupervised Learning of Video Representations using LSTMs

emansim/unsupervised-videos • 16 Feb 2015

We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.


Graph-Based Global Reasoning Networks

facebookresearch/GloRe • CVPR 2019

In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.


AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

tensorflow/models • CVPR 2018

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

