Action Recognition In Videos
64 papers with code • 17 benchmarks • 17 datasets
Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
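The input/output contract common to these methods can be sketched minimally: a clip is a stack of per-frame features, pooled over time and scored against the class set. All names, shapes, and the average-pool-plus-linear-head design below are illustrative, not taken from any listed paper.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTIONS = ["running", "jumping", "swimming"]

def classify_clip(clip, weights, bias):
    """Score a video clip against a fixed set of action classes.

    clip:    (T, D) array -- one D-dim feature vector per frame
    weights: (D, C) array -- linear classifier over C action classes
    """
    pooled = clip.mean(axis=0)          # average-pool over time
    logits = pooled @ weights + bias    # one score per action class
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                # softmax over classes
    return ACTIONS[int(probs.argmax())], probs

clip = rng.normal(size=(16, 8))         # 16 frames, 8-dim features each
W = rng.normal(size=(8, len(ACTIONS)))
b = np.zeros(len(ACTIONS))
label, probs = classify_clip(clip, W, b)
```

Real systems replace the average pool with the spatiotemporal models below (3D CNNs, RNNs, attention), but the clip-in, class-distribution-out shape of the problem is the same.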
Libraries
Use these libraries to find Action Recognition In Videos models and implementations.
Most implemented papers
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
We address the problem of activity detection in continuous, untrimmed video streams.
What Makes Training Multi-Modal Classification Networks Hard?
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.
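The paper's analysis and remedy are in the paper itself; what follows is only a generic late-fusion sketch of the two-branch setup being compared, with illustrative names and shapes. Zeroing one branch recovers the single-modal baseline, which is the comparison the abstract describes.

```python
import numpy as np

def late_fusion_logits(rgb_feat, audio_feat, w_rgb, w_audio):
    """Late fusion: each modality gets its own linear head and the
    per-class logits are summed. Dropping one branch (zero features)
    reduces the network to its single-modal counterpart."""
    return rgb_feat @ w_rgb + audio_feat @ w_audio

rng = np.random.default_rng(1)
rgb, audio = rng.normal(size=4), rng.normal(size=6)
w_r, w_a = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))

fused = late_fusion_logits(rgb, audio, w_r, w_a)
rgb_only = late_fusion_logits(rgb, np.zeros(6), w_r, w_a)
```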
Gating Revisited: Deep Multi-layer RNNs That Can Be Trained
We propose a new STAckable Recurrent cell (STAR) for recurrent neural networks (RNNs), which has fewer parameters than the widely used LSTM and GRU cells while being more robust against vanishing or exploding gradients.
Action Recognition using Visual Attention
We propose a soft attention based model for the task of action recognition in videos.
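The core operation in such models is soft attention over the spatial locations of each frame's feature map: a query (e.g. the recurrent hidden state) scores every location, and the frame is summarized as the softmax-weighted average. A minimal sketch with illustrative shapes (a 7×7 feature grid):

```python
import numpy as np

def soft_attention_pool(feature_map, query):
    """Weighted average of spatial features.

    feature_map: (K, D) -- K spatial locations, D-dim features
    query:       (D,)   -- e.g. the RNN's previous hidden state
    Returns the attended feature and the attention weights.
    """
    scores = feature_map @ query             # one score per location
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over locations
    return weights @ feature_map, weights

rng = np.random.default_rng(2)
fmap = rng.normal(size=(49, 16))             # 7x7 grid, 16-dim features
h = rng.normal(size=16)
attended, attn = soft_attention_pool(fmap, h)
```

Because the weights are a distribution over locations, they can be visualized as a heatmap showing where the model looks in each frame.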
2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning
Action recognition and human pose estimation are closely related, yet the two problems are generally handled as distinct tasks in the literature.
Resource Efficient 3D Convolutional Neural Networks
Recently, convolutional neural networks with 3D kernels (3D CNNs) have become very popular in the computer vision community, owing to their superior ability, compared to 2D CNNs, to extract spatio-temporal features from video frames.
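What distinguishes a 3D kernel is that it also spans a window of frames, so each output value mixes information across time as well as space. A naive "valid" 3D convolution makes this concrete (pure numpy for clarity; real networks use optimized library kernels):

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over (time, height, width).

    Unlike a 2D kernel applied per frame, the kernel spans kt frames,
    so every output value aggregates a spatio-temporal neighborhood.
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t+kt, y:y+kh, x:x+kw] * kernel)
    return out

video = np.arange(4 * 5 * 5, dtype=float).reshape(4, 5, 5)  # 4 frames, 5x5
k = np.ones((3, 3, 3)) / 27.0        # spatio-temporal averaging kernel
feat = conv3d_valid(video, k)        # output spans 2 time steps, 3x3 space
```

The resource cost the paper targets is visible here: a kt×kh×kw kernel has kt times the parameters and multiply-adds of its 2D counterpart, which is why efficient 3D architectures matter.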
Learning Video Representations from Correspondence Proposals
In particular, the proposed network can effectively learn representations for videos by mixing appearance and long-range motion with an RGB-only input.
IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos
Most current action recognition methods heavily rely on appearance information by taking an RGB sequence of entire image regions as input.
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
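The IIC-specific positive/negative construction is described in the paper; the underlying contrastive objective, however, follows the standard InfoNCE form: pull the embedding of a positive clip view toward the anchor, push negatives away. A self-contained sketch (illustrative names, cosine similarity, temperature 0.1):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic contrastive (InfoNCE) loss over clip embeddings."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # First logit is the positive pair; the rest are negatives.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()               # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(3)
z = rng.normal(size=8)

# Easy case: the positive matches the anchor, negatives are random.
loss_easy = info_nce(z, z, [rng.normal(size=8) for _ in range(4)])
# Hard case: the positive is opposite, negatives match the anchor.
loss_hard = info_nce(z, -z, [z.copy() for _ in range(4)])
```

Training the spatio-temporal encoder to minimize this loss yields video representations without manual labels, which is the self-supervised setting the abstract describes.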