Action Classification
225 papers with code • 23 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Most implemented papers
Is Space-Time Attention All You Need for Video Understanding?
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
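The core idea can be sketched in a few lines: tokens are attended over the temporal axis and then over the spatial axis ("divided" space-time attention). This is a minimal numpy illustration with made-up shapes and helper names, not the paper's actual implementation (which adds projections, heads, and residual connections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over a set of tokens: (n, d) -> (n, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time_attention(x):
    # x: (T, S, D) = frames x spatial patches x embedding dim.
    # Temporal step: each spatial location attends across time.
    # Spatial step: each frame then attends across its own patches.
    T, S, D = x.shape
    t_out = np.stack([attention(x[:, s], x[:, s], x[:, s]) for s in range(S)], axis=1)
    s_out = np.stack([attention(t_out[t], t_out[t], t_out[t]) for t in range(T)], axis=0)
    return s_out

x = np.random.randn(4, 16, 8)  # 4 frames, 16 patches, 8-dim embeddings
y = divided_space_time_attention(x)
print(y.shape)  # (4, 16, 8)
```

Factoring attention this way keeps cost linear in T + S per token rather than T * S, which is the main efficiency argument for divided attention.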
The Kinetics Human Action Video Dataset
We describe the DeepMind Kinetics human action video dataset.
Temporal Segment Networks for Action Recognition in Videos
Based on temporal segment networks, the authors won the video classification track of the ActivityNet Challenge 2016 among 24 teams, demonstrating the effectiveness of TSN and the proposed good practices.
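TSN's key mechanism is sparse temporal sampling with a segmental consensus: split the video into a few equal segments, score one snippet from each, and aggregate. A small sketch under assumed names (the real pipeline scores snippets with a CNN; here the per-snippet scores are hypothetical numbers):

```python
import numpy as np

def sample_segments(num_frames, k=3, rng=None):
    # TSN-style sparse sampling: divide the video into k equal segments
    # and draw one frame index from each (randomly during training).
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(0, num_frames, k + 1).astype(int)
    return [int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
            for i in range(k)]

def segmental_consensus(snippet_scores):
    # Average per-snippet class scores into a video-level prediction.
    return np.mean(snippet_scores, axis=0)

idx = sample_segments(300, k=3)                       # one frame per segment
scores = np.array([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])  # hypothetical snippet scores
print(segmental_consensus(scores))  # [0.3 0.7]
```

Because only k snippets are processed regardless of video length, training cost stays constant while still covering the whole duration.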
Graph-Based Global Reasoning Networks
In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.
X3D: Expanding Architectures for Efficient Video Recognition
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth.
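The expansion procedure is essentially a greedy coordinate search: at each step, try enlarging one axis at a time and keep the change with the best accuracy-per-compute trade-off. A toy sketch with a stub score function (the axis names, factor, and scoring are illustrative only, not X3D's actual recipe):

```python
def flops(cfg):
    # Rough compute proxy: cost grows with every expanded axis.
    return cfg["frames"] * cfg["res"] ** 2 * cfg["width"] * cfg["depth"]

def expand_once(cfg, score_fn, factor=2.0):
    # Try expanding each axis by `factor`; keep the single best trade-off.
    best = None
    for axis in cfg:
        trial = dict(cfg)
        trial[axis] = int(cfg[axis] * factor)
        gain = score_fn(trial) / flops(trial)  # accuracy per compute (stub)
        if best is None or gain > best[0]:
            best = (gain, trial)
    return best[1]

cfg = {"frames": 1, "res": 112, "width": 24, "depth": 10}  # tiny 2D base
cfg = expand_once(cfg, score_fn=lambda c: sum(c.values()))  # toy score
print(cfg)
```

In the paper the score comes from actually training and evaluating each candidate; only one axis is expanded per step, which is what keeps the resulting family progressively larger but efficient.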
ViViT: A Video Vision Transformer
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Two-Stream Convolutional Networks for Action Recognition in Videos
Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art.
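The two-stream design scores a clip twice, once from RGB frames (spatial stream) and once from stacked optical flow (temporal stream), then fuses the class scores late. A minimal sketch with hypothetical scores; the fusion weight is an assumption, not a value from the paper:

```python
import numpy as np

def late_fusion(rgb_scores, flow_scores, w_rgb=0.5):
    # Two-stream late fusion: weighted average of the spatial (RGB)
    # and temporal (optical-flow) stream class scores.
    return w_rgb * rgb_scores + (1 - w_rgb) * flow_scores

rgb = np.array([0.6, 0.3, 0.1])   # hypothetical spatial-stream softmax
flow = np.array([0.2, 0.7, 0.1])  # hypothetical temporal-stream softmax
fused = late_fusion(rgb, flow)
print(fused.argmax())  # class 1
```

Fusing at the score level lets each stream be trained independently, which was important when flow and RGB inputs had very different statistics.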
Video Classification with Channel-Separated Convolutional Networks
It is natural to ask: 1) whether group convolution can help alleviate the high computational cost of video classification networks; 2) which factors matter most in 3D group convolutional networks; and 3) what good computation/accuracy trade-offs look like for 3D group convolutional networks.
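The savings from channel separation are easy to quantify: a dense 3x3x3 convolution couples every input channel to every output channel, while the channel-separated form uses a depthwise 3x3x3 convolution plus a 1x1x1 pointwise convolution. A back-of-the-envelope parameter count (channel sizes are illustrative):

```python
def conv3d_params(c_in, c_out, k=3, groups=1):
    # Weight count of a 3D convolution with a cubic k*k*k kernel and `groups` groups.
    return (c_in // groups) * c_out * k ** 3

dense = conv3d_params(64, 64)                 # standard 3x3x3 conv
depthwise = conv3d_params(64, 64, groups=64)  # channel-separated (depthwise) 3x3x3
pointwise = conv3d_params(64, 64, k=1)        # 1x1x1 conv for channel interaction
print(dense, depthwise + pointwise)  # 110592 vs 5824
```

Here the separated pair uses roughly 19x fewer weights than the dense convolution, which is the kind of trade-off the paper's questions probe.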
Multiscale Vision Transformers
We evaluate this fundamental architectural prior of multiscale feature hierarchies for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training while being 5-10x more costly in computation and parameters.
ECO: Efficient Convolutional Network for Online Video Understanding
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.