Action Recognition
878 papers with code • 49 benchmarks • 105 datasets
Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify and categorize the actions being performed in the video or image into a predefined set of action classes.
In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, having on the order of 10k videos.
Please note some benchmarks may be located in the Action Classification or Video Classification tasks, e.g. Kinetics-400.
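The task definition above (assigning a video to one of a predefined set of action classes) can be illustrated with a minimal sketch. It assumes some model has already produced one score per class for each sampled frame; the class names, the score values, and the choice to average scores over frames are all hypothetical and are not taken from any specific paper listed on this page.

```python
from typing import List


def classify_clip(frame_scores: List[List[float]], class_names: List[str]) -> str:
    """Average per-frame class scores and return the top-scoring class name."""
    if not frame_scores:
        raise ValueError("need at least one frame of scores")
    num_classes = len(class_names)
    totals = [0.0] * num_classes
    for scores in frame_scores:
        if len(scores) != num_classes:
            raise ValueError("each frame needs exactly one score per class")
        for i, score in enumerate(scores):
            totals[i] += score
    # Mean score per class over all sampled frames.
    means = [total / len(frame_scores) for total in totals]
    best = max(range(num_classes), key=lambda i: means[i])
    return class_names[best]


# Hypothetical label set and per-frame scores (3 frames, 3 classes).
CLASSES = ["walking", "jumping", "waving"]
FRAME_SCORES = [
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
]

predicted = classify_clip(FRAME_SCORES, CLASSES)
print(predicted)
```

Averaging is only one way to pool evidence over time; it ignores frame order, which is exactly the temporal information that many of the models listed below try to exploit.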
Libraries
Use these libraries to find Action Recognition models and implementations.
Subtasks
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few Shot Action Recognition
- Fine-grained Action Recognition
- Action Triplet Recognition
- Open Set Action Recognition
- Micro-Action Recognition
- Weakly-Supervised Action Recognition
- Atomic Action Recognition
- Animal Action Recognition
- Transportation Mode Detection
- Open Vocabulary Action Recognition
- Action Recognition In Still Images
Most implemented papers
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network.
SlowFast Networks for Video Recognition
We present SlowFast networks for video recognition.
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denote a proposal as a matching pair of starting and ending boundaries and combine all densely distributed BM pairs into the BM confidence map.
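As a rough illustration of the idea in the summary above, the sketch below enumerates every (start, end) boundary pair up to a maximum duration and arranges the scores in a two-dimensional map. The (duration, start) layout, the function names, and the input probabilities are assumptions made for this example. The score used here, a plain product of start and end boundary probabilities, is only a stand-in: the paper evaluates confidence with its Boundary-Matching mechanism, which this toy does not reproduce.

```python
from typing import List, Tuple


def build_bm_map(start_prob: List[float], end_prob: List[float],
                 max_duration: int) -> List[List[float]]:
    """Score every proposal; entry [d - 1][s] covers start s and end s + d."""
    num_steps = len(start_prob)
    if len(end_prob) != num_steps:
        raise ValueError("start and end sequences must have equal length")
    bm_map = [[0.0] * num_steps for _ in range(max_duration)]
    for duration in range(1, max_duration + 1):
        for start in range(num_steps):
            end = start + duration
            if end >= num_steps:
                continue  # proposal would run past the video; leave it at 0
            # Placeholder confidence (assumption): product of boundary scores.
            bm_map[duration - 1][start] = start_prob[start] * end_prob[end]
    return bm_map


def best_proposal(bm_map: List[List[float]]) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring entry in the map."""
    best = (0, 1, bm_map[0][0])
    for d_index, row in enumerate(bm_map):
        for start, score in enumerate(row):
            if score > best[2]:
                best = (start, start + d_index + 1, score)
    return best


# Hypothetical boundary probabilities for a video with 5 temporal steps.
START_PROB = [0.1, 0.9, 0.2, 0.1, 0.1]
END_PROB = [0.1, 0.1, 0.2, 0.8, 0.1]

bm_map = build_bm_map(START_PROB, END_PROB, max_duration=3)
start, end, score = best_proposal(bm_map)
print(start, end, round(score, 2))
```

Storing all candidates in one dense map means every proposal can be scored in a single pass and ranked afterwards, rather than being generated and evaluated one at a time.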
Video Swin Transformer
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
TSM: Temporal Shift Module for Efficient Video Understanding
The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.
Is Space-Time Attention All You Need for Video Understanding?
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
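To make "self-attention over space and time" concrete, here is a minimal sketch of single-head attention applied jointly to all patch tokens from all frames. It is a simplification under stated assumptions: the query, key, and value projections are taken to be the identity, there is no positional encoding, and the tiny input video is invented for the example. The attention scheme actually used in the paper may differ from this joint form.

```python
import math
from typing import List

Vector = List[float]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_video(video: List[List[Vector]]) -> List[Vector]:
    """Turn video[frame][patch] into one token sequence over space and time."""
    return [patch for frame in video for patch in frame]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention with identity projections (assumption)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Every token attends to every other token, in any frame.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(dim)])
    return outputs


# Hypothetical video: 2 frames x 2 patches, each patch a 2-d feature vector.
VIDEO = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, 0.5]],
]

tokens = flatten_video(VIDEO)
attended = self_attention(tokens)
print(len(attended), len(attended[0]))
```

Because each output is a weighted average of all tokens, a patch in one frame can draw directly on patches in other frames, which is how this kind of model captures motion without convolutions. The cost grows quadratically with the number of tokens, so joint attention over long videos becomes expensive.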
Temporal Segment Networks for Action Recognition in Videos
Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Unsupervised Learning of Video Representations using LSTMs
We further evaluate the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.
Graph-Based Global Reasoning Networks
In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.