Action Recognition
881 papers with code • 49 benchmarks • 105 datasets
Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify the action being performed in the video or image into a predefined set of action classes.
In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, having on the order of 10k videos.
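A common baseline for the task described above is to score each frame independently and average the per-frame scores over time before picking a class. The sketch below illustrates this temporal-average-pooling idea; the class names and logit values are illustrative assumptions, not from any particular benchmark or model.

```python
import numpy as np

# Illustrative label set (hypothetical, not from a real benchmark).
ACTION_CLASSES = ["walking", "running", "jumping"]

def classify_video(frame_logits: np.ndarray) -> str:
    """Average per-frame logits over time, then pick the top class.

    frame_logits: array of shape (num_frames, num_classes), e.g. the
    output of a frame-level classifier applied to each video frame.
    """
    video_logits = frame_logits.mean(axis=0)           # temporal average pooling
    probs = np.exp(video_logits - video_logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return ACTION_CLASSES[int(np.argmax(probs))]

# Toy example: 4 frames, 3 classes; the "running" column dominates.
logits = np.array([[0.1, 2.0, 0.3],
                   [0.2, 1.8, 0.1],
                   [0.0, 2.2, 0.4],
                   [0.3, 1.9, 0.2]])
print(classify_video(logits))  # -> running
```

Modern approaches replace the per-frame scoring and mean pooling with learned spatio-temporal models (3D CNNs, transformers), but the input/output contract is the same: a stack of frames in, one action label out.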
Please note some benchmarks may be located in the Action Classification or Video Classification tasks, e.g. Kinetics-400.
Libraries
Use these libraries to find Action Recognition models and implementations.
Subtasks
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few Shot Action Recognition
- Fine-grained Action Recognition
- Action Triplet Recognition
- Open Set Action Recognition
- Micro-Action Recognition
- Weakly-Supervised Action Recognition
- Atomic action recognition
- Animal Action Recognition
- Transportation Mode Detection
- Open Vocabulary Action Recognition
- Action Recognition In Still Images
Latest papers with no code
In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition
Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions.
Exploring Explainability in Video Action Recognition
To address these, we introduce Video-TCAV, by building on TCAV for Image Classification tasks, which aims to quantify the importance of specific concepts in the decision-making process of Video Action Recognition models.
Multimodal Attack Detection for Action Recognition Models
In addition, we analyze our method's real-time performance with different hardware setups to demonstrate its potential as a practical defense mechanism.
MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition
Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer).
Simba: Mamba augmented U-ShiftGCN for Skeletal Action Recognition in Videos
These spatial features then undergo intermediate temporal modeling facilitated by the Mamba block before progressing to the encoder section, which comprises vanilla upsampling Shift S-GCN blocks.
Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition
However, previous works focus on establishing the bridges between the known skeleton representation space and semantic descriptions space at the coarse-grained level for recognizing unknown action categories, ignoring the fine-grained alignment of these two spaces, resulting in suboptimal performance in distinguishing high-similarity action categories.
O-TALC: Steps Towards Combating Oversegmentation within Online Action Segmentation
In order to facilitate online action segmentation on a stream of incoming video data, we introduce two methods for improved training and inference of backbone action recognition models, allowing them to be deployed directly for online frame level classification.
An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video
(3) We achieve the same performance with only 10% of the original data for training as with all of the original data from the real-world dataset, and a better performance on In-the-wild videos, by employing our data augmentation techniques.
X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model
The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making.
Learning Correlation Structures for Vision Transformers
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.