Action Recognition
881 papers with code • 49 benchmarks • 105 datasets
Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify the observed action into one of a predefined set of action classes.
In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will yield a similar performance boost when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular action recognition benchmarks are small, on the order of 10k videos.
Please note some benchmarks may be located in the Action Classification or Video Classification tasks, e.g. Kinetics-400.
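To make the classification setup above concrete, here is a minimal sketch of clip-level action classification. Everything in it is a hypothetical stand-in: the frame features, classifier weights, and class names are random placeholders, not any benchmark's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 16 frames, each encoded as a 128-d feature vector
# by some backbone; 5 predefined action classes (all names invented).
frame_features = rng.standard_normal((16, 128))
classifier_w = rng.standard_normal((128, 5))
classes = ["walk", "run", "jump", "sit", "wave"]

# Temporal average pooling collapses the frame axis into one clip descriptor.
clip_feature = frame_features.mean(axis=0)   # shape (128,)

# Linear classifier + softmax over the predefined action classes.
logits = clip_feature @ classifier_w         # shape (5,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted = classes[int(np.argmax(probs))]
```

Real systems replace the random features with a learned spatio-temporal backbone (3D CNN or video Transformer), but the final step is still a distribution over a fixed label set.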
Libraries
Use these libraries to find Action Recognition models and implementations.
Subtasks
- Action Recognition In Videos
- 3D Action Recognition
- Self-Supervised Action Recognition
- Few Shot Action Recognition
- Fine-grained Action Recognition
- Action Triplet Recognition
- Open Set Action Recognition
- Micro-Action Recognition
- Weakly-Supervised Action Recognition
- Atomic Action Recognition
- Animal Action Recognition
- Transportation Mode Detection
- Open Vocabulary Action Recognition
- Action Recognition In Still Images
Latest papers
DeGCN: Deformable Graph Convolutional Networks for Skeleton-Based Action Recognition
Graph convolutional networks (GCN) have recently been studied to exploit the graph topology of the human body for skeleton-based action recognition.
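The core idea of exploiting the body's graph topology can be sketched in a few lines. The skeleton below is a toy 5-joint chain with random weights, not DeGCN itself; it shows one generic graph-convolution layer of the form ReLU(Â X W) over joint coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy skeleton: 5 joints connected in a chain (hypothetical topology).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
n = 5
A = np.eye(n)                        # adjacency with self-loops
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

X = rng.standard_normal((n, 3))      # per-joint 3D coordinates
W = rng.standard_normal((3, 8))      # projection weights (random here, learned in practice)

# One graph-convolution layer: aggregate over connected joints, project, ReLU.
H = np.maximum(A_hat @ X @ W, 0.0)   # shape (5, 8)
```

Skeleton-action models stack such layers (often with temporal convolutions across frames) so that each joint's representation mixes information from its anatomical neighbors.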
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
GCN-DevLSTM: Path Development for Skeleton-Based Action Recognition
Skeleton-based action recognition (SAR) in videos is an important but challenging task in computer vision.
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
To tackle these issues, we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training.
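The general flavor of training-free token reduction can be illustrated with a toy example. The saliency scores and embeddings below are random stand-ins (not vid-TLDR's actual saliency computation): low-saliency tokens are treated as background and merged into one averaged token, shrinking the sequence without any parameter updates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical clip: 10 tokens with 4-d embeddings and attention-derived
# saliency scores (random placeholders for a real video Transformer).
tokens = rng.standard_normal((10, 4))
saliency = rng.random(10)

# Keep the k most salient tokens; merge the rest (treated as background)
# into a single averaged token, reducing the sequence length for later layers.
k = 4
order = np.argsort(saliency)[::-1]
kept = tokens[order[:k]]
background = tokens[order[k:]].mean(axis=0, keepdims=True)

reduced = np.vstack([kept, background])   # shape (5, 4)
```

Since self-attention cost grows quadratically with sequence length, cutting 10 tokens to 5 this way roughly quarters the attention cost of subsequent layers.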
A Lie Group Approach to Riemannian Batch Normalization
Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups.
Skeleton-Based Human Action Recognition with Noisy Labels
In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark.
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types.
EventRPG: Event Data Augmentation with Relevance Propagation Guidance
Based on this, we propose EventRPG, which leverages relevance propagation on the spiking neural network for more efficient augmentation.
On the Utility of 3D Hand Poses for Action Recognition
3D hand poses are an under-explored modality for action recognition.
Real-Time Multimodal Cognitive Assistant for Emergency Medical Services
Emergency Medical Services (EMS) responders often operate under time-sensitive conditions, facing cognitive overload and inherent risks, requiring essential skills in critical thinking and rapid decision-making.