Action Recognition

883 papers with code • 49 benchmarks • 105 datasets

Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify the actions being performed into a predefined set of action classes.
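At its core, the task reduces to mapping a clip (a sequence of frames) to one label from a fixed set. A toy sketch of that pipeline, with entirely hypothetical features, classes, and weights, might pool per-frame features over time and score each class:

```python
# Toy illustration of clip-level action classification.
# ACTIONS, the features, and the weights are all made up for this sketch.
ACTIONS = ["walking", "running", "jumping"]  # predefined action classes

def classify_clip(frame_features, class_weights):
    """Average per-frame feature vectors over time, then score each class
    with a linear projection and return the highest-scoring label."""
    dim = len(frame_features[0])
    pooled = [sum(f[i] for f in frame_features) / len(frame_features)
              for i in range(dim)]
    scores = [sum(w * x for w, x in zip(weights, pooled))
              for weights in class_weights]
    return ACTIONS[scores.index(max(scores))]

# Hypothetical 2-D features for a 3-frame clip and hand-picked weights.
clip = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
weights = [[1.0, 0.0],   # "walking" responds to the first feature
           [0.0, 1.0],   # "running" responds to the second
           [0.5, 0.5]]   # "jumping" responds to both
print(classify_clip(clip, weights))  # → walking
```

Real systems replace the hand-crafted features with learned spatiotemporal representations (3D CNNs, video transformers), but the classify-into-a-fixed-label-set structure is the same.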

In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, on the order of 10k videos.

Please note that some benchmarks may be located under the Action Classification or Video Classification tasks, e.g. Kinetics-400.

Libraries

Use these libraries to find Action Recognition models and implementations

Latest papers with no code

Simba: Mamba augmented U-ShiftGCN for Skeletal Action Recognition in Videos

no code yet • 11 Apr 2024

These spatial features then undergo intermediate temporal modeling facilitated by the Mamba block before progressing to the encoder section, which comprises vanilla upsampling Shift S-GCN blocks.

Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition

no code yet • 11 Apr 2024

However, previous works focus on building bridges between the known skeleton-representation space and the semantic-description space only at a coarse-grained level for recognizing unknown action categories; they ignore the fine-grained alignment of the two spaces, which results in suboptimal performance when distinguishing highly similar action categories.

O-TALC: Steps Towards Combating Oversegmentation within Online Action Segmentation

no code yet • 10 Apr 2024

In order to facilitate online action segmentation on a stream of incoming video data, we introduce two methods for improved training and inference of backbone action recognition models, allowing them to be deployed directly for online frame-level classification.
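The general shape of online frame-level classification can be sketched as a streaming loop. This is a generic illustration, not the O-TALC method: a hypothetical `classify` function labels each incoming frame, and a short majority-vote window smooths out the spurious label flips that cause oversegmentation:

```python
from collections import Counter, deque

def online_labels(frame_stream, classify, window=3):
    """Hypothetical online inference loop: label each incoming frame,
    then emit the majority label over a short buffer of recent
    predictions to suppress one-off label flips (oversegmentation)."""
    buf = deque(maxlen=window)
    for frame in frame_stream:
        buf.append(classify(frame))
        # majority vote over the last `window` per-frame predictions
        yield Counter(buf).most_common(1)[0][0]

# Toy stream where frames already are their labels: the isolated "1"
# at position 2 is smoothed away, while the sustained run survives.
smoothed = list(online_labels([0, 0, 1, 0, 0, 1, 1, 1], lambda x: x))
print(smoothed)  # → [0, 0, 0, 0, 0, 0, 1, 1]
```

The trade-off is latency: a larger window removes more spurious segments but delays the detection of genuine action boundaries.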

An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video

no code yet • 10 Apr 2024

Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications.

X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model

no code yet • 7 Apr 2024

The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making.

Learning Correlation Structures for Vision Transformers

no code yet • 5 Apr 2024

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.
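The key-query interactions the abstract refers to are the raw score maps of standard scaled dot-product attention, which StructSA builds on. A minimal stdlib sketch of that baseline (plain attention, not StructSA itself):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention. The per-query score lists below are
    the key-query correlation maps whose structural patterns StructSA
    is designed to exploit; here they are simply softmax-normalized."""
    d = len(queries[0])
    out = []
    for q in queries:
        # key-query correlations, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        # attention-weighted average of the values
        out.append([sum(wj * v[i] for wj, v in zip(w, values))
                    for i in range(len(values[0]))])
    return out

# One query attending over two one-hot keys/values: it should weight
# the matching key more heavily, and the output weights sum to 1.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

StructSA replaces the pointwise use of these scores with convolutions over the whole correlation map, so spatial and temporal structure in the scores (not just their magnitudes) informs the aggregation.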

PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos

no code yet • 5 Apr 2024

PhysPT exploits a Transformer encoder-decoder backbone to effectively learn human dynamics in a self-supervised manner.

Koala: Key frame-conditioned long video-LLM

no code yet • 5 Apr 2024

Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.

Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition

no code yet • 3 Apr 2024

Skeleton-based gesture recognition methods have achieved great success using Graph Convolutional Networks (GCNs).
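In skeleton-based recognition, a GCN layer aggregates each joint's features from its neighbours in the skeleton graph before projecting them. A minimal sketch of one such layer (mean aggregation with self-loops; a simplification of the normalized adjacency used in practice):

```python
def gcn_layer(adj, feats, weight):
    """One graph-convolution step on a skeleton graph: for each joint,
    mean-aggregate the features of its neighbours (plus itself), then
    apply a shared linear projection `weight` (dim_in x dim_out)."""
    n = len(adj)
    out = []
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j] or j == i]
        agg = [sum(feats[j][d] for j in neigh) / len(neigh)
               for d in range(len(feats[0]))]
        out.append([sum(a * weight[d][k] for d, a in enumerate(agg))
                    for k in range(len(weight[0]))])
    return out

# Toy 3-joint chain (e.g. hip-knee-ankle) with 2-D coordinates as
# features and an identity projection, so only aggregation is visible.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(adj, feats, weight)  # each joint pulled toward neighbours
```

Stacking such layers (with learned weights and temporal convolutions across frames) is the basic recipe behind ST-GCN-style skeleton action recognizers.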

Leveraging YOLO-World and GPT-4V LMMs for Zero-Shot Person Detection and Action Recognition in Drone Imagery

no code yet • 2 Apr 2024

In this article, we explore the potential of zero-shot Large Multimodal Models (LMMs) in the domain of drone perception.