Action Recognition

881 papers with code • 49 benchmarks • 105 datasets

Action Recognition is a computer vision task that involves recognizing human actions in videos or images. The goal is to classify the actions performed in a video or image into a predefined set of action classes.
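
To make the task concrete, here is a minimal sketch of clip-level action classification with a pretrained model, using torchvision's R3D-18 trained on Kinetics-400. A real pipeline would sample frames from a video and apply the weights' preprocessing transforms; the random tensor below is only a stand-in.

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# 3D ResNet-18 pre-trained on Kinetics-400 (400 action classes).
weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()

# A clip is a 5D tensor: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)  # stand-in for a preprocessed clip

with torch.no_grad():
    probs = model(clip).softmax(dim=1)  # (1, 400) class probabilities
    label = weights.meta["categories"][probs.argmax().item()]
```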

In the video domain, it is an open question whether training an action classification network on a sufficiently large dataset will give a similar boost in performance when applied to a different temporal task or dataset. The challenges of building video datasets have meant that most popular benchmarks for action recognition are small, containing on the order of 10k videos.

Please note that some benchmarks, e.g. Kinetics-400, may be listed under the Action Classification or Video Classification tasks.

Libraries

Use these libraries to find Action Recognition models and implementations
See all 8 libraries.

CoFInAl: Enhancing Action Quality Assessment with Coarse-to-Fine Instruction Alignment

zhoukanglei/cofinal_aqa 22 Apr 2024

Pre-trained action recognition backbones are commonly reused for AQA, but this strategy yields suboptimal results because these backbones struggle to capture the subtle cues essential for AQA.

Aligning Actions and Walking to LLM-Generated Textual Descriptions

radu1999/walkandtext 18 Apr 2024

For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations.
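
The motion-text alignment can be sketched as a CLIP-style symmetric contrastive loss between motion-sequence embeddings and embeddings of the LLM-generated descriptions; the embedding dimension and batch below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def alignment_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched motion/text pairs attract, others repel."""
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0))            # i-th motion <-> i-th text
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Dummy batch of 8 motion embeddings and 8 LLM-description embeddings.
loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```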

VG4D: Vision-Language Model Goes 4D Video Recognition

shark0-0/vg4d 17 Apr 2024

By transferring the knowledge of the VLM to the 4D encoder and combining it with the VLM, our VG4D achieves improved recognition performance.
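
A common way to realize such cross-model transfer is feature distillation from the (frozen) VLM into the 4D encoder; the cosine-distance loss below is a generic sketch, not necessarily the exact objective used in VG4D.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(student_feat, teacher_feat):
    """Pull the 4D encoder's features toward the frozen VLM's features."""
    student_feat = F.normalize(student_feat, dim=-1)
    teacher_feat = F.normalize(teacher_feat, dim=-1)
    return (1 - (student_feat * teacher_feat).sum(dim=-1)).mean()

# Dummy features: 4D (point-cloud video) encoder output vs. VLM output.
loss = feature_distillation_loss(torch.randn(8, 512), torch.randn(8, 512))
```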

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

faceonlive/ai-research 9 Apr 2024

Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples.
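
The pseudo-labeling half of such a framework can be sketched as confidence-thresholded self-training; the model call and the 0.95 threshold are placeholders, and the contrastive half would add an InfoNCE-style loss over augmented clip pairs.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_clips, threshold=0.95):
    """Keep only high-confidence predictions on unlabeled clips as targets."""
    probs = F.softmax(model(unlabeled_clips), dim=1)
    confidence, labels = probs.max(dim=1)
    keep = confidence >= threshold  # discard uncertain clips
    return unlabeled_clips[keep], labels[keep]
```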

TIM: A Time Interval Machine for Audio-Visual Action Recognition

faceonlive/ai-research 8 Apr 2024

We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
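
One plausible way to represent an event's temporal extent is to embed its (start, end) interval together with a modality tag and use the result as a query; this is a sketch under that assumption, not TIM's actual module.

```python
import torch
import torch.nn as nn

class TimeIntervalEncoder(nn.Module):
    """Embed an event's (start, end) extent plus its modality (audio/visual)."""
    def __init__(self, dim=256):
        super().__init__()
        self.interval_mlp = nn.Sequential(
            nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.modality_emb = nn.Embedding(2, dim)  # 0 = audio, 1 = visual

    def forward(self, start, end, modality):
        # start/end are normalized to [0, 1] within the long video.
        interval = torch.stack([start, end], dim=-1)
        return self.interval_mlp(interval) + self.modality_emb(modality)

enc = TimeIntervalEncoder()
query = enc(torch.tensor([0.10]), torch.tensor([0.35]), torch.tensor([0]))
```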

PREGO: online mistake detection in PRocedural EGOcentric videos

aleflabo/prego 2 Apr 2024

We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos.
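
At its simplest, online mistake detection of this kind can be sketched as comparing an anticipated next step against the action actually recognized; this toy check is a drastic simplification of the paper's approach.

```python
def mistake_detected(anticipated_action: str, recognized_action: str) -> bool:
    """Flag a procedural mistake when the recognized current action
    contradicts the action anticipated from the steps observed so far."""
    return anticipated_action != recognized_action

# Example: the model anticipated "pour water" but recognized "add salt".
assert mistake_detected("pour water", "add salt")
```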

Disentangled Pre-training for Human-Object Interaction Detection

xingaoli/dp-hoi 2 Apr 2024

We propose DP-HOI, an efficient disentangled pre-training method for HOI detection.

OmniVid: A Generative Framework for Universal Video Understanding

wangjk666/omnivid 26 Mar 2024

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

zc-alexfan/arctic 25 Mar 2024

We interact with the world with our hands and see it through our own (egocentric) perspective.

Understanding Long Videos in One Multimodal Language Model Pass

kahnchana/mvu 25 Mar 2024

In addition to faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks, even with no video-specific information.
