Action Recognition In Videos
64 papers with code • 17 benchmarks • 17 datasets
Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
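The task described above can be sketched as a toy pipeline: pool per-frame features over time, then map the pooled clip feature to an action class. This is a minimal illustrative sketch, not any particular model; the feature dimension, random weights `W`, and the `classify` helper are all hypothetical.

```python
import numpy as np

# Toy video: 16 frames of 8-dim per-frame features. In practice these would
# come from a CNN or transformer backbone; shapes here are illustrative.
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(16, 8))   # (T, D)

ACTIONS = ["running", "jumping", "swimming"]

# Hypothetical classifier weights mapping pooled features to action logits.
W = rng.normal(size=(8, len(ACTIONS)))

def classify(frames: np.ndarray) -> str:
    """Temporal average pooling followed by a linear classifier."""
    clip_feature = frames.mean(axis=0)       # pool over time: (D,)
    logits = clip_feature @ W                # (num_classes,)
    return ACTIONS[int(np.argmax(logits))]

print(classify(frame_features))
```

Real systems replace the average pooling with learned spatiotemporal modeling (3D convolutions, space-time attention), which is exactly what the papers listed below vary.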
Latest papers
ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
Our framework leverages both labeled and unlabeled data, combining pseudo-labeling with contrastive learning to robustly learn action representations from both types of samples.
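The pseudo-labeling idea mentioned above can be sketched as confidence thresholding: unlabeled clips whose predicted class probability clears a threshold are treated as labeled. This is a generic sketch, not ActNetFormer's actual training loop; the `pseudo_labels` function and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np

def pseudo_labels(probs: np.ndarray, threshold: float = 0.9):
    """Keep unlabeled samples whose max class probability clears threshold.

    probs: (N, C) softmax outputs for N unlabeled clips over C action classes.
    Returns (indices, labels) for the confidently predicted clips.
    """
    confidence = probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

probs = np.array([
    [0.95, 0.03, 0.02],   # confident -> pseudo-labeled as class 0
    [0.40, 0.35, 0.25],   # uncertain -> discarded
    [0.05, 0.92, 0.03],   # confident -> pseudo-labeled as class 1
])
idx, labels = pseudo_labels(probs)
print(idx, labels)   # -> [0 2] [0 1]
```

The kept `(idx, labels)` pairs would then feed a supervised loss alongside the genuinely labeled data.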
HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition
Action recognition in videos poses a challenge due to its high computational cost, especially for Joint Space-Time video transformers (Joint VT).
CAST: Cross-Attention in Space and Time for Video Action Recognition
In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input.
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors.
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Dual-path Adaptation from Image to Video Transformers
In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer
To extend our approach to video, we integrate ConvNets with state-of-the-art temporal methods such as Transformer and Recurrent Neural Networks.
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
We present a simple approach that turns a ViT encoder into an efficient video model, working seamlessly with both image and video inputs.
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
We show that it is possible to use a multi-modal model to tackle a task that it was not designed for.