Action Recognition In Videos
64 papers with code • 17 benchmarks • 17 datasets
Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
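As a minimal illustration of the task (not any particular paper's method), the simplest pipeline extracts a feature vector per frame, pools over time, and applies a linear classifier over the action classes. All names and shapes below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical label set for illustration.
ACTIONS = ["running", "jumping", "swimming"]

def classify_video(frame_features: np.ndarray, W: np.ndarray, b: np.ndarray) -> str:
    """Average per-frame features over time, then apply a linear classifier."""
    clip_feature = frame_features.mean(axis=0)   # (D,) temporal average pooling
    logits = clip_feature @ W + b                # (num_classes,)
    return ACTIONS[int(np.argmax(logits))]

# Toy example: 16 frames with 8-dim features and random classifier weights.
rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))
W = rng.normal(size=(8, len(ACTIONS)))
b = np.zeros(len(ACTIONS))
label = classify_video(frames, W, b)
```

Real systems replace the random features with a learned spatiotemporal backbone (3D CNN, two-stream network, or video transformer), but the classify-over-pooled-features structure is the common baseline.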
Libraries
Use these libraries to find Action Recognition In Videos models and implementations.
Datasets
Latest papers
MMNet: A Model-Based Multimodal Network for Human Action Recognition in RGB-D Videos
After aggregating the results of multiple modalities, our method outperforms state-of-the-art approaches on six evaluation protocols across five datasets; thus, the proposed MMNet can effectively capture mutually complementary features in different RGB-D video modalities and provide more discriminative features for HAR.
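Aggregating results across modalities is often done by late score fusion: each stream (e.g. RGB, depth, skeleton) produces class logits, and their softmax probabilities are averaged before the final prediction. A minimal sketch, with hypothetical weights and logits (this is a generic fusion baseline, not MMNet itself):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_modalities(rgb_logits: np.ndarray,
                    depth_logits: np.ndarray,
                    weights=(0.5, 0.5)) -> int:
    """Late fusion: weighted average of per-modality class probabilities."""
    probs = weights[0] * softmax(rgb_logits) + weights[1] * softmax(depth_logits)
    return int(np.argmax(probs))
```

The weights can be tuned on a validation set; equal weighting is the usual starting point.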
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions in the task of video action recognition with competitive results.
Self-supervised Video Transformer
To the best of our knowledge, the proposed approach is the first to alleviate the dependency on negative samples or dedicated memory banks in Self-supervised Video Transformer (SVT).
Florence: A New Foundation Model for Computer Vision
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.
Logsig-RNN: a novel network for robust and efficient skeleton-based action recognition
In this paper, we propose a novel module, namely Logsig-RNN, which is the combination of the log-signature layer and recurrent type neural networks (RNNs).
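To give intuition for the log-signature layer, here is a rough numpy sketch of a depth-2 log-signature of a discrete path: level 1 is the total increment, and level 2 is the Lévy area (the antisymmetric part of the second-level signature). This is a simplified stand-in, not the paper's implementation (which would typically use a signature library and feed the result into an RNN):

```python
import numpy as np

def logsignature_depth2(path: np.ndarray):
    """Depth-2 log-signature approximation for a path of shape (T, d)."""
    increments = np.diff(path, axis=0)      # (T-1, d) step increments
    level1 = increments.sum(axis=0)         # X_T - X_0 (first level)
    displaced = path[:-1] - path[0]         # left-point displacements from start
    sig2 = displaced.T @ increments         # second-level signature approximation
    levy_area = 0.5 * (sig2 - sig2.T)       # antisymmetric part = level-2 log-signature
    return level1, levy_area
```

For a straight-line path the Lévy area vanishes, which is one reason the log-signature is a compact, robust summary of the path's shape rather than of its raw samples.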
ActionCLIP: A New Paradigm for Video Action Recognition
Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune".
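In a CLIP-style "pre-train, prompt and fine-tune" setup, label texts are wrapped in prompts (e.g. "a video of a person running") and classification reduces to cosine similarity between a video embedding and the prompted text embeddings. A minimal sketch with hypothetical embeddings standing in for the learned encoders:

```python
import numpy as np

def zero_shot_action(video_emb: np.ndarray,
                     prompt_embs: np.ndarray,
                     labels: list) -> str:
    """Pick the label whose prompted text embedding is most similar
    (by cosine similarity) to the video embedding."""
    v = video_emb / np.linalg.norm(video_emb)
    t = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(t @ v))]
```

Because the label set enters only through text prompts, new action classes can be added at inference time without retraining the classifier head.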
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting
Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning.
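The contrastive loss referred to here is typically InfoNCE: each anchor embedding should be most similar to its own augmented positive among all positives in the batch. A minimal numpy sketch (a generic InfoNCE, not the paper's cross-stream prototypical variant):

```python
import numpy as np

def info_nce_loss(anchors: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.1) -> float:
    """InfoNCE over a batch: row i of `anchors` should match row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))         # diagonal = true pairs
```

The loss is small when matched pairs are far more similar than mismatched ones, which is exactly what drives the learned representations apart at the instance level.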
Space-time Mixing Attention for Video Transformer
In this work, we propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence, and which hence incurs no overhead compared to an image-based Transformer model.
Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition
In our TSMF, we utilize a teacher network to transfer the structural knowledge of the skeleton modality to a student network for the RGB modality.
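Teacher-to-student knowledge transfer of this kind is commonly trained with a temperature-softened KL-divergence distillation loss between the two networks' class distributions. A minimal numpy sketch of that standard loss (the TSMF paper transfers structural skeleton knowledge, so its actual objective differs in detail):

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      T: float = 4.0) -> float:
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as in standard distillation."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(np.mean(kl) * T * T)
```

The higher temperature exposes the teacher's "dark knowledge" in the relative probabilities of non-target classes, which is what the student learns from.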
Learning Implicit Temporal Alignment for Few-shot Video Classification
Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications.