Action Classification
227 papers with code • 24 benchmarks • 30 datasets
Latest papers
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Open-Vocabulary Video Relation Extraction
A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions.
CAST: Cross-Attention in Space and Time for Video Action Recognition
In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input.
Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living
To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential.
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding.
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks (i.e., introducing zero extra cost to the adapted models during inference).
MOFO: MOtion FOcused Self-Supervision for Video Understanding
Despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos.
Progression-Guided Temporal Action Detection in Videos
The framework locates actions in videos by detecting the action evolution process.
Temporally-Adaptive Models for Efficient Video Understanding
Spatial convolutions are extensively used in numerous deep video models.
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors.