Action Classification
228 papers with code • 24 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Most implemented papers
Long-Term Feature Banks for Detailed Video Understanding
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment
Can performance on the task of action quality assessment (AQA) be improved by exploiting a description of the action and its quality?
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
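The idea of a handful of adaptively learned tokens can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the paper's implementation: each of the 8 tokens is produced by a softmax attention map over spatial positions, here computed from a single linear projection `w`.

```python
import numpy as np

def token_learner(feat, w):
    """Sketch of TokenLearner-style adaptive tokenization (simplified).

    feat: (H*W, C) flattened spatial feature map
    w:    (C, n_tokens) projection giving one attention map per token
    """
    logits = feat @ w                               # (H*W, n_tokens)
    attn = np.exp(logits - logits.max(axis=0))
    attn = attn / attn.sum(axis=0, keepdims=True)   # softmax over space
    return attn.T @ feat                            # (n_tokens, C) pooled tokens

rng = np.random.default_rng(0)
feat = rng.normal(size=(14 * 14, 64))   # e.g. a 14x14 feature map, 64 channels
w = rng.normal(size=(64, 8))            # 8 learned tokens
tokens = token_learner(feat, w)
print(tokens.shape)                     # (8, 64)
```

Downstream layers then attend over these 8 tokens instead of all 196 spatial positions, which is where the compute savings come from.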
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
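VideoMAE's data efficiency rests on masking a very high ratio of video patches with "tube" masking, where the same spatial patches are hidden in every frame so the model cannot copy them from neighboring frames. A minimal sketch of such a mask (assuming a flattened patch grid; not the authors' code):

```python
import numpy as np

def tube_mask(t, h, w, ratio=0.9, seed=0):
    """Sketch of tube masking: an identical spatial mask in every frame.

    Returns a boolean array of shape (t, h*w); True marks a masked patch.
    """
    rng = np.random.default_rng(seed)
    n = h * w
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(n * ratio), replace=False)] = True
    return np.broadcast_to(mask, (t, n))  # same mask repeated across time

m = tube_mask(16, 14, 14, ratio=0.9)
print(m.shape, int(m[0].sum()))  # (16, 196) 176
```

The encoder sees only the ~10% of visible patches, and a light decoder reconstructs the masked ones.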
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture
FlowGAN generates optical flow, which captures only the edges and motion of the videos to be generated.
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks.
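The core of sparse temporal pooling can be sketched in a few lines of numpy. This is a hypothetical simplification of the STPN idea: per-segment sigmoid attention weights aggregate segment features into a single video-level representation, and the same weights later serve to localize the action in time.

```python
import numpy as np

def sparse_temporal_pool(seg_feats, attn_logits):
    """Sketch of attention-weighted temporal pooling (simplified STPN).

    seg_feats:   (T, C) per-segment features from untrimmed video
    attn_logits: (T,)   per-segment attention logits
    Returns the pooled (C,) video feature and the (T,) attention weights.
    """
    attn = 1.0 / (1.0 + np.exp(-attn_logits))  # sigmoid attention in [0, 1]
    video_feat = (attn[:, None] * seg_feats).sum(axis=0) / (attn.sum() + 1e-8)
    return video_feat, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 128))     # 20 segments, 128-d features
logits = rng.normal(size=20)
v, a = sparse_temporal_pool(feats, logits)
print(v.shape)                         # (128,)
```

In the full method an L1 penalty on the attention weights encourages sparsity, so only a few segments (the action) dominate the pooled feature.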
Timeception for Complex Action Recognition
This paper focuses on the temporal aspect for recognizing human activities in videos; an important visual cue that has long been undervalued.
VideoBERT: A Joint Model for Video and Language Representation Learning
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.