Action Classification
227 papers with code • 24 benchmarks • 30 datasets
Image source: The Kinetics Human Action Video Dataset
Most implemented papers
ECO: Efficient Convolutional Network for Online Video Understanding
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
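A minimal sketch of the idea (not the paper's exact architecture): a shared 2D CNN extracts per-frame features, the feature maps are stacked along time, and a small 3D network mixes them so the whole clip is processed in a single pass. All layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EcoStyleNet(nn.Module):
    """Toy ECO-style network: 2D features per frame, then 3D conv across time."""
    def __init__(self, num_classes=400, feat_ch=96):
        super().__init__()
        # Shared 2D backbone applied to every sampled frame (illustrative depth).
        self.backbone2d = nn.Sequential(
            nn.Conv2d(3, feat_ch, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        # 3D head that mixes information across the whole frame stack.
        self.head3d = nn.Sequential(
            nn.Conv3d(feat_ch, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, video):                       # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                # (B*T, 3, H, W)
        feats = self.backbone2d(frames)             # (B*T, C, H', W')
        feats = feats.view(b, t, *feats.shape[1:])  # (B, T, C, H', W')
        feats = feats.permute(0, 2, 1, 3, 4)        # (B, C, T, H', W') for Conv3d
        pooled = self.head3d(feats).flatten(1)      # (B, 128)
        return self.classifier(pooled)

logits = EcoStyleNet()(torch.randn(2, 8, 3, 112, 112))  # -> (2, 400)
```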
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
We launch EVA, a vision-centric foundation model to explore the limits of masked visual representation learning at scale using only publicly accessible data.
Temporal Relational Reasoning in Videos
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species.
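The sort of module this motivates can be sketched in a few lines: pairwise temporal relations T2(V) = h_phi(sum over i<j of g_theta([f_i, f_j])) computed over time-ordered frame features, as in TRN-style reasoning. The feature dimensions and the small MLPs below are illustrative assumptions.

```python
import itertools
import torch
import torch.nn as nn

class PairwiseTemporalRelation(nn.Module):
    """T2(V) = h_phi( sum_{i<j} g_theta([f_i, f_j]) ) over per-frame features."""
    def __init__(self, feat_dim=256, hidden=256, num_classes=174):
        super().__init__()
        self.g_theta = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())
        self.h_phi = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):                 # (B, T, D), time-ordered
        t = frame_feats.size(1)
        rel = 0
        for i, j in itertools.combinations(range(t), 2):
            pair = torch.cat([frame_feats[:, i], frame_feats[:, j]], dim=-1)
            rel = rel + self.g_theta(pair)          # accumulate pairwise relations
        return self.h_phi(rel)

scores = PairwiseTemporalRelation()(torch.randn(4, 8, 256))  # -> (4, 174)
```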
Representation Flow for Action Recognition
Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.
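The actual layer unrolls TV-L1 optical-flow iterations with learnable parameters; the toy sketch below keeps only the essential property, a fully differentiable, iterative flow estimate computed per representation channel, substituting plain gradient steps on a brightness-constancy residual for the TV-L1 updates.

```python
import torch

def representation_flow(f1, f2, iters=10, step=0.5):
    """Crude differentiable flow between consecutive feature maps (B, C, H, W).

    Runs gradient-descent iterations on the linearized brightness-constancy
    residual r = Ix*u + Iy*v + It, treated independently per channel.
    """
    # Spatial gradients of f1 (central differences along H, then W).
    iy, ix = torch.gradient(f1, dim=(2, 3))
    it = f2 - f1                                   # temporal difference
    u = torch.zeros_like(f1)
    v = torch.zeros_like(f1)
    for _ in range(iters):
        r = ix * u + iy * v + it                   # constancy residual
        u = u - step * r * ix                      # gradient step on 0.5 * r**2
        v = v - step * r * iy
    return u, v

u, v = representation_flow(torch.randn(2, 32, 14, 14), torch.randn(2, 32, 14, 14))
```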
Revisiting 3D ResNets for Video Recognition
A recent work by Bello et al. shows that training and scaling strategies may be more significant than model architectures for visual recognition.
Masked Feature Prediction for Self-Supervised Visual Pre-Training
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models.
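A minimal sketch of the objective: replace a random subset of tokens with a learnable mask token, encode the full sequence, and regress the target features (HOG descriptors in the paper) at the masked positions only. Encoder depth, widths, and the target dimension below are illustrative.

```python
import torch
import torch.nn as nn

class MaskFeatSketch(nn.Module):
    """Replace masked tokens, encode, regress target features at masked positions."""
    def __init__(self, dim=256, target_dim=108, depth=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.pred = nn.Linear(dim, target_dim)      # regress e.g. HOG descriptors

    def forward(self, tokens, targets, mask):       # tokens: (B, N, D), mask: (B, N) bool
        x = torch.where(mask[..., None], self.mask_token.expand_as(tokens), tokens)
        pred = self.pred(self.encoder(x))           # (B, N, target_dim)
        # The loss is taken only on masked positions.
        return ((pred - targets) ** 2)[mask].mean()

tokens, targets = torch.randn(2, 49, 256), torch.randn(2, 49, 108)
mask = torch.rand(2, 49) < 0.4
loss = MaskFeatSketch()(tokens, targets, mask)
```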
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
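A sketch of that combined objective, assuming the unimodal embeddings and decoder logits are already computed; the loss weights and temperature below are illustrative, not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_targets,
                    temperature=0.07, lam_con=1.0, lam_cap=2.0):
    """Contrastive loss on unimodal embeddings plus autoregressive captioning loss."""
    img = F.normalize(img_emb, dim=-1)              # (B, D)
    txt = F.normalize(txt_emb, dim=-1)              # (B, D)
    logits = img @ txt.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))           # matched pairs lie on the diagonal
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2
    # Captioning: next-token cross-entropy over the multimodal decoder outputs.
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return lam_con * contrastive + lam_cap * captioning

loss = coca_style_loss(torch.randn(4, 512), torch.randn(4, 512),
                       torch.randn(4, 16, 30000), torch.randint(0, 30000, (4, 16)))
```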
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
In this study, we focus on transferring knowledge for video classification tasks.
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: we introduce a Video Attribute Association mechanism that leverages Video-to-Text knowledge to generate textual auxiliary attributes to complement video recognition.
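A hedged sketch of the retrieval step behind such an attribute mechanism, assuming video and word embeddings already live in a shared CLIP-style space; the lexicon, embeddings, and k below are illustrative placeholders, and the full framework also feeds the retrieved attributes back into recognition.

```python
import torch
import torch.nn.functional as F

def video_attributes(video_emb, word_embs, lexicon, k=5):
    """Retrieve the k lexicon words closest to each video in a shared
    vision-language embedding space, to serve as auxiliary textual attributes."""
    sims = F.normalize(video_emb, dim=-1) @ F.normalize(word_embs, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices             # (B, k) word indices per video
    return [[lexicon[i] for i in row] for row in topk.tolist()]

lexicon = ["running", "ball", "water", "guitar", "jumping", "kitchen"]
attrs = video_attributes(torch.randn(2, 512), torch.randn(len(lexicon), 512),
                         lexicon, k=3)              # e.g. [['ball', 'water', ...], ...]
```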
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices can exploit spatiotemporal dynamics to improve overall performance.
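A minimal sketch of the RNN branch: an LSTM consumes per-frame feature vectors and classifies from the final hidden state. The paper additionally segments features over time and pairs this with a Temporal-ConvNet (Temporal-Inception) branch; the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class TSLSTMSketch(nn.Module):
    """LSTM over per-frame CNN features, classifying from the last hidden state."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        out, _ = self.lstm(feats)                  # (B, T, hidden)
        return self.fc(out[:, -1])                 # classify from final time step

logits = TSLSTMSketch()(torch.randn(2, 25, 2048))  # -> (2, 101)
```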