Action Recognition In Videos
64 papers with code • 17 benchmarks • 17 datasets
Action Recognition in Videos is a task in computer vision and pattern recognition where the goal is to identify and categorize human actions performed in a video sequence. The task involves analyzing the spatiotemporal dynamics of the actions and mapping them to a predefined set of action classes, such as running, jumping, or swimming.
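The input/output contract common to these methods can be sketched minimally: a clip is a stack of per-frame features, pooled over time and scored against the class set. All names, shapes, and the average-pool-plus-linear-head design below are illustrative, not taken from any listed paper.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTIONS = ["running", "jumping", "swimming"]

def classify_clip(clip, weights, bias):
    """Score a video clip against a fixed set of action classes.

    clip:    (T, D) array -- one D-dim feature vector per frame
    weights: (D, C) array -- linear classifier over C action classes
    """
    pooled = clip.mean(axis=0)          # average-pool over time
    logits = pooled @ weights + bias    # one score per action class
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                # softmax over classes
    return ACTIONS[int(probs.argmax())], probs

clip = rng.normal(size=(16, 8))         # 16 frames, 8-dim features each
W = rng.normal(size=(8, len(ACTIONS)))
b = np.zeros(len(ACTIONS))
label, probs = classify_clip(clip, W, b)
```

Real systems replace the average pool with the spatiotemporal models below (3D CNNs, RNNs, attention), but the clip-in, class-distribution-out shape of the problem is the same.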
Libraries
Use these libraries to find Action Recognition In Videos models and implementations.
Most implemented papers
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
We address the problem of activity detection in continuous, untrimmed video streams.
What Makes Training Multi-Modal Classification Networks Hard?
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart.
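The paper's analysis and remedy are in the paper itself; what follows is only a generic late-fusion sketch of the two-branch setup being compared, with illustrative names and shapes. Zeroing one branch recovers the single-modal baseline, which is the comparison the abstract describes.

```python
import numpy as np

def late_fusion_logits(rgb_feat, audio_feat, w_rgb, w_audio):
    """Late fusion: each modality gets its own linear head and the
    per-class logits are summed. Dropping one branch (zero features)
    reduces the network to its single-modal counterpart."""
    return rgb_feat @ w_rgb + audio_feat @ w_audio

rng = np.random.default_rng(1)
rgb, audio = rng.normal(size=4), rng.normal(size=6)
w_r, w_a = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))

fused = late_fusion_logits(rgb, audio, w_r, w_a)
rgb_only = late_fusion_logits(rgb, np.zeros(6), w_r, w_a)
```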
Gating Revisited: Deep Multi-layer RNNs That Can Be Trained
We propose a new STAckable Recurrent cell (STAR) for recurrent neural networks (RNNs), which has fewer parameters than the widely used LSTM and GRU cells while being more robust against vanishing or exploding gradients.
Action Recognition using Visual Attention
We propose a soft attention based model for the task of action recognition in videos.
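The core operation in such models is soft attention over the spatial locations of each frame's feature map: a query (e.g. the recurrent hidden state) scores every location, and the frame is summarized as the softmax-weighted average. A minimal sketch with illustrative shapes (a 7×7 feature grid):

```python
import numpy as np

def soft_attention_pool(feature_map, query):
    """Weighted average of spatial features.

    feature_map: (K, D) -- K spatial locations, D-dim features
    query:       (D,)   -- e.g. the RNN's previous hidden state
    Returns the attended feature and the attention weights.
    """
    scores = feature_map @ query             # one score per location
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over locations
    return weights @ feature_map, weights

rng = np.random.default_rng(2)
fmap = rng.normal(size=(49, 16))             # 7x7 grid, 16-dim features
h = rng.normal(size=16)
attended, attn = soft_attention_pool(fmap, h)
```

Because the weights are a distribution over locations, they can be visualized as a heatmap showing where the model looks in each frame.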
2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning
Action recognition and human pose estimation are closely related, yet the two problems are generally handled as distinct tasks in the literature.
Resource Efficient 3D Convolutional Neural Networks
Recently, convolutional neural networks with 3D kernels (3D CNNs) have become very popular in the computer vision community, owing to their superior ability, compared to 2D CNNs, to extract spatio-temporal features from video frames.
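What distinguishes a 3D kernel is that it also spans a window of frames, so each output value mixes information across time as well as space. A naive "valid" 3D convolution makes this concrete (pure numpy for clarity; real networks use optimized library kernels):

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over (time, height, width).

    Unlike a 2D kernel applied per frame, the kernel spans kt frames,
    so every output value aggregates a spatio-temporal neighborhood.
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t+kt, y:y+kh, x:x+kw] * kernel)
    return out

video = np.arange(4 * 5 * 5, dtype=float).reshape(4, 5, 5)  # 4 frames, 5x5
k = np.ones((3, 3, 3)) / 27.0        # spatio-temporal averaging kernel
feat = conv3d_valid(video, k)        # output spans 2 time steps, 3x3 space
```

The resource cost the paper targets is visible here: a kt×kh×kw kernel has kt times the parameters and multiply-adds of its 2D counterpart, which is why efficient 3D architectures matter.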
Learning Video Representations from Correspondence Proposals
In particular, the proposed network can effectively learn representations for videos by mixing appearance and long-range motion with an RGB-only input.
IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos
Most current action recognition methods heavily rely on appearance information by taking an RGB sequence of entire image regions as input.
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
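The IIC-specific positive/negative construction is described in the paper; the underlying contrastive objective, however, follows the standard InfoNCE form: pull the embedding of a positive clip view toward the anchor, push negatives away. A self-contained sketch (illustrative names, cosine similarity, temperature 0.1):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic contrastive (InfoNCE) loss over clip embeddings."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # First logit is the positive pair; the rest are negatives.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()               # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(3)
z = rng.normal(size=8)

# Easy case: the positive matches the anchor, negatives are random.
loss_easy = info_nce(z, z, [rng.normal(size=8) for _ in range(4)])
# Hard case: the positive is opposite, negatives match the anchor.
loss_hard = info_nce(z, -z, [z.copy() for _ in range(4)])
```

Training the spatio-temporal encoder to minimize this loss yields video representations without manual labels, which is the self-supervised setting the abstract describes.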