Action Detection
233 papers with code • 11 benchmarks • 33 datasets
Action Detection aims to find both where and when an action occurs within a video clip and to classify which action is taking place. Results are typically given in the form of action tubelets, which are action bounding boxes linked across time in the video. The task is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
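The idea of an action tubelet can be illustrated with a minimal sketch that links per-frame bounding boxes across time. This is an illustrative greedy linker, not the method of any paper listed here. The names `iou` and `link_tubelets`, the `(x1, y1, x2, y2)` box format, and the `min_iou` threshold of 0.5 are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]
# A tubelet is a list of (frame_index, box) pairs on consecutive frames.
Tubelet = List[Tuple[int, Box]]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(frames: List[List[Box]], min_iou: float = 0.5) -> List[Tubelet]:
    """Greedily link per-frame boxes into tubelets (hypothetical helper).

    Each active tubelet is extended with the unmatched box in the next
    frame that overlaps its last box the most, provided the overlap is
    at least `min_iou`. Otherwise the tubelet is closed.
    """
    finished: List[Tubelet] = []
    active: List[Tubelet] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        still_active: List[Tubelet] = []
        for tube in active:
            last_box = tube[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                tube.append((t, best))
                unmatched.remove(best)
                still_active.append(tube)
            else:
                finished.append(tube)
        # Every box that extended no tubelet starts a new one.
        for b in unmatched:
            still_active.append([(t, b)])
        active = still_active
    return finished + active


if __name__ == "__main__":
    # Three frames with one detection each; the last box is far away.
    detections = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(50, 50, 60, 60)]]
    for tube in link_tubelets(detections):
        start, end = tube[0][0], tube[-1][0]
        print(f"tubelet over frames {start}..{end}: {len(tube)} boxes")
```

The first and last frame index of each tubelet give its temporal extent, which is the quantity that temporal localization estimates on its own. Practical systems typically also use detection scores and class labels when linking, which this sketch omits.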
Most implemented papers
From Recognition to Prediction: Analysis of Human Action and Trajectory Prediction in Video
With advances in deep learning for computer vision, systems are now able to analyze an unprecedented amount of rich visual information from videos, enabling applications such as autonomous driving, socially aware robot assistants, and public safety monitoring.
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
This thesis explores different approaches using Convolutional and Recurrent Neural Networks to classify and temporally localize activities in videos; furthermore, an implementation of these approaches is proposed.
An End-to-End Architecture for Keyword Spotting and Voice Activity Detection
We propose a single neural network architecture for two tasks: on-line keyword spotting and voice activity detection.
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
We address the problem of activity detection in continuous, untrimmed video streams.
Fine-grained Activity Recognition in Baseball Videos
In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.
rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method
In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity.
pyannote.audio: neural building blocks for speaker diarization
We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization.
A Multigrid Method for Efficiently Training Video Models
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context.
Context-Aware RCNN: A Baseline for Action Detection in Videos
In this work, we first find empirically that recognition accuracy is highly correlated with the bounding box size of an actor, and thus that a higher actor resolution contributes to better performance.