Action Detection
233 papers with code • 11 benchmarks • 33 datasets
Action Detection aims to find both where and when an action occurs within a video clip and to classify which action is taking place. Results are typically given as action tubelets: action bounding boxes linked across time in the video. The task is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which only classifies which action is occurring and typically assumes a trimmed video.
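A tubelet can be thought of as one bounding box per frame, linked over a contiguous span of time, together with an action label. The sketch below is purely illustrative; the class and field names are assumptions, not a standard API.

```python
from dataclasses import dataclass, field

@dataclass
class Tubelet:
    """An action tubelet: one box per frame, linked across time.

    Illustrative structure only; names are assumptions, not a
    standard library interface.
    """
    label: str
    start_frame: int
    boxes: list = field(default_factory=list)  # (x1, y1, x2, y2) per frame

    @property
    def end_frame(self) -> int:
        # Last frame covered by the tubelet (inclusive).
        return self.start_frame + len(self.boxes) - 1

# A hypothetical "waving" action over frames 10-12, box drifting right.
tube = Tubelet(label="waving", start_frame=10,
               boxes=[(40, 30, 80, 120), (42, 30, 82, 120), (45, 31, 85, 121)])
assert tube.end_frame == 12
```

A trimmed-video action recognizer would only need `label`; a temporal localizer would only need `start_frame` and `end_frame`; action detection requires all three pieces.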
Most implemented papers
Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization
End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once.
Temporal Action Localization with Enhanced Instant Discriminability
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
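Predicted action boundaries in TAD are commonly matched to ground truth by temporal IoU (intersection over union of the two time intervals). A minimal sketch, with segments given as `(start, end)` pairs:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two (start, end) segments (frames or seconds)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Prediction (2.0, 6.0) vs ground truth (3.0, 7.0):
# intersection = 3.0, union = 5.0, so tIoU = 0.6
print(temporal_iou((2.0, 6.0), (3.0, 7.0)))  # 0.6
```

Benchmarks typically report mean average precision at one or more tIoU thresholds (e.g. 0.5), analogous to spatial IoU thresholds in object detection.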
Single Shot Temporal Action Detection
The main drawback of this framework is that the boundaries of action instance proposals have been fixed during the classification step.
Learning Latent Super-Events to Detect Multiple Activities in Videos
In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos.
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
A total of 6,637 temporal annotations are automatically parsed from online match reports at a one-minute resolution for three main classes of events (Goal, Yellow/Red Card, and Substitution).
Temporal Recurrent Networks for Online Action Detection
Most work on temporal action detection is formulated as an offline problem, in which the start and end times of actions are determined after the entire video is fully observed.
Learning Motion in Feature Space: Locally-Consistent Deformable Convolution Networks for Fine-Grained Action Detection
Fine-grained action detection is an important task with numerous applications in robotics and human-computer interaction.
Actor Conditioned Attention Maps for Video Action Detection
While observing complex events with multiple actors, humans do not assess each actor separately, but infer from the context.
Personal VAD: Speaker-Conditioned Voice Activity Detection
In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level.
Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
Videos capture events that typically contain multiple sequential, and simultaneous, actions even in the span of only a few seconds.