Action Segmentation
73 papers with code • 9 benchmarks • 16 datasets
Action Segmentation is a challenging problem in high-level video understanding. In its simplest form, Action Segmentation aims to partition a temporally untrimmed video along the time axis and label each segment with one of a set of pre-defined action labels. The results of Action Segmentation can then serve as input to downstream applications such as video-to-text generation and action localization.
Source: TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation
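The task definition above reduces, in its simplest form, to assigning a label to every frame and then grouping consecutive frames with the same label into segments. A minimal sketch of that grouping step (the label names and the helper function are illustrative, not taken from any of the papers below):

```python
def frames_to_segments(frame_labels):
    """Collapse a per-frame label sequence into (start, end, label) segments,
    with `end` exclusive."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment at the end of the sequence or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start, i, frame_labels[start]))
            start = i
    return segments

# Hypothetical frame-wise predictions for a short cooking clip:
labels = ["pour", "pour", "stir", "stir", "stir", "pour"]
print(frames_to_segments(labels))
# → [(0, 2, 'pour'), (2, 5, 'stir'), (5, 6, 'pour')]
```

Evaluation metrics for the task (segmental F1, edit score) are typically computed on exactly this segment representation rather than on raw frame labels.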
Latest papers
Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation
This paper introduces a unified framework for video action segmentation via sequence-to-sequence (seq2seq) translation in a fully and timestamp supervised setup.
RF-Next: Efficient Receptive Field Search for Convolutional Neural Networks
Our search scheme exploits both global search, to find coarse combinations, and local search, to further refine the receptive field combinations.
Do we really need temporal convolutions in action segmentation?
Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models.
Cross-Enhancement Transformer for Action Segmentation
Temporal convolutions have been the paradigm of choice in action segmentation; they enlarge long-term receptive fields by stacking more convolution layers.
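Several of the papers here rest on the same observation: stacking temporal convolution layers with exponentially increasing dilation makes the receptive field grow exponentially with depth. A minimal sketch of that arithmetic (kernel size and dilation schedule are generic assumptions, not tied to a particular model):

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of 1-D convolutions where layer l
    uses dilation 2**l (a common schedule in temporal conv models)."""
    rf = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        # Each layer adds (kernel_size - 1) * dilation frames of context.
        rf += (kernel_size - 1) * dilation
    return rf

for depth in (1, 5, 10):
    print(depth, receptive_field(depth))
# → 1 3
#   5 63
#   10 2047
```

With kernel size 3 the receptive field is 2^(L+1) - 1 frames after L layers, which is why covering minute-long videos pushes these models toward many stacked layers, the inflexibility the papers above criticize.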
Temporal Alignment Networks for Long-term Video
The objective of this paper is a temporal alignment network that ingests long-term video sequences and associated text sentences in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, determine its alignment.
Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities
Assembly101 is a new procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 "take-apart" toy vehicles.
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos
The generated text prompts are paired with corresponding video clips, and together co-train the text encoder and the video encoder via a contrastive approach.
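Co-training two encoders contrastively typically means scoring every text embedding against every video embedding in a batch and treating the matching pairs as positives. A hedged sketch of a symmetric InfoNCE-style loss of that kind (the shapes, temperature, and function name are illustrative assumptions, not Bridge-Prompt's actual implementation):

```python
import numpy as np

def contrastive_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric cross-entropy over the text-video similarity matrix.
    Row i of each input is the embedding of pair i, so matching
    text/video pairs sit on the diagonal."""
    # L2-normalise so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature            # (N, N) similarity matrix
    targets = np.arange(len(logits))          # positives on the diagonal

    def xent(l):
        # Numerically stable log-softmax per row.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[targets, targets].mean()

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the paired embeddings already agree (e.g. identical orthonormal rows), the loss is near zero; mismatched pairs push it up, which is the gradient signal that co-trains both encoders.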
HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
We present HOI4D, a large-scale 4D egocentric dataset with rich annotations, to catalyze the research of category-level human-object interaction.
Skeleton-Based Action Segmentation with Multi-Stage Spatial-Temporal Graph Convolutional Neural Networks
State-of-the-art action segmentation approaches use multiple stages of temporal convolutions.
Set-Supervised Action Learning in Procedural Task Videos via Pairwise Order Consistency
We address the problem of set-supervised action learning, whose goal is to learn an action segmentation model using weak supervision in the form of sets of actions occurring in training videos.