Temporal Action Localization
422 papers with code • 14 benchmarks • 42 datasets
Temporal Action Localization aims to detect action instances in an untrimmed video and output their start and end timestamps. It is closely related to Temporal Action Proposal Generation.
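Predictions in this task are commonly scored against ground truth by temporal Intersection-over-Union (tIoU) over (start, end) segments. A minimal sketch, assuming segments are given as (start, end) pairs in seconds (the function name `temporal_iou` is illustrative, not from any specific benchmark toolkit):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A 6 s prediction overlapping a 6 s ground-truth action by 4 s:
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # → 0.5
```

A detection typically counts as correct when its tIoU with a ground-truth instance exceeds a threshold (e.g. 0.5), mirroring spatial IoU in object detection.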
Latest papers
Test-Time Zero-Shot Temporal Action Localization
We introduce T3AL, a novel method that performs test-time adaptation for zero-shot Temporal Action Localization.
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos.
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL).
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
A Lie Group Approach to Riemannian Batch Normalization
Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups.
Skeleton-Based Human Action Recognition with Noisy Labels
In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark.
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping
Adversarial examples generated by a surrogate model typically exhibit limited transferability to unknown target systems.
Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition
Furthermore, to explicitly exploit the latent data distributions, we apply contrastive learning to the attentive features, modeling cross-sequence semantic relations by pulling together features from positive pairs and pushing apart those from negative pairs.
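The pull-together/push-apart objective described above is typically realized as an InfoNCE-style loss over cosine similarities. A minimal sketch, not the paper's implementation (the function names and the temperature value 0.1 are assumptions for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: low when the anchor is similar
    to the positive and dissimilar to every negative."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

# The loss drops when the positive aligns with the anchor:
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)  # → True
```

Minimizing this loss pulls positive-pair features together and pushes negative-pair features apart in the embedding space, which is the behavior the abstract describes.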
Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach
It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background.
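The two-stage idea of grouping snippets into latent clusters and then classifying each cluster as foreground or background can be sketched with a toy one-dimensional k-means over per-snippet actionness scores. This is an illustrative simplification, not the paper's method: the paper clusters snippet features into multiple latent clusters and learns the cluster classifier, whereas here we use scalar scores, two clusters, and a higher-mean heuristic.

```python
def _assign(score, centers):
    """Index of the nearest of two 1-D cluster centers."""
    return 0 if abs(score - centers[0]) <= abs(score - centers[1]) else 1

def cluster_snippets(scores, iters=10):
    """Toy two-cluster 1-D k-means over per-snippet actionness scores;
    the cluster with the higher mean is labeled foreground."""
    centers = [min(scores), max(scores)]
    for _ in range(iters):
        groups = [[], []]
        for s in scores:
            groups[_assign(s, centers)].append(s)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    fg = 0 if centers[0] > centers[1] else 1
    return ["foreground" if _assign(s, centers) == fg else "background"
            for s in scores]

print(cluster_snippets([0.1, 0.2, 0.9, 0.8, 0.15]))
# → ['background', 'background', 'foreground', 'foreground', 'background']
```

Separating foreground action snippets from background this way is the core difficulty in weakly-supervised localization, where only video-level labels are available.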