Temporal Action Localization
422 papers with code • 14 benchmarks • 42 datasets
Temporal Action Localization aims to detect action instances in an untrimmed video and output their start and end timestamps. It is closely related to Temporal Action Proposal Generation.
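Predictions in this task are commonly scored against ground truth by temporal Intersection-over-Union (tIoU) over (start, end) segments. A minimal sketch, assuming segments are given as (start, end) pairs in seconds (the function name `temporal_iou` is illustrative, not from any specific benchmark toolkit):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A 6 s prediction overlapping a 6 s ground-truth action by 4 s:
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # → 0.5
```

A detection typically counts as correct when its tIoU with a ground-truth instance exceeds a threshold (e.g. 0.5), mirroring spatial IoU in object detection.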
Latest papers
Test-Time Zero-Shot Temporal Action Localization
We introduce T3AL, a novel method that performs test-time adaptation for zero-shot Temporal Action Localization.
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos.
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL).
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
A Lie Group Approach to Riemannian Batch Normalization
Using the deformation concept, we generalize the existing Lie groups on SPD manifolds into three families of parameterized Lie groups.
Skeleton-Based Human Action Recognition with Noisy Labels
In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark.
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping
Adversarial examples generated by a surrogate model typically exhibit limited transferability to unknown target systems.
Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition
Furthermore, to explicitly exploit the latent data distributions, we apply contrastive learning to the attentive features, modeling cross-sequence semantic relations by pulling together features from positive pairs and pushing apart those from negative pairs.
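The pull-together/push-apart objective described above is typically realized as an InfoNCE-style loss over cosine similarities. A minimal sketch, not the paper's implementation (the function names and the temperature value 0.1 are assumptions for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: low when the anchor is similar
    to the positive and dissimilar to every negative."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

# The loss drops when the positive aligns with the anchor:
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)  # → True
```

Minimizing this loss pulls positive-pair features together and pushes negative-pair features apart in the embedding space, which is the behavior the abstract describes.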
Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach
It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background.
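The two-stage idea of grouping snippets into latent clusters and then classifying each cluster as foreground or background can be sketched with a toy one-dimensional k-means over per-snippet actionness scores. This is an illustrative simplification, not the paper's method: the paper clusters snippet features into multiple latent clusters and learns the cluster classifier, whereas here we use scalar scores, two clusters, and a higher-mean heuristic.

```python
def _assign(score, centers):
    """Index of the nearest of two 1-D cluster centers."""
    return 0 if abs(score - centers[0]) <= abs(score - centers[1]) else 1

def cluster_snippets(scores, iters=10):
    """Toy two-cluster 1-D k-means over per-snippet actionness scores;
    the cluster with the higher mean is labeled foreground."""
    centers = [min(scores), max(scores)]
    for _ in range(iters):
        groups = [[], []]
        for s in scores:
            groups[_assign(s, centers)].append(s)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    fg = 0 if centers[0] > centers[1] else 1
    return ["foreground" if _assign(s, centers) == fg else "background"
            for s in scores]

print(cluster_snippets([0.1, 0.2, 0.9, 0.8, 0.15]))
# → ['background', 'background', 'foreground', 'foreground', 'background']
```

Separating foreground action snippets from background this way is the core difficulty in weakly-supervised localization, where only video-level labels are available.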