Action Detection

235 papers with code • 11 benchmarks • 33 datasets

Action Detection aims to find both where and when an action occurs within a video clip and to classify the action taking place. Results are typically given in the form of action tubelets, which are action bounding boxes linked across time in the video. The task is related to temporal localization, which seeks to identify the start and end frames of an action, and to action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
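As a rough illustration of how per-frame boxes can become tubelets, the sketch below greedily links detections in consecutive frames when they share a class label and their boxes overlap (spatial IoU at or above a threshold). It also computes temporal IoU between inclusive (start, end) frame spans, the overlap measure commonly used to score temporal localization. The helper names (`link_tubelets`, `box_iou`, `temporal_iou`), the 0.5 threshold and the greedy matching rule are illustrative assumptions, not the method of any paper or library listed on this page. Published detectors often use more elaborate linking, such as score-based dynamic programming.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

# Assumed conventions: a box is (x1, y1, x2, y2); a detection is (label, box).
Box = Tuple[float, float, float, float]
Detection = Tuple[str, Box]


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """IoU of two inclusive (start_frame, end_frame) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


@dataclass
class Tubelet:
    label: str
    start_frame: int
    boxes: List[Box] = field(default_factory=list)  # one box per frame

    @property
    def end_frame(self) -> int:
        return self.start_frame + len(self.boxes) - 1


def link_tubelets(frames: Sequence[Sequence[Detection]],
                  iou_thresh: float = 0.5) -> List[Tubelet]:
    """Hypothetical greedy frame-to-frame linker (not a library API).

    Each active tubelet is extended by the same-label detection in the next
    frame with the highest IoU >= iou_thresh; tubelets with no match are
    closed, and leftover detections start new tubelets.
    """
    active: List[Tubelet] = []
    finished: List[Tubelet] = []
    for t, dets in enumerate(frames):
        unmatched = list(dets)
        next_active: List[Tubelet] = []
        for tube in active:
            best: Optional[Detection] = None
            best_iou = iou_thresh
            for det in unmatched:
                label, box = det
                if label != tube.label:
                    continue
                iou = box_iou(tube.boxes[-1], box)
                if iou >= best_iou:
                    best, best_iou = det, iou
            if best is None:
                finished.append(tube)  # no continuation: close the tubelet
            else:
                tube.boxes.append(best[1])
                unmatched.remove(best)
                next_active.append(tube)
        # Detections not claimed by any tubelet start new ones at frame t.
        next_active.extend(Tubelet(label, t, [box]) for label, box in unmatched)
        active = next_active
    finished.extend(active)
    return finished


if __name__ == "__main__":
    frames = [
        [("run", (0, 0, 10, 10))],
        [("run", (1, 0, 11, 10))],
        [("run", (2, 0, 12, 10)), ("jump", (50, 50, 60, 60))],
    ]
    for tube in link_tubelets(frames):
        print(tube.label, tube.start_frame, tube.end_frame)
    # run 0 2
    # jump 2 2
```

Greedy linking keeps the example short but can make locally poor choices when tubelets cross or occlude each other. That weakness is one reason global linking with per-box confidence scores is a common alternative.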

Benchmarks

These leaderboards are used to track progress in Action Detection.

Dataset	Best Model
J-HMDB	HIT
Charades	TTM
UCF101-24	STAR/L
Multi-THUMOS	MLAD
UCF Sports	T-CNN
THUMOS' 14	MAT (Ours) Trans
TSU	PDAN
TTStroke-21 ME22	STCNN-V2 (Vote decision)
TTStroke-21 ME21	STCNN
MultiSports	HIT
MultiTHUMOS	PAT

Libraries

Use these libraries to find Action Detection models and implementations.

open-mmlab/mmaction2 • 6 papers • 3,916 stars

alibaba-damo-academy/FunASR • 3 papers • 3,417 stars

Frostinassiky/gtad • 3 papers • 216 stars

towhee-io/towhee • 2 papers • 3,003 stars

Subtasks

Audio-Visual Active Speaker Detection

Fine-Grained Action Detection

Action Triplet Detection

Few Shot Temporal Action Localization

Multiple Action Detection

Latest papers with no code

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

no code yet • 16 Jan 2024

The proposed method can take audio-visual input and leverage the speaker's acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework.

Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

no code yet • 7 Jan 2024

Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal.

Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

no code yet • 27 Dec 2023

Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions compared to purely supervised learning.

SADA: Semantic adversarial unsupervised domain adaptation for Temporal Action Localization

no code yet • 20 Dec 2023

Temporal Action Localization (TAL) is a complex task that poses relevant challenges, particularly when attempting to generalize to new, unseen domains in real-world applications.

Spatiotemporal Event Graphs for Dynamic Scene Understanding

no code yet • 11 Dec 2023

In this thesis, we present a series of frameworks for dynamic scene understanding starting from road event detection from an autonomous driving perspective to complex video activity detection, followed by continual learning approaches for the life-long learning of the models.

Low-power, Continuous Remote Behavioral Localization with Event Cameras

no code yet • 6 Dec 2023

However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage.

Towards More Practical Group Activity Detection: A New Benchmark and Model

no code yet • 5 Dec 2023

Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video.

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

no code yet • 4 Dec 2023

To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels.

SPIRE-SIES: A Spontaneous Indian English Speech Corpus

no code yet • 1 Dec 2023

Transcripts for 23 hours of audio are generated and validated and can serve as a spontaneous speech ASR benchmark.

ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization

no code yet • 27 Nov 2023

This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set.
