Spatio-Temporal Video Grounding

6 papers with code • 3 benchmarks • 3 datasets

Spatio-temporal video grounding is a task at the intersection of computer vision and natural language processing (NLP). Given a video and a textual query, the goal is to localize what the query describes in both space (where in each frame) and time (when in the video). The result is usually expressed as a spatio-temporal tube, that is, a sequence of bounding boxes over a span of frames. Applications include video summarization, content-based video retrieval, and video captioning.
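To make the task output concrete, here is a minimal Python sketch of a spatio-temporal tube and one way to score a predicted tube against a reference. The `Tube` container, the `tube_overlap` function, and the example query are hypothetical names chosen for illustration; they are not taken from any of the papers listed below. The scoring rule (per-frame box IoU summed over shared frames, divided by the number of frames in the union) is an assumption that resembles tube-overlap metrics used in this area, not a reproduction of any specific benchmark's protocol.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates (assumed convention).
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical container: one bounding box per frame index."""
    boxes: Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_overlap(pred: Tube, gt: Tube) -> float:
    """Sum per-frame IoU over shared frames, normalized by the frame union.

    Frames covered by only one tube contribute zero, so a prediction is
    penalized for both temporal and spatial misalignment.
    """
    shared = pred.boxes.keys() & gt.boxes.keys()
    union = pred.boxes.keys() | gt.boxes.keys()
    if not union:
        return 0.0
    return sum(box_iou(pred.boxes[t], gt.boxes[t]) for t in shared) / len(union)


# Illustrative query and tubes (made-up values, not from any dataset).
query = "the person in red walks to the door"
gt = Tube({t: (10.0, 10.0, 50.0, 50.0) for t in range(10, 20)})
pred = Tube({t: (10.0, 10.0, 50.0, 50.0) for t in range(15, 25)})
score = tube_overlap(pred, gt)
```

In this example the boxes agree exactly on the frames both tubes cover, so the score is driven entirely by the temporal mismatch between the two frame ranges.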

Most implemented papers

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Guaranteer/VidSTG-Dataset CVPR 2020

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

tzhhhh123/HC-STVG 10 Nov 2020

HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.

TubeDETR: Spatio-Temporal Video Grounding with Transformers

antoyang/TubeDETR CVPR 2022

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

jy0205/stcat 27 Sep 2022

Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

mbzuai-oryx/video-llava 22 Nov 2023

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data.

Context-Guided Spatio-Temporal Video Grounding

henglan/cgstvg 3 Jan 2024

CG-STVG is built around two specially designed modules: instance context generation (ICG), which discovers visual context information (both appearance and motion) of the instance, and instance context refinement (ICR), which improves the instance context from ICG by removing irrelevant or even harmful information.