TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Spatio-Temporal Video Grounding	HC-STVG1	CG-STVG	m_vIoU	38.4	# 1
Spatio-Temporal Video Grounding	HC-STVG1	CG-STVG	vIoU@0.3	61.5	# 1
Spatio-Temporal Video Grounding	HC-STVG1	CG-STVG	vIoU@0.5	36.3	# 1
Spatio-Temporal Video Grounding	HC-STVG2	CG-STVG	Val m_vIoU	39.5	# 1
Spatio-Temporal Video Grounding	HC-STVG2	CG-STVG	Val vIoU@0.3	64.5	# 2
Spatio-Temporal Video Grounding	HC-STVG2	CG-STVG	Val vIoU@0.5	36.3	# 1
Spatio-Temporal Video Grounding	VidSTG	CG-STVG	Declarative m_vIoU	34.0	# 1
Spatio-Temporal Video Grounding	VidSTG	CG-STVG	Declarative vIoU@0.3	47.7	# 1
Spatio-Temporal Video Grounding	VidSTG	CG-STVG	Declarative vIoU@0.5	33.1	# 1
Spatio-Temporal Video Grounding	VidSTG	CG-STVG	Interrogative m_vIoU	29.0	# 1
Spatio-Temporal Video Grounding	VidSTG	CG-STVG	Interrogative vIoU@0.3	40.5	# 1
Spatio-Temporal Video Grounding	VidSTG	CG-STVG	Interrogative vIoU@0.5	27.5	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/context-guided-spatio-temporal-video/spatio-temporal-video-grounding-on-hc-stvg1)](https://paperswithcode.com/sota/spatio-temporal-video-grounding-on-hc-stvg1?p=context-guided-spatio-temporal-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/context-guided-spatio-temporal-video/spatio-temporal-video-grounding-on-hc-stvg2)](https://paperswithcode.com/sota/spatio-temporal-video-grounding-on-hc-stvg2?p=context-guided-spatio-temporal-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/context-guided-spatio-temporal-video/spatio-temporal-video-grounding-on-vidstg)](https://paperswithcode.com/sota/spatio-temporal-video-grounding-on-vidstg?p=context-guided-spatio-temporal-video)`

Context-Guided Spatio-Temporal Video Grounding

3 Jan 2024 · Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang

The spatio-temporal video grounding (STVG) task aims to locate a spatio-temporal tube for a specific instance given a text query. Despite recent advances, current methods are easily affected by distractors or heavy object appearance variations in videos because the text provides insufficient object information, which degrades performance. To address this, we propose a novel framework, context-guided STVG (CG-STVG), which mines discriminative instance context for the object in videos and applies it as supplementary guidance for target localization. The key to CG-STVG lies in two specially designed modules: instance context generation (ICG), which focuses on discovering visual context information (in both appearance and motion) of the instance, and instance context refinement (ICR), which aims to improve the instance context from ICG by eliminating irrelevant or even harmful information from the context. During grounding, ICG, together with ICR, is deployed at each decoding stage of a Transformer architecture for instance context learning. In particular, the instance context learned at one decoding stage is fed to the next stage and leveraged as guidance containing rich and discriminative object features to enhance the target-awareness of the decoding features, which in turn helps generate better new instance context and ultimately improves localization. Compared to existing methods, CG-STVG uses both the object information in the text query and the guidance from the mined instance visual context for more accurate target localization. In our experiments on three benchmarks, HCSTVG-v1/-v2 and VidSTG, CG-STVG sets a new state of the art in m_tIoU and m_vIoU on all of them, showing its efficacy. The code will be released at https://github.com/HengLan/CGSTVG.
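The metrics named in the abstract and in the results table (m_tIoU, m_vIoU, vIoU@0.3, vIoU@0.5) can be illustrated with a minimal Python sketch. It assumes the definitions commonly used in the STVG literature: tIoU is the ratio of shared frames to the union of predicted and ground-truth frames, vIoU is the sum of per-frame box IoU over the shared frames divided by the size of that union, and vIoU@R is the fraction of samples whose vIoU exceeds R. The page itself does not define these metrics, so the formulas, the strict inequality, and every name below (`Tube`, `box_iou`, `t_iou`, `v_iou`, `summarize`) are assumptions for illustration and are not taken from the CG-STVG code.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # hypothetical format: (x1, y1, x2, y2)
Tube = Dict[int, Box]                    # frame index -> box in that frame


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Tube, gt: Tube) -> float:
    """Temporal IoU: shared frames divided by the union of frames."""
    union = set(pred) | set(gt)
    return len(set(pred) & set(gt)) / len(union) if union else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Sum of box IoU over shared frames, divided by the union of frames."""
    union = set(pred) | set(gt)
    if not union:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(union)


def summarize(
    pairs: List[Tuple[Tube, Tube]],
    thresholds: Sequence[float] = (0.3, 0.5),
) -> Dict[str, float]:
    """Average the per-sample scores; vIoU@R counts samples with vIoU > R."""
    n = len(pairs)
    v_scores = [v_iou(p, g) for p, g in pairs]
    t_scores = [t_iou(p, g) for p, g in pairs]
    result = {"m_tIoU": sum(t_scores) / n, "m_vIoU": sum(v_scores) / n}
    for r in thresholds:
        result[f"vIoU@{r}"] = sum(s > r for s in v_scores) / n
    return result
```

Under these assumptions, a prediction that matches the ground-truth boxes exactly but overlaps it in only one of three frames scores a vIoU of 1/3. The values in the results table appear to be percentages, so the fractions returned here would be multiplied by 100 for comparison.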

Code

henglan/cgstvg official

Tasks

Object

Spatio-Temporal Video Grounding

Video Grounding

Datasets

Visual Question Answering

VidSTG HC-STVG2 HC-STVG1

Results from the Paper

Ranked #1 on Spatio-Temporal Video Grounding on HC-STVG1

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
