TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Moment Retrieval	Charades-STA	CG-DETR	R@1 IoU=0.5	58.44	# 8
Moment Retrieval	Charades-STA	CG-DETR	R@1 IoU=0.7	36.34	# 8
Moment Retrieval	QVHighlights	CG-DETR	mAP	42.86	# 10
Moment Retrieval	QVHighlights	CG-DETR	R@1 IoU=0.5	65.43	# 5
Moment Retrieval	QVHighlights	CG-DETR	R@1 IoU=0.7	48.38	# 7
Moment Retrieval	QVHighlights	CG-DETR	mAP@0.5	64.51	# 7
Moment Retrieval	QVHighlights	CG-DETR	mAP@0.75	42.77	# 9
Highlight Detection	QVHighlights	CG-DETR (w/ PT)	mAP	40.71	# 3
Highlight Detection	QVHighlights	CG-DETR (w/ PT)	Hit@1	66.60	# 2
Highlight Detection	QVHighlights	CG-DETR	mAP	40.33	# 5
Highlight Detection	QVHighlights	CG-DETR	Hit@1	66.21	# 4
Moment Retrieval	QVHighlights	CG-DETR (w/ PT)	mAP	47.97	# 1
Moment Retrieval	QVHighlights	CG-DETR (w/ PT)	R@1 IoU=0.5	68.48	# 1
Moment Retrieval	QVHighlights	CG-DETR (w/ PT)	R@1 IoU=0.7	53.11	# 1
Moment Retrieval	QVHighlights	CG-DETR (w/ PT)	mAP@0.5	69.40	# 1
Moment Retrieval	QVHighlights	CG-DETR (w/ PT)	mAP@0.75	49.12	# 1
Natural Language Moment Retrieval	TACoS	CG-DETR	R@1,IoU=0.3	52.23	# 2
Natural Language Moment Retrieval	TACoS	CG-DETR	R@1,IoU=0.5	39.61	# 2
Natural Language Moment Retrieval	TACoS	CG-DETR	R@1,IoU=0.7	22.23	# 3
Natural Language Moment Retrieval	TACoS	CG-DETR	mIoU	36.48	# 2
Highlight Detection	TvSum	CG-DETR	mAP	86.8	# 1
Highlight Detection	YouTube Highlights	CG-DETR	mAP	75.9	# 2
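
The R@1 IoU=θ rows above follow the usual moment-retrieval convention: a query counts as a hit when its top-ranked predicted span overlaps the ground-truth span with temporal IoU of at least θ, and the score is the percentage of hits. The sketch below illustrates that computation under simplifying assumptions: one ground-truth span per query and spans given as (start, end) in seconds. The function names `temporal_iou` and `recall_at_1` are hypothetical and are not taken from the benchmark's evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans on the time axis."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold):
    """Percentage of queries whose top-1 span reaches the IoU threshold.

    Assumes exactly one ground-truth span per query (a simplification).
    """
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Toy data: a perfect match, a partial overlap (IoU = 1/3), and a miss.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (0.0, 10.0)]
r1_at_05 = recall_at_1(preds, gts, 0.5)  # only the first query is a hit
r1_at_03 = recall_at_1(preds, gts, 0.3)  # the first two queries are hits
```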

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/correlation-guided-query-dependency/moment-retrieval-on-qvhighlights)](https://paperswithcode.com/sota/moment-retrieval-on-qvhighlights?p=correlation-guided-query-dependency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/correlation-guided-query-dependency/highlight-detection-on-tvsum)](https://paperswithcode.com/sota/highlight-detection-on-tvsum?p=correlation-guided-query-dependency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/correlation-guided-query-dependency/natural-language-moment-retrieval-on-tacos)](https://paperswithcode.com/sota/natural-language-moment-retrieval-on-tacos?p=correlation-guided-query-dependency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/correlation-guided-query-dependency/highlight-detection-on-youtube-highlights)](https://paperswithcode.com/sota/highlight-detection-on-youtube-highlights?p=correlation-guided-query-dependency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/correlation-guided-query-dependency/highlight-detection-on-qvhighlights)](https://paperswithcode.com/sota/highlight-detection-on-qvhighlights?p=correlation-guided-query-dependency)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/correlation-guided-query-dependency/moment-retrieval-on-charades-sta)](https://paperswithcode.com/sota/moment-retrieval-on-charades-sta?p=correlation-guided-query-dependency)`

Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding

15 Nov 2023 · WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo

Video temporal grounding aims to identify specific moments or highlights in a video that correspond to textual descriptions. Typical approaches treat all video clips equally during encoding, regardless of their semantic relevance to the text query. We therefore propose the Correlation-Guided DEtection TRansformer (CG-DETR), which provides clues about query-associated video clips within the cross-modal attention. First, we design an adaptive cross-attention with dummy tokens. Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet not all words equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., the moment and sentence levels, and inferring the clip-word correlation. Lastly, we exploit moment-specific characteristics and combine them with the context of each video to form a moment-adaptive saliency detector. By exploiting the degree of text engagement in each video clip, it precisely measures the highlightness of each clip. CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding.
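
The dummy-token idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head scaled dot-product attention, takes the dummy tokens as given vectors (the conditioning on the text query is omitted), and aggregates only the word values so that attention absorbed by dummies contributes nothing to the clip representation. The function name `dummy_token_cross_attention` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dummy_token_cross_attention(clips, words, dummies):
    """Clips attend over word tokens plus dummy tokens.

    clips:   (L_v, d) video clip features, used as queries
    words:   (L_w, d) text token features, used as keys and values
    dummies: (L_d, d) dummy tokens, used as keys only (assumption)
    Returns the text-attended clip features and, per clip, the share of
    attention that went to real words rather than dummies.
    """
    d = clips.shape[-1]
    keys = np.concatenate([words, dummies], axis=0)
    scores = clips @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)          # rows sum to 1 over words + dummies
    word_weights = weights[:, : words.shape[0]]
    attended = word_weights @ words             # dummies absorb weight, add no value
    text_mass = word_weights.sum(axis=-1)
    return attended, text_mass


# Toy case: clip 0 aligns with the word, clip 1 aligns with the dummy.
words = np.array([[4.0, 0.0]])
dummies = np.array([[0.0, 4.0]])
clips = np.array([[4.0, 0.0], [0.0, 4.0]])
attended, text_mass = dummy_token_cross_attention(clips, words, dummies)
# text_mass is close to 1 for clip 0 and close to 0 for clip 1
```

In this toy case the clip that matches the dummy token receives almost no text information, which is the behavior the abstract describes for clips irrelevant to the query.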

Code

wjun0830/cgdetr official

wjun0830/qd-detr

168

Tasks

Highlight Detection

Moment Retrieval

Natural Language Moment Retrieval

Representation Learning

Sentence

Datasets

Charades-STA

TVSum

QVHighlights

TACoS Multi-Level Corpus

Results from the Paper

Ranked #1 on Highlight Detection on TvSum

