Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding

15 Nov 2023 · WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo

Video temporal grounding aims to identify the specific moments or highlights in a video that correspond to a textual description. Typical approaches treat all video clips equally during encoding, regardless of their semantic relevance to the text query. We therefore propose the Correlation-Guided DEtection TRansformer (CG-DETR), which provides clues about query-associated video clips within the cross-modal attention. First, we design an adaptive cross-attention with dummy tokens. Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet not all words equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., at the moment and sentence level, and inferring the clip-word correlation. Lastly, we exploit moment-specific characteristics and combine them with the context of each video to form a moment-adaptive saliency detector. By exploiting the degree of text engagement in each video clip, it precisely measures the highlightness of each clip. CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding.
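
To make the dummy-token idea concrete, here is a minimal sketch, not the authors' implementation. It assumes clips act as attention queries, word and dummy tokens act as keys, a single softmax runs over both, and only the word tokens contribute values, so any weight absorbed by the dummies is simply dropped. The function name `dummy_cross_attention` and the toy vectors are hypothetical, and the learned projections and query-conditioning of the dummies are omitted.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dummy_cross_attention(clips, words, dummies):
    """Cross-attention with extra dummy keys (illustrative assumption).

    clips:   list of d-dim clip vectors (queries)
    words:   list of d-dim word vectors (keys and values)
    dummies: list of d-dim dummy vectors (keys only)
    Returns (outputs, word_mass), where word_mass[i] is the share of
    clip i's attention that lands on real words rather than dummies.
    """
    d = len(clips[0])
    keys = words + dummies
    outputs, word_mass = [], []
    for q in clips:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        word_w = weights[:len(words)]  # weights on dummies are discarded
        out = [sum(w * v[j] for w, v in zip(word_w, words)) for j in range(d)]
        outputs.append(out)
        word_mass.append(sum(word_w))
    return outputs, word_mass


# Toy example: clip 0 aligns with the word, clip 1 aligns with the dummy.
clips = [[4.0, 0.0], [0.0, 4.0]]
words = [[1.0, 0.0]]
dummies = [[0.0, 1.0]]
outputs, word_mass = dummy_cross_attention(clips, words, dummies)
```

In this toy setup the clip that matches the word keeps most of its attention on the text, while the unrelated clip hands most of its attention to the dummy and so receives almost no text signal.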


Results from the Paper


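| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Moment Retrieval | Charades-STA | CG-DETR | R@1 IoU=0.5 | 58.44 | #8 |
| Moment Retrieval | Charades-STA | CG-DETR | R@1 IoU=0.7 | 36.34 | #8 |
| Moment Retrieval | QVHighlights | CG-DETR | mAP | 42.86 | #10 |
| Moment Retrieval | QVHighlights | CG-DETR | R@1 IoU=0.5 | 65.43 | #5 |
| Moment Retrieval | QVHighlights | CG-DETR | R@1 IoU=0.7 | 48.38 | #7 |
| Moment Retrieval | QVHighlights | CG-DETR | mAP@0.5 | 64.51 | #7 |
| Moment Retrieval | QVHighlights | CG-DETR | mAP@0.75 | 42.77 | #9 |
| Highlight Detection | QVHighlights | CG-DETR (w/ PT) | mAP | 40.71 | #3 |
| Highlight Detection | QVHighlights | CG-DETR (w/ PT) | Hit@1 | 66.60 | #2 |
| Highlight Detection | QVHighlights | CG-DETR | mAP | 40.33 | #5 |
| Highlight Detection | QVHighlights | CG-DETR | Hit@1 | 66.21 | #4 |
| Moment Retrieval | QVHighlights | CG-DETR (w/ PT) | mAP | 47.97 | #1 |
| Moment Retrieval | QVHighlights | CG-DETR (w/ PT) | R@1 IoU=0.5 | 68.48 | #1 |
| Moment Retrieval | QVHighlights | CG-DETR (w/ PT) | R@1 IoU=0.7 | 53.11 | #1 |
| Moment Retrieval | QVHighlights | CG-DETR (w/ PT) | mAP@0.5 | 69.40 | #1 |
| Moment Retrieval | QVHighlights | CG-DETR (w/ PT) | mAP@0.75 | 49.12 | #1 |
| Natural Language Moment Retrieval | TACoS | CG-DETR | R@1 IoU=0.3 | 52.23 | #2 |
| Natural Language Moment Retrieval | TACoS | CG-DETR | R@1 IoU=0.5 | 39.61 | #2 |
| Natural Language Moment Retrieval | TACoS | CG-DETR | R@1 IoU=0.7 | 22.23 | #3 |
| Natural Language Moment Retrieval | TACoS | CG-DETR | mIoU | 36.48 | #2 |
| Highlight Detection | TvSum | CG-DETR | mAP | 86.8 | #1 |
| Highlight Detection | YouTube Highlights | CG-DETR | mAP | 75.9 | #2 |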
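
For readers unfamiliar with the "R@1 IoU=0.5" style of metric in the results above, the following is a minimal sketch of how such a number is commonly computed. It assumes one ground-truth span per query and one top-ranked predicted span per query, each given as `(start, end)` in seconds. The helper names `temporal_iou` and `recall_at_1` and the sample spans are hypothetical, and the official benchmark scripts may differ in details such as handling multiple ground-truth windows.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold):
    """Percentage of queries whose top-1 span reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Made-up spans for two queries (illustration only, not benchmark data).
top1_preds = [(0.0, 10.0), (5.0, 15.0)]
gts = [(0.0, 8.0), (10.0, 20.0)]

r1_at_03 = recall_at_1(top1_preds, gts, 0.3)
r1_at_05 = recall_at_1(top1_preds, gts, 0.5)
```

Here the first prediction overlaps its ground truth with IoU 0.8 and the second with IoU of about 0.33, so both count as hits at a threshold of 0.3 but only the first at 0.5.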
