TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Segmentation	COIN	Norton	Frame accuracy	69.8	# 3
Zero-Shot Video Retrieval	MSR-VTT	Norton	text-to-video R@1	10.7	# 30
Zero-Shot Video Retrieval	MSR-VTT	Norton	text-to-video R@5	24.1	# 30
Video Question Answering	MSRVTT-MC	Norton	Accuracy	92.7	# 6
Zero-Shot Video Retrieval	YouCook2	Norton	text-to-video R@1	24.2	# 1
Zero-Shot Video Retrieval	YouCook2	Norton	text-to-video R@5	51.9	# 1
Zero-Shot Video Retrieval	YouCook2	Norton	text-to-video R@10	64.1	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	Cap. Avg. R@1	75.5	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	Cap. Avg. R@5	95.0	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	Cap. Avg. R@10	97.7	# 2
Long Video Retrieval (Background Removed)	YouCook2	Norton	DTW R@1	88.7	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	DTW R@5	98.8	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	DTW R@10	99.5	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	OTAM R@1	88.9	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	OTAM R@5	98.4	# 1
Long Video Retrieval (Background Removed)	YouCook2	Norton	OTAM R@10	99.5	# 1

Multi-granularity Correspondence Learning from Long-term Noisy Videos

30 Jan 2024 · Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng

Existing video-language studies mainly focus on learning from short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos. To address this issue, one feasible solution is to learn the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. Specifically, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton exploits potentially faulty negative samples in clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at https://lin-yijie.github.io/projects/Norton.
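The two building blocks named in the abstract, an OT assignment between clips and captions and a soft-maximum over frame-word similarities, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`sinkhorn`, `soft_max_similarity`), the uniform marginals, the entropic regularizer `eps`, the temperature `alpha`, and the toy similarity matrix are all assumptions made for illustration. The sketch uses standard Sinkhorn iterations and a log-sum-exp soft-maximum, and it omits the alignable prompt bucket.

```python
import numpy as np

def sinkhorn(sim, eps=0.1, n_iters=50):
    """Entropic OT between clips (rows) and captions (columns).

    Assumes uniform marginals; returns a transport plan whose large
    entries indicate which clip should be aligned with which caption.
    """
    n, m = sim.shape
    K = np.exp(sim / eps)            # Gibbs kernel built from similarities
    a = np.full(n, 1.0 / n)          # uniform mass over clips (assumption)
    b = np.full(m, 1.0 / m)          # uniform mass over captions (assumption)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def soft_max_similarity(frame_emb, word_emb, alpha=0.1):
    """Fine-grained clip-caption score via a log-sum-exp soft-maximum.

    For each word, take a smooth maximum over frames, then average
    over words. Smaller alpha is closer to a hard maximum.
    """
    sim = frame_emb @ word_emb.T     # shape: (frames, words)
    m = sim.max(axis=0)              # subtract the max for numerical stability
    lse = m + alpha * np.log(np.exp((sim - m) / alpha).sum(axis=0))
    return float(lse.mean())

# Toy example (hypothetical values): captions 0 and 1 are swapped in time
# relative to the clips, i.e. an asynchronous clip-caption pairing.
sim = np.array([[0.1, 0.9, 0.0],
                [0.8, 0.2, 0.1],
                [0.0, 0.1, 0.7]])
plan = sinkhorn(sim)
realigned = plan.argmax(axis=1)      # caption index assigned to each clip
print(realigned)
```

In this toy case the transport plan concentrates its mass on the swapped pairing rather than on the diagonal, which is the sense in which an OT assignment can realign asynchronous pairs or soften the one-hot contrastive target.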

Code

XLearning-SCU/2024-ICLR-Norton

Tasks

Action Segmentation

Long Video Retrieval (Background Removed)

Video Retrieval

Video Understanding

Datasets

MSR-VTT

HowTo100M

YouCook2

COIN

MSRVTT-MC

Results from the Paper

Ranked #1 on Zero-Shot Video Retrieval on YouCook2

Methods

Focus
