TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	text-to-video R@1	66.8	# 5
Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	text-to-video R@5	89.1	# 3
Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	text-to-video R@10	94.9	# 3
Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	video-to-text R@1	64.4	# 2
Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	video-to-text R@5	89.1	# 1
Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	video-to-text R@10	94.8	# 1
Zero-Shot Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	text-to-video R@1	42.8	# 3
Zero-Shot Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	video-to-text R@1	40.7	# 4
Zero-Shot Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	text-to-video R@10	79.8	# 4
Zero-Shot Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	text-to-video R@5	69.6	# 3
Zero-Shot Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	video-to-text R@5	67.6	# 5
Zero-Shot Video Retrieval	ActivityNet	UMT-L (ViT-L/16)	video-to-text R@10	78.6	# 5
Video Question Answering	ActivityNet-QA	UMT-L (ViT-L/16)	Accuracy	47.9	# 10
Action Recognition	AVA v2.2	UMT-L (ViT-L/16)	mAP	39.8	# 8
Zero-Shot Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	text-to-video R@1	48.6	# 5
Zero-Shot Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	text-to-video R@5	72.9	# 5
Zero-Shot Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	text-to-video R@10	79.0	# 6
Zero-Shot Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	video-to-text R@1	49.9	# 4
Zero-Shot Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	video-to-text R@5	74.8	# 4
Zero-Shot Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	video-to-text R@10	81.4	# 4
Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	text-to-video R@1	70.4	# 5
Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	text-to-video R@5	90.1	# 2
Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	text-to-video R@10	93.5	# 2
Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	video-to-text R@1	65.7	# 3
Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	video-to-text R@10	93.3	# 2
Video Retrieval	DiDeMo	UMT-L (ViT-L/16)	video-to-text R@5	89.6	# 2
Action Classification	Kinetics-400	UMT-L (ViT-L/16)	Acc@1	90.6	# 6
Action Classification	Kinetics-400	UMT-L (ViT-L/16)	Acc@5	98.7	# 2
Action Classification	Kinetics-400	Unmasked Teacher (ViT-L)	Acc@1	90.6	# 6
Action Classification	Kinetics-400	Unmasked Teacher (ViT-L)	Acc@5	98.7	# 2
Action Classification	Kinetics-400	Unmasked Teacher (ViT-L)	FLOPs (G) x views	1434×3×4	# 1
Action Classification	Kinetics-400	Unmasked Teacher (ViT-L)	Parameters (M)	304	# 26
Action Classification	Kinetics-600	UMT-L (ViT-L/16)	Top-1 Accuracy	90.5	# 8
Action Classification	Kinetics-600	UMT-L (ViT-L/16)	Top-5 Accuracy	98.8	# 2
Action Classification	Kinetics-700	UMT-L (ViT-L/16)	Top-1 Accuracy	83.6	# 5
Action Classification	Kinetics-700	UMT-L (ViT-L/16)	Top-5 Accuracy	96.7	# 1
Video Retrieval	LSMDC	UMT-L (ViT-L/16)	text-to-video R@1	43.0	# 3
Video Retrieval	LSMDC	UMT-L (ViT-L/16)	text-to-video R@5	65.5	# 2
Video Retrieval	LSMDC	UMT-L (ViT-L/16)	text-to-video R@10	73.0	# 2
Video Retrieval	LSMDC	UMT-L (ViT-L/16)	video-to-text R@1	41.4	# 2
Video Retrieval	LSMDC	UMT-L (ViT-L/16)	video-to-text R@5	64.3	# 3
Video Retrieval	LSMDC	UMT-L (ViT-L/16)	video-to-text R@10	71.5	# 2
Zero-Shot Video Retrieval	LSMDC	UMT-L (ViT-L/16)	text-to-video R@1	25.2	# 3
Zero-Shot Video Retrieval	LSMDC	UMT-L (ViT-L/16)	video-to-text R@1	23.2	# 3
Zero-Shot Video Retrieval	LSMDC	UMT-L (ViT-L/16)	text-to-video R@5	43.0	# 4
Zero-Shot Video Retrieval	LSMDC	UMT-L (ViT-L/16)	text-to-video R@10	50.5	# 4
Zero-Shot Video Retrieval	LSMDC	UMT-L (ViT-L/16)	video-to-text R@5	37.7	# 3
Zero-Shot Video Retrieval	LSMDC	UMT-L (ViT-L/16)	video-to-text R@10	44.2	# 3
Action Classification	MiT	UMT-L (ViT-L/16)	Top 1 Accuracy	48.7	# 4
Action Classification	MiT	UMT-L (ViT-L/16)	Top 5 Accuracy	78.2	# 1
Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	text-to-video R@1	58.8	# 4
Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	text-to-video R@5	81.0	# 3
Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	text-to-video R@10	87.1	# 4
Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	video-to-text R@1	58.6	# 5
Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	video-to-text R@5	81.6	# 4
Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	video-to-text R@10	86.5	# 5
Zero-Shot Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	text-to-video R@1	42.6	# 7
Zero-Shot Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	text-to-video R@5	64.4	# 8
Zero-Shot Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	text-to-video R@10	73.1	# 8
Zero-Shot Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	video-to-text R@1	38.6	# 5
Zero-Shot Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	video-to-text R@5	59.8	# 5
Zero-Shot Video Retrieval	MSR-VTT	UMT-L (ViT-L/16)	video-to-text R@10	69.6	# 5
Visual Question Answering (VQA)	MSRVTT-QA	UMT-L (ViT-L/16)	Accuracy	47.1	# 6
Zero-Shot Video Retrieval	MSVD	UMT-L (ViT-L/16)	text-to-video R@1	49.0	# 6
Zero-Shot Video Retrieval	MSVD	UMT-L (ViT-L/16)	video-to-text R@1	74.5	# 4
Zero-Shot Video Retrieval	MSVD	UMT-L (ViT-L/16)	text-to-video R@5	76.9	# 6
Zero-Shot Video Retrieval	MSVD	UMT-L (ViT-L/16)	text-to-video R@10	84.7	# 8
Zero-Shot Video Retrieval	MSVD	UMT-L (ViT-L/16)	video-to-text R@5	89.7	# 6
Zero-Shot Video Retrieval	MSVD	UMT-L (ViT-L/16)	video-to-text R@10	92.8	# 6
Visual Question Answering (VQA)	MSVD-QA	UMT-L (ViT-L/16)	Accuracy	55.2	# 13
Video Retrieval	SSv2-label retrieval	UMT-L (ViT-L/16)	text-to-video R@1	73.3	# 1
Video Retrieval	SSv2-label retrieval	UMT-L (ViT-L/16)	text-to-video R@5	92.7	# 2
Video Retrieval	SSv2-label retrieval	UMT-L (ViT-L/16)	text-to-video R@10	96.6	# 1
Video Retrieval	SSv2-template retrieval	UMT-L (ViT-L/16)	text-to-video R@1	90.8	# 1
Video Retrieval	SSv2-template retrieval	UMT-L (ViT-L/16)	text-to-video R@5	100.0	# 1
Video Retrieval	SSv2-template retrieval	UMT-L (ViT-L/16)	text-to-video R@10	100.0	# 1
Video Retrieval	VATEX	Unmasked Teacher	text-to-video R@1	72	# 4
Video Retrieval	VATEX	Unmasked Teacher	text-to-video R@10	97.8	# 3
Video Retrieval	VATEX	Unmasked Teacher	video-to-text R@1	86.0	# 3
Video Retrieval	VATEX	Unmasked Teacher	video-to-text R@10	99.6	# 1
Video Retrieval	VATEX	Unmasked Teacher	text-to-video R@5	95.1	# 3
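
The R@K columns above report recall at rank K as a percentage. The following is a minimal sketch of how such a figure is commonly computed, assuming a square similarity matrix in which query i has exactly one correct candidate, candidate i. The function name `recall_at_k` and the toy scores are illustrative and are not taken from the paper's evaluation code; benchmarks with several captions per video would need a many-to-one mapping instead.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose correct candidate ranks within the top k.

    Assumption: similarity[i][j] scores query i against candidate j, and
    candidate i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap query and candidate roles (text-to-video <-> video-to-text)."""
    return [list(col) for col in zip(*matrix)]


# Toy text-to-video scores: rows are captions, columns are videos.
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
t2v_r1 = recall_at_k(toy, 1)
t2v_r2 = recall_at_k(toy, 2)
v2t_r1 = recall_at_k(transpose(toy), 1)
```

Transposing the matrix before calling the same function gives the opposite retrieval direction, which is why the table lists text-to-video and video-to-text separately.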

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-ssv2-label-retrieval)](https://paperswithcode.com/sota/video-retrieval-on-ssv2-label-retrieval?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-ssv2-template-retrieval)](https://paperswithcode.com/sota/video-retrieval-on-ssv2-template-retrieval?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/zero-shot-video-retrieval-on-activitynet)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-activitynet?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/zero-shot-video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-lsmdc?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/action-classification-on-moments-in-time)](https://paperswithcode.com/sota/action-classification-on-moments-in-time?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-vatex)](https://paperswithcode.com/sota/video-retrieval-on-vatex?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/zero-shot-video-retrieval-on-didemo)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-didemo?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/action-classification-on-kinetics-700)](https://paperswithcode.com/sota/action-classification-on-kinetics-700?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/visual-question-answering-on-msrvtt-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-1?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/action-recognition-on-ava-v2-2)](https://paperswithcode.com/sota/action-recognition-on-ava-v2-2?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/video-question-answering-on-activitynet-qa)](https://paperswithcode.com/sota/video-question-answering-on-activitynet-qa?p=unmasked-teacher-towards-training-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unmasked-teacher-towards-training-efficient/visual-question-answering-on-msvd-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1?p=unmasked-teacher-towards-training-efficient)`
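
The badge snippets above all follow one URL pattern: a paper slug plus a benchmark slug. A small sketch that rebuilds a snippet from those two parts is shown below; the helper name `pwc_badge` is hypothetical, and the pattern is inferred only from the lines listed here, not from any official API.

```python
def pwc_badge(paper_slug: str, benchmark_slug: str) -> str:
    """Build badge Markdown following the URL pattern seen in the list above."""
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    link_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


# Slugs copied from the LSMDC video retrieval badge in the list above.
lsmdc_badge = pwc_badge(
    "unmasked-teacher-towards-training-efficient",
    "video-retrieval-on-lsmdc",
)
```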

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023 · Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao

Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performance on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
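
The masking-and-alignment idea in the abstract can be illustrated with a toy sketch. It rests on several assumptions the abstract does not state: per-token importance scores are given (how they are obtained is left open), the alignment objective is a mean squared error between L2-normalized features, and the mask ratio is arbitrary. The names `select_unmasked` and `alignment_loss` are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def select_unmasked(scores: List[float], mask_ratio: float) -> List[int]:
    """Keep the highest-scoring tokens; the rest are treated as masked."""
    num_keep = max(1, round(len(scores) * (1.0 - mask_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def alignment_loss(student_kept: List[Vector],
                   teacher_all: List[Vector],
                   keep: List[int]) -> float:
    """Mean squared error between normalized student and teacher features.

    Assumption: student_kept[j] is the student's output for token keep[j],
    while teacher_all holds teacher features for every token.
    """
    total, count = 0.0, 0
    for s_vec, idx in zip(student_kept, keep):
        s = l2_normalize(s_vec)
        t = l2_normalize(teacher_all[idx])
        total += sum((a - b) ** 2 for a, b in zip(s, t))
        count += len(s)
    return total / count


# Toy example: 5 tokens with 2-dimensional features, 60% of tokens masked.
scores = [0.1, 0.9, 0.3, 0.7, 0.2]          # hypothetical importance scores
keep = select_unmasked(scores, mask_ratio=0.6)
teacher = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0]]           # student outputs for kept tokens
loss = alignment_loss(student, teacher, keep)
```

Because the student only processes the kept tokens, the cost per step shrinks roughly with the mask ratio, which is the data-efficiency argument the abstract makes.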

Code

opengvlab/unmasked_teacher official

243

Tasks

Action Classification

Action Recognition

Spatio-Temporal Action Localization

Video Question Answering

Video Retrieval

Visual Question Answering (VQA)

Zero-Shot Video Retrieval

Datasets

Kinetics

ActivityNet

Kinetics 400

MSR-VTT

MSVD

Something-Something V2

ActivityNet Captions

DiDeMo

WebVid

Kinetics-600

LSMDC

CC12M

VATEX

AVA

MiT

ActivityNet-QA

Kinetics-700

MSRVTT-QA

MSVD-QA

MSRVTT-MC

Results from the Paper

Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)

Methods

ALIGN
