Asymmetric Masked Distillation for Pre-Training Small Foundation Models

6 Nov 2023  ·  Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang

Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models, but large foundation models incur high computational costs. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context with a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with the ViT-B model, and 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
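To make the asymmetric design concrete, below is a minimal PyTorch sketch of the two pieces the abstract describes: an asymmetric masking strategy in which the student's visible tokens form a subset of the teacher's, and a feature alignment loss between the two encoders. The tiny Transformer encoders, the 0.5/0.9 masking ratios, the subset-style mask sampling, and the single-layer MSE alignment are illustrative assumptions for this sketch, not the paper's exact configuration; the released code implements the actual multi-layer alignment and the MAE pixel-reconstruction objective.

```python
# Illustrative sketch of asymmetric masked distillation (AMD). All module
# shapes and hyperparameters here are assumptions, not the authors' setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: embeds raw patches, runs attention blocks."""

    def __init__(self, patch_dim: int, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(self.embed(x))


def asymmetric_masking(num_tokens: int, teacher_ratio: float,
                       student_ratio: float, device: torch.device):
    """Sample visible-token indices so the student's (smaller) visible set is
    a subset of the teacher's, making token-wise feature alignment trivial."""
    perm = torch.randperm(num_tokens, device=device)
    n_teacher = int(num_tokens * (1 - teacher_ratio))
    n_student = int(num_tokens * (1 - student_ratio))
    return perm[:n_teacher], perm[:n_student]  # student ids are a prefix of teacher ids


def amd_alignment_loss(teacher, student, proj, patches,
                       teacher_ratio=0.5, student_ratio=0.9):
    """Asymmetric masking + encoder feature alignment for one batch.

    patches: (B, N, P) flattened patch pixels. The student's usual MAE
    pixel-reconstruction loss is omitted here for brevity.
    """
    ids_t, ids_s = asymmetric_masking(patches.size(1), teacher_ratio,
                                      student_ratio, patches.device)
    with torch.no_grad():                     # teacher is a frozen pre-trained MAE
        t_feats = teacher(patches[:, ids_t])  # low masking ratio: more context
    s_feats = student(patches[:, ids_s])      # high masking ratio, as in MAE

    # Align student features with teacher features on the tokens both encoders
    # see (by construction, the first len(ids_s) teacher tokens).
    return F.mse_loss(proj(s_feats), t_feats[:, :len(ids_s)])


if __name__ == "__main__":
    B, N, P = 2, 196, 768                     # 14x14 grid of 16x16x3 patches
    teacher = TinyEncoder(P, dim=768)         # wider "teacher" (ViT-B-like)
    student = TinyEncoder(P, dim=384)         # narrower "student" (ViT-S-like)
    proj = nn.Linear(384, 768)                # map student dim to teacher dim
    loss = amd_alignment_loss(teacher, student, proj, torch.randn(B, N, P))
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

Sampling the student's visible tokens as a subset of the teacher's is what makes the alignment loss cheap to compute: every student token has a directly corresponding teacher token, so no interpolation or attention-based matching is needed, while the teacher still encodes extra context tokens the student never sees.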

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Action Recognition | AVA v2.2 | AMD (ViT-B/16) | mAP | 33.5 | #21 |
| Action Recognition | HMDB-51 | AMD (ViT-B/16) | Average accuracy of 3 splits | 79.6 | #23 |
| Image Classification | ImageNet | AMD (ViT-S/16) | Top 1 Accuracy | 82.1% | #525 |
| | | | Number of params | 22M | #557 |
| Image Classification | ImageNet | AMD (ViT-B/16) | Top 1 Accuracy | 84.6% | #288 |
| | | | Number of params | 87M | #822 |
| Action Classification | Kinetics-400 | AMD (ViT-S/16) | Acc@1 | 80.1 | #97 |
| | | | Acc@5 | 94.5 | #65 |
| | | | FLOPs (G) x views | 57x15 | #1 |
| | | | Parameters (M) | 22 | #12 |
| Action Classification | Kinetics-400 | AMD (ViT-B/16) | Acc@1 | 82.2 | #72 |
| | | | Acc@5 | 95.3 | #48 |
| | | | FLOPs (G) x views | 180x15 | #1 |
| | | | Parameters (M) | 87 | #24 |
| Action Recognition | Something-Something V2 | AMD (ViT-S/16) | Top-1 Accuracy | 70.2 | #36 |
| | | | Top-5 Accuracy | 92.5 | #26 |
| | | | Parameters (M) | 22 | #34 |
| | | | GFLOPs x views | 57x6 | #6 |
| Action Recognition | Something-Something V2 | AMD (ViT-B/16) | Top-1 Accuracy | 73.3 | #20 |
| | | | Top-5 Accuracy | 94.0 | #13 |
| | | | Parameters (M) | 87 | #25 |
| | | | GFLOPs x views | 180x6 | #6 |
| Action Recognition | UCF101 | AMD (ViT-B/16) | 3-fold Accuracy | 97.1 | #20 |
