Asymmetric Masked Distillation for Pre-Training Small Foundation Models

6 Nov 2023  ·  Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang

Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models, but large foundation models incur high computational costs. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context with a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with the ViT-B model, and 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
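To make the asymmetric design concrete, below is a minimal PyTorch sketch of the two pieces the abstract describes: an asymmetric masking strategy in which the student's visible tokens form a subset of the teacher's, and a feature alignment loss between the two encoders. The tiny Transformer encoders, the 0.5/0.9 masking ratios, the subset-style mask sampling, and the single-layer MSE alignment are illustrative assumptions for this sketch, not the paper's exact configuration; the released code implements the actual multi-layer alignment and the MAE pixel-reconstruction objective.

```python
# Illustrative sketch of asymmetric masked distillation (AMD). All module
# shapes and hyperparameters here are assumptions, not the authors' setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: embeds raw patches, runs attention blocks."""

    def __init__(self, patch_dim: int, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(self.embed(x))


def asymmetric_masking(num_tokens: int, teacher_ratio: float,
                       student_ratio: float, device: torch.device):
    """Sample visible-token indices so the student's (smaller) visible set is
    a subset of the teacher's, making token-wise feature alignment trivial."""
    perm = torch.randperm(num_tokens, device=device)
    n_teacher = int(num_tokens * (1 - teacher_ratio))
    n_student = int(num_tokens * (1 - student_ratio))
    return perm[:n_teacher], perm[:n_student]  # student ids are a prefix of teacher ids


def amd_alignment_loss(teacher, student, proj, patches,
                       teacher_ratio=0.5, student_ratio=0.9):
    """Asymmetric masking + encoder feature alignment for one batch.

    patches: (B, N, P) flattened patch pixels. The student's usual MAE
    pixel-reconstruction loss is omitted here for brevity.
    """
    ids_t, ids_s = asymmetric_masking(patches.size(1), teacher_ratio,
                                      student_ratio, patches.device)
    with torch.no_grad():                     # teacher is a frozen pre-trained MAE
        t_feats = teacher(patches[:, ids_t])  # low masking ratio: more context
    s_feats = student(patches[:, ids_s])      # high masking ratio, as in MAE

    # Align student features with teacher features on the tokens both encoders
    # see (by construction, the first len(ids_s) teacher tokens).
    return F.mse_loss(proj(s_feats), t_feats[:, :len(ids_s)])


if __name__ == "__main__":
    B, N, P = 2, 196, 768                     # 14x14 grid of 16x16x3 patches
    teacher = TinyEncoder(P, dim=768)         # wider "teacher" (ViT-B-like)
    student = TinyEncoder(P, dim=384)         # narrower "student" (ViT-S-like)
    proj = nn.Linear(384, 768)                # map student dim to teacher dim
    loss = amd_alignment_loss(teacher, student, proj, torch.randn(B, N, P))
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```

Sampling the student's visible tokens as a subset of the teacher's is what makes the alignment loss cheap to compute: every student token has a directly corresponding teacher token, so no interpolation or attention-based matching is needed, while the teacher still encodes extra context tokens the student never sees.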

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Action Recognition | AVA v2.2 | AMD (ViT-B/16) | mAP | 33.5 | #21 |
| Action Recognition | HMDB-51 | AMD (ViT-B/16) | Average accuracy of 3 splits | 79.6 | #23 |
| Image Classification | ImageNet | AMD (ViT-S/16) | Top 1 Accuracy | 82.1% | #525 |
| | | | Number of params | 22M | #557 |
| Image Classification | ImageNet | AMD (ViT-B/16) | Top 1 Accuracy | 84.6% | #288 |
| | | | Number of params | 87M | #822 |
| Action Classification | Kinetics-400 | AMD (ViT-S/16) | Acc@1 | 80.1 | #97 |
| | | | Acc@5 | 94.5 | #65 |
| | | | FLOPs (G) x views | 57x15 | #1 |
| | | | Parameters (M) | 22 | #12 |
| Action Classification | Kinetics-400 | AMD (ViT-B/16) | Acc@1 | 82.2 | #72 |
| | | | Acc@5 | 95.3 | #48 |
| | | | FLOPs (G) x views | 180x15 | #1 |
| | | | Parameters (M) | 87 | #24 |
| Action Recognition | Something-Something V2 | AMD (ViT-S/16) | Top-1 Accuracy | 70.2 | #36 |
| | | | Top-5 Accuracy | 92.5 | #26 |
| | | | Parameters (M) | 22 | #34 |
| | | | GFLOPs x views | 57x6 | #6 |
| Action Recognition | Something-Something V2 | AMD (ViT-B/16) | Top-1 Accuracy | 73.3 | #20 |
| | | | Top-5 Accuracy | 94.0 | #13 |
| | | | Parameters (M) | 87 | #25 |
| | | | GFLOPs x views | 180x6 | #6 |
| Action Recognition | UCF101 | AMD (ViT-B/16) | 3-fold Accuracy | 97.1 | #20 |
