The effectiveness of MAE pre-pretraining for billion-scale pretraining

This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification, and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.
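The pre-pretraining stage relies on MAE: mask a large fraction of image patches at random, encode only the visible ones, and train the model to reconstruct the masked patches. A minimal NumPy sketch of that masking-and-reconstruction objective is below; the "encoder" and "decoder" are placeholder stand-ins (a mean-patch predictor), not the paper's ViT, and `mask_ratio=0.75` follows the original MAE recipe.

```python
import numpy as np

def mae_pretraining_step(image_patches, mask_ratio=0.75, rng=None):
    """One conceptual MAE step: randomly mask most patches, keep the
    rest as encoder input, and score reconstruction of the masked
    patches with a per-element MSE. Illustrative placeholder only."""
    rng = rng or np.random.default_rng(0)
    n_patches = image_patches.shape[0]
    n_masked = int(mask_ratio * n_patches)

    # Random permutation: first part is masked, the remainder is visible.
    perm = rng.permutation(n_patches)
    masked_idx = perm[:n_masked]
    visible_idx = perm[n_masked:]

    visible = image_patches[visible_idx]       # what the encoder would see
    target = image_patches[masked_idx]         # reconstruction target

    # Placeholder "reconstruction": predict the mean visible patch.
    prediction = np.tile(visible.mean(axis=0), (n_masked, 1))
    loss = np.mean((prediction - target) ** 2)  # MSE only on masked patches
    return visible_idx, masked_idx, loss

# 196 patches (a 14x14 grid) of 768 pixel values each, as for a ViT-B/16.
patches = np.random.default_rng(1).normal(size=(196, 768))
vis, msk, loss = mae_pretraining_step(patches)
print(len(vis), len(msk))  # 49 visible, 147 masked at a 75% mask ratio
```

The high mask ratio is what makes MAE cheap enough to serve as an initialization stage: the encoder processes only ~25% of the patches per image.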

ICCV 2023

Results from the Paper


 Ranked #1 on Few-Shot Image Classification on ImageNet - 10-shot (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Zero-Shot Transfer Image Classification | Food-101 | MAWS (ViT-2B) | Top 1 Accuracy | 96.2 | #1 |
| Image Classification | ImageNet | MAWS (ViT-6.5B) | Top 1 Accuracy | 90.1% | #16 |
| Image Classification | ImageNet | MAWS (ViT-6.5B) | Number of params | 6500M | #977 |
| Image Classification | ImageNet | MAWS (ViT-B) | Top 1 Accuracy | 86.8% | #121 |
| Image Classification | ImageNet | MAWS (ViT-L) | Top 1 Accuracy | 88.8% | #38 |
| Zero-Shot Transfer Image Classification | ImageNet | MAWS (ViT-2B) | Accuracy (Private) | 82.1 | #12 |
| Image Classification | ImageNet | MAWS (ViT-2B) | Top 1 Accuracy | 89.8% | #21 |
| Image Classification | ImageNet | MAWS (ViT-2B) | Number of params | 2000M | #965 |
| Image Classification | ImageNet | MAWS (ViT-H) | Top 1 Accuracy | 89.5% | #28 |
| Image Classification | ImageNet | MAWS (ViT-H) | Number of params | 650M | #945 |
| Zero-Shot Transfer Image Classification | ImageNet | MAWS (ViT-H) | Accuracy (Private) | 81.1 | #15 |
| Few-Shot Image Classification | ImageNet - 10-shot | MAWS (ViT-H) | Top 1 Accuracy | 82.5 | #4 |
| Few-Shot Image Classification | ImageNet - 10-shot | MAWS (ViT-2B) | Top 1 Accuracy | 83.7 | #3 |
| Few-Shot Image Classification | ImageNet - 10-shot | MAWS (ViT-6.5B) | Top 1 Accuracy | 84.6 | #1 |
| Few-Shot Image Classification | ImageNet - 1-shot | MAWS (ViT-H) | Top 1 Accuracy | 57.1 | #8 |
| Few-Shot Image Classification | ImageNet - 1-shot | MAWS (ViT-6.5B) | Top 1 Accuracy | 63.6 | #2 |
| Few-Shot Image Classification | ImageNet - 1-shot | MAWS (ViT-2B) | Top 1 Accuracy | 62.1 | #7 |
| Few-Shot Image Classification | ImageNet - 5-shot | MAWS (ViT-6.5B) | Top 1 Accuracy | 82.6 | #2 |
| Few-Shot Image Classification | ImageNet - 5-shot | MAWS (ViT-H) | Top 1 Accuracy | 79.8 | #4 |
| Few-Shot Image Classification | ImageNet - 5-shot | MAWS (ViT-2B) | Top 1 Accuracy | 81.5 | #3 |
| Image Classification | ImageNet ReaL | MAWS (ViT-H) | Accuracy | 90.8% | #12 |
| Image Classification | ImageNet ReaL | MAWS (ViT-6.5B) | Accuracy | 91.1% | #5 |
| Image Classification | ImageNet ReaL | MAWS (ViT-2B) | Accuracy | 90.9% | #9 |
| Image Classification | ImageNet V2 | MAWS (ViT-6.5B) | Top 1 Accuracy | 84.0 | #4 |
| Image Classification | ImageNet V2 | MAWS (ViT-2B) | Top 1 Accuracy | 83.0 | #7 |
| Image Classification | iNaturalist 2018 | MAWS (ViT-2B) | Top-1 Accuracy | 91.3% | #3 |
| Few-Shot Image Classification | iNaturalist 2018 - 10-shot | MAWS (ViT-2B) | Top 1 Accuracy | 80.3 | #1 |
| Few-Shot Image Classification | iNaturalist 2018 - 1-shot | MAWS (ViT-2B) | Top 1 Accuracy | 35.5 | #1 |
| Few-Shot Image Classification | iNaturalist 2018 - 5-shot | MAWS (ViT-2B) | Top 1 Accuracy | 72.8 | #1 |
| Image Classification | ObjectNet | MAWS (ViT-H) | Top-1 Accuracy | 72.6 | #9 |
| Image Classification | ObjectNet | MAWS (ViT-2B) | Top-1 Accuracy | 75.8 | #8 |
| Image Classification | ObjectNet | MAWS (ViT-6.5B) | Top-1 Accuracy | 77.9 | #7 |
| Action Recognition | Something-Something V2 | MAWS (ViT-L) | Top-1 Accuracy | 74.4 | #14 |
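Several rows above report 1-, 5-, and 10-shot accuracy. These benchmarks are typically built by sampling k labeled training examples per class and evaluating on the full test set. A generic sketch of constructing such a k-shot split is below; the function name and label format are illustrative, not the paper's exact evaluation code.

```python
import random
from collections import defaultdict

def build_kshot_split(labels, k, seed=0):
    """Sample k training examples per class, the usual construction for
    1/5/10-shot classification benchmarks. `labels` maps example index
    -> class id. Generic sketch, not the paper's official splits."""
    per_class = defaultdict(list)
    for idx, cls in labels.items():
        per_class[cls].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    split = []
    for cls in sorted(per_class):
        pool = sorted(per_class[cls])
        rng.shuffle(pool)
        split.extend(pool[:k])  # keep only k examples of this class
    return split

# Toy dataset: 3 classes with 5 examples each.
labels = {i: i % 3 for i in range(15)}
shot1 = build_kshot_split(labels, k=1)
print(len(shot1))  # one example per class -> 3
```

With splits this small, the seed matters: low-shot results are usually averaged over several sampled splits to reduce variance.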

Methods