MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations
We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. The motivation behind MIM-Refiner is rooted in the insight that optimal representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are connected to diverse intermediate layers. In each head, a modified nearest neighbor objective helps to construct respective semantic clusters. The refinement process is short but effective. Within a few epochs, we refine the features of MIM models from subpar to state-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, achieves new state-of-the-art results in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. In ImageNet-1K 1-shot classification, MIM-Refiner sets a new state-of-the-art of 64.2%, outperforming larger models that were trained on up to 2000x more data such as DINOv2-g, OpenCLIP-G and MAWS-6.5B. Project page: https://ml-jku.github.io/MIM-Refiner
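The abstract describes contrastive heads attached to several intermediate layers, each trained with a nearest-neighbor objective. The sketch below illustrates that general idea with a generic nearest-neighbor InfoNCE loss averaged over heads. It is not the paper's modified objective, whose details the abstract does not give. All names (`l2_normalize`, `nearest_neighbor`, `nn_info_nce`, `multi_head_loss`), the `queue` of past embeddings, and the temperature of 0.2 are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbor(query, queue):
    """Return the queue entry most similar to `query` (cosine, inputs normalized)."""
    return max(queue, key=lambda key: dot(query, key))


def nn_info_nce(view_a, view_b, queue, temperature=0.2):
    """Generic nearest-neighbor InfoNCE (assumed form, not the paper's variant).

    Each embedding in view_a is swapped for its nearest neighbor in `queue`;
    that neighbor must then match the same sample's embedding in view_b
    against all other samples in the batch.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    q = [l2_normalize(v) for v in queue]
    total = 0.0
    for i, anchor in enumerate(a):
        neighbor = nearest_neighbor(anchor, q)
        logits = [dot(neighbor, other) / temperature for other in b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def multi_head_loss(heads_a, heads_b, queues, temperature=0.2):
    """Average the loss over heads, one head per chosen intermediate layer."""
    losses = [
        nn_info_nce(a, b, q, temperature)
        for a, b, q in zip(heads_a, heads_b, queues)
    ]
    return sum(losses) / len(losses)


# Toy data: two samples, two views, and a queue holding one entry per cluster.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]
queue = [[1.0, 0.0], [0.0, 1.0]]

aligned = nn_info_nce(view_a, view_b, queue)
mismatched = nn_info_nce(view_a, list(reversed(view_b)), queue)
two_heads = multi_head_loss([view_a, view_a], [view_b, view_b], [queue, queue])
```

In this toy setup the loss is small when the two views of a sample agree and large when they are swapped, which is the signal that pulls semantically similar samples into clusters. A real implementation would operate on batched tensors from the encoder's intermediate blocks rather than Python lists.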
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Clustering | ImageNet | MIM-Refiner (D2V2-ViT-H/14) | NMI | 87.2 | #1 |
| | | | Accuracy | 67.3 | #1 |
| | | | ARI | 42.2 | #4 |
| Image Clustering | ImageNet | MIM-Refiner (MAE-ViT-H/14) | NMI | 85.3 | #2 |
| | | | Accuracy | 64.6 | #2 |
| | | | ARI | 45.5 | #3 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (MAE-ViT-L/16) | Top 1 Accuracy | 82.8% | #9 |
| | | | Number of Params | 307M | #16 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (D2V2-ViT-L/16) | Top 1 Accuracy | 83.5% | #8 |
| | | | Number of Params | 307M | #16 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (MAE-ViT-H/14) | Top 1 Accuracy | 83.7% | #7 |
| | | | Number of Params | 632M | #6 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (MAE-ViT-2B/14) | Top 1 Accuracy | 84.5% | #5 |
| | | | Number of Params | 1890M | #2 |
| Self-Supervised Image Classification | ImageNet | MIM-Refiner (D2V2-ViT-H/14) | Top 1 Accuracy | 84.7% | #4 |
| | | | Number of Params | 632M | #6 |