Going deeper with Image Transformers

Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks. However, the optimization of image transformers has received little study so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization in such dedicated transformers. We make two changes to the transformer architecture that significantly improve the accuracy of deep transformers. This yields models whose performance does not saturate early with added depth: for instance, we obtain 86.5% top-1 accuracy on ImageNet when training with no external data, attaining the current SOTA with fewer FLOPs and parameters. Moreover, our best model establishes a new state of the art on ImageNet with Reassessed Labels and on ImageNet-V2 (matched frequency) in the setting with no additional training data. We share our code and models.
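The two architecture changes are not named in this abstract; in the paper they are LayerScale (a learnable per-channel scaling of each residual branch, initialized near zero so every block starts close to the identity) and class-attention layers. As a minimal sketch of the LayerScale residual update only (function and variable names here are illustrative, not from the released code):

```python
import numpy as np

def layer_scale_residual(x, block_out, lam):
    """LayerScale residual update: y = x + diag(lam) @ block_out.

    lam is a per-channel vector initialized to a small constant
    (on the order of 1e-1 down to 1e-6, smaller for deeper models),
    so each residual block initially perturbs the identity only
    slightly, which stabilizes the training of deep transformers.
    """
    return x + lam * block_out  # broadcasting scales each channel by lam

# Toy example: 2 tokens with 4 channels each.
x = np.ones((2, 4))                 # input to the block
block_out = np.full((2, 4), 10.0)   # output of an attention or FFN block
lam = np.full(4, 1e-4)              # small init keeps y close to x
y = layer_scale_residual(x, block_out, lam)
# y = 1 + 1e-4 * 10 = 1.001 in every entry
```

Compared with scaling the whole residual branch by a single scalar, the per-channel vector lets the optimizer open up individual channels at different rates during training.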

ICCV 2021

Results from the Paper


Ranked #5 on Image Classification on CIFAR-10 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Image Classification | CIFAR-10 | CaiT-M-36 U 224 | Percentage correct | 99.4 | #5 |
| Image Classification | CIFAR-100 | CaiT-M-36 U 224 | Percentage correct | 93.1 | #11 |
| Image Classification | Flowers-102 | CaiT-M-36 U 224 | Accuracy | 99.1 | #13 |
| Image Classification | ImageNet | CaiT-XXS-36 | Top-1 Accuracy | 82.2% | #511 |
| | | | Number of params | 17.3M | #523 |
| | | | GFLOPs | 14.3 | #335 |
| Image Classification | ImageNet | CaiT-XS-24 | Top-1 Accuracy | 84.1% | #326 |
| | | | Number of params | 26.6M | #614 |
| | | | GFLOPs | 19.3 | #364 |
| Image Classification | ImageNet | CaiT-XS-36 | Top-1 Accuracy | 84.8% | #271 |
| | | | Number of params | 38.6M | #666 |
| | | | GFLOPs | 28.8 | #389 |
| Image Classification | ImageNet | CaiT-S-24 | Top-1 Accuracy | 85.1% | #246 |
| | | | Number of params | 46.9M | #711 |
| | | | GFLOPs | 32.2 | #398 |
| Image Classification | ImageNet | CaiT-S-36 | Top-1 Accuracy | 85.4% | #222 |
| | | | Number of params | 68.2M | #787 |
| | | | GFLOPs | 48 | #421 |
| Image Classification | ImageNet | CaiT-M-24 | Top-1 Accuracy | 85.8% | #188 |
| | | | Number of params | 185.9M | #888 |
| | | | GFLOPs | 116.1 | #458 |
| Image Classification | ImageNet | CaiT-M-36 | Top-1 Accuracy | 86.1% | #171 |
| | | | Number of params | 270.9M | #909 |
| | | | GFLOPs | 173.3 | #464 |
| Image Classification | ImageNet | CaiT-M-36-448 | Top-1 Accuracy | 86.3% | #154 |
| | | | Number of params | 271M | #910 |
| | | | GFLOPs | 247.8 | #472 |
| Image Classification | ImageNet | CaiT-S-48 | Top-1 Accuracy | 85.3% | #232 |
| | | | Number of params | 89.5M | #846 |
| | | | GFLOPs | 63.8 | #435 |
| Image Classification | ImageNet | CaiT-XXS-24 | Top-1 Accuracy | 80.9% | #619 |
| | | | Number of params | 12M | #497 |
| | | | GFLOPs | 9.6 | #292 |
| Image Classification | ImageNet | CaiT-M-48-448 | Top-1 Accuracy | 86.5% | #136 |
| | | | Number of params | 438M | #931 |
| | | | GFLOPs | 377.3 | #480 |
| Image Classification | ImageNet ReaL | CaiT-M-36-448 | Accuracy | 90.2% | #19 |
| Image Classification | ImageNet V2 | CaiT-M-36-448 | Top-1 Accuracy | 76.7 | #16 |
| Image Classification | iNaturalist 2018 | CaiT-M-36 U 224 | Top-1 Accuracy | 78% | #18 |
| Image Classification | iNaturalist 2019 | CaiT-M-36 U 224 | Top-1 Accuracy | 81.8 | #7 |
| Image Classification | Stanford Cars | CaiT-M-36 U 224 | Accuracy | 94.2 | #5 |

Methods