Efficient ViTs

26 papers with code • 3 benchmarks • 0 datasets

Increasing the efficiency of ViTs without modifying the architecture (e.g., key and query sparsification, token pruning and merging).
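
A minimal sketch of the common token-pruning idea shared by many of the papers below, assuming a DeiT-style model with a CLS token. The function name and the use of CLS attention as the importance score are illustrative, not tied to any specific paper:

```python
import torch

def prune_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.5):
    """Keep the top-k patch tokens ranked by their attention from the CLS token.

    tokens:   (B, N, D)  -- CLS token at index 0, patch tokens at 1..N-1
    cls_attn: (B, N-1)   -- attention weights from CLS to each patch token
    """
    B, N, D = tokens.shape
    k = max(1, int((N - 1) * keep_ratio))          # number of patch tokens to keep
    idx = cls_attn.topk(k, dim=1).indices + 1      # (B, k), shifted past the CLS token
    batch_idx = torch.arange(B).unsqueeze(1)       # (B, 1) for advanced indexing
    kept = tokens[batch_idx, idx]                  # (B, k, D) selected patch tokens
    return torch.cat([tokens[:, :1], kept], dim=1) # re-attach CLS token

# toy usage
x = torch.randn(2, 197, 192)                       # e.g. DeiT-T: 196 patches + CLS
attn = torch.rand(2, 196).softmax(-1)
print(prune_tokens_by_cls_attention(x, attn, 0.5).shape)  # torch.Size([2, 99, 192])
```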

Most implemented papers

Global Vision Transformer Pruning with Hessian-Aware Saliency

NVlabs/NViT CVPR 2023

This work challenges the common ViT design philosophy of using a uniform dimension across all stacked blocks in a model stage: through the first systematic attempt at global structural pruning, it redistributes parameters both across transformer blocks and between different structures within each block.
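
A hedged sketch of the "rank structures globally across layers" idea. NViT's actual Hessian-aware saliency is more involved; this only uses a simple first-order Taylor proxy, and the function names are made up:

```python
import torch
import torch.nn as nn

def global_structure_scores(model):
    """Score structures (here, rows of each Linear layer) by the Taylor proxy
    (sum(grad * weight))^2 and rank them globally across layers."""
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and module.weight.grad is not None:
            g, w = module.weight.grad, module.weight
            row_score = (g * w).sum(dim=1).pow(2)         # one score per output neuron
            scores += [(name, i, s.item()) for i, s in enumerate(row_score)]
    return sorted(scores, key=lambda t: t[2])             # least salient first

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
loss = model(torch.randn(32, 8)).pow(2).mean()
loss.backward()
print(global_structure_scores(model)[:5])                 # 5 least important neurons, any layer
```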

Adaptive Token Sampling For Efficient Vision Transformers

adaptivetokensampling/ATS 30 Nov 2021

Since ATS is parameter-free, it can be added to off-the-shelf pre-trained vision transformers as a plug-and-play module, reducing their GFLOPs without any additional training.
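
A hedged sketch of attention-based token sampling in the spirit of ATS: patch tokens are scored by CLS attention weighted by value-vector norms, then indices are drawn by inverse-transform sampling over the score CDF, so duplicates collapse and the kept token count adapts per image. Function and variable names are illustrative, not the official implementation:

```python
import torch

def adaptive_token_sampling(tokens, cls_attn, value_norm, n_samples=64):
    scores = cls_attn * value_norm                       # (B, N-1) significance scores
    probs = scores / scores.sum(dim=1, keepdim=True)
    cdf = probs.cumsum(dim=1)                            # (B, N-1)
    u = torch.rand(probs.size(0), n_samples, device=tokens.device)
    idx = torch.searchsorted(cdf, u).clamp(max=probs.size(1) - 1)
    kept = []
    for b in range(tokens.size(0)):
        uniq = idx[b].unique() + 1                       # duplicates collapse; +1 skips CLS
        kept.append(torch.cat([tokens[b, :1], tokens[b, uniq]], dim=0))
    return kept                                           # list: variable token count per image

x = torch.randn(2, 197, 192)
attn = torch.rand(2, 196).softmax(-1)
vnorm = torch.rand(2, 196)
print([t.shape for t in adaptive_token_sampling(x, attn, vnorm)])
```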

A-ViT: Adaptive Tokens for Efficient Vision Transformer

NVlabs/A-ViT CVPR 2022

A-ViT achieves this by automatically reducing the number of vision transformer tokens processed in the network as inference proceeds.
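
A hedged sketch of ACT-style per-token halting in the spirit of A-ViT: each block emits a halting probability per token, and once a token's cumulative score crosses a threshold it is frozen out of later blocks. The layer choice and halting head are illustrative, and the masking here only demonstrates the mechanism; the real speedup comes from actually discarding halted tokens:

```python
import torch
import torch.nn as nn

class HaltingViT(nn.Module):
    def __init__(self, dim=192, depth=4, eps=0.01):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)]
        )
        self.eps = eps

    def forward(self, x):                                   # x: (B, N, D)
        B, N, _ = x.shape
        cum_halt = torch.zeros(B, N, device=x.device)
        active = torch.ones(B, N, dtype=torch.bool, device=x.device)
        for blk in self.blocks:
            x = torch.where(active.unsqueeze(-1), blk(x), x)  # halted tokens pass through
            halt_p = torch.sigmoid(x[..., 0])                  # halting score from channel 0
            cum_halt = cum_halt + halt_p * active
            active = active & (cum_halt < 1.0 - self.eps)      # freeze tokens that have halted
        return x, active                                       # `active` marks never-halted tokens

out, still_active = HaltingViT()(torch.randn(2, 197, 192))
print(out.shape, still_active.float().mean().item())
```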

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

peiyanflying/spvit 27 Dec 2021

Moreover, our framework guarantees that the identified model meets the resource specifications of mobile devices and FPGAs, and even achieves real-time execution of DeiT-T on mobile platforms.

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

youweiliang/evit 16 Feb 2022

Second, while maintaining the same computational cost, our method enables ViTs to take more image tokens from higher-resolution images as input, improving recognition accuracy.
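
A hedged sketch of EViT-style token reorganization: the most CLS-attentive patch tokens are kept, and the inattentive ones are fused into a single extra token via an attention-weighted average. Names and the exact weighting are illustrative:

```python
import torch

def reorganize_tokens(tokens, cls_attn, keep_ratio=0.7):
    B, N, D = tokens.shape
    k = int((N - 1) * keep_ratio)
    order = cls_attn.argsort(dim=1, descending=True)           # (B, N-1)
    top_idx, rest_idx = order[:, :k] + 1, order[:, k:] + 1     # shift past CLS
    b = torch.arange(B).unsqueeze(1)
    kept = tokens[b, top_idx]                                  # (B, k, D) attentive tokens
    rest = tokens[b, rest_idx]                                 # (B, N-1-k, D) inattentive tokens
    w = cls_attn[b, rest_idx - 1]                              # fusion weights
    fused = (w.unsqueeze(-1) * rest).sum(1, keepdim=True) / w.sum(1, keepdim=True).unsqueeze(-1)
    return torch.cat([tokens[:, :1], kept, fused], dim=1)      # CLS + kept + 1 fused token

x = torch.randn(2, 197, 192)
attn = torch.rand(2, 196).softmax(-1)
print(reorganize_tokens(x, attn, 0.7).shape)                   # torch.Size([2, 139, 192])
```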

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

cydia2018/as-vit 28 Sep 2022

The learnable thresholds are optimized through budget-aware training to balance accuracy and complexity, yielding pruning configurations tailored to different input instances.
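
A hedged sketch of learnable-threshold pruning with a budget-aware loss: a soft keep-mask sigmoid((score - threshold) / T) stays differentiable during training, and a budget term pushes the expected keep ratio toward a target. The exact scoring and loss in the paper differ; names here are illustrative:

```python
import torch
import torch.nn as nn

class ThresholdPruner(nn.Module):
    def __init__(self, init_threshold=0.0, temperature=0.1):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, token_scores):                      # (B, N) importance scores
        return torch.sigmoid((token_scores - self.threshold) / self.temperature)

pruner = ThresholdPruner()
scores = torch.randn(4, 196)                              # e.g. attention-derived scores
keep_mask = pruner(scores)                                # soft mask in (0, 1)

target_keep = 0.5                                         # compute budget: keep ~50% of tokens
budget_loss = (keep_mask.mean() - target_keep).pow(2)     # would be added to the task loss
budget_loss.backward()
print(pruner.threshold.grad)
```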

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

gatech-eic/castling-vit CVPR 2023

Vision Transformers (ViTs) have shown impressive performance but still incur a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViT attention measures global similarities and thus has quadratic complexity in the number of input tokens.
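
A short sketch of why softmax attention is quadratic in the token count and how kernelized "linear" attention reorders the matrix products to avoid it. This uses a generic feature map, not the linear-angular kernel proposed by Castling-ViT:

```python
import torch

def softmax_attention(q, k, v):
    # Forms an N x N attention matrix -> O(N^2 * d) in the number of tokens N.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def linear_attention(q, k, v, feature_map=torch.nn.functional.elu):
    # Kernelized attention: phi(q) @ (phi(k)^T @ v) reorders the matmuls so the
    # cost is O(N * d^2) -- linear in N. (Generic kernel, not the angular one.)
    q, k = feature_map(q) + 1, feature_map(k) + 1
    kv = k.transpose(-2, -1) @ v                              # (B, d, d)
    z = q @ k.sum(dim=1, keepdim=True).transpose(-2, -1)      # (B, N, 1) normalizer
    return (q @ kv) / z

q = k = v = torch.randn(1, 196, 64)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```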

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

BWLONG/BeyondAttentiveTokens CVPR 2023

In this paper, we emphasize the importance of diverse global semantics and propose an efficient token decoupling and merging method that jointly considers token importance and diversity for token pruning.
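
A hedged sketch of jointly weighing token importance and diversity: tokens are picked greedily, trading importance against similarity to already-selected tokens (an MMR-style criterion), so pruning preserves diverse semantics instead of only the most attentive tokens. This illustrates the idea only, not the paper's decoupling-and-merging method:

```python
import torch

def select_tokens_importance_diversity(tokens, importance, k=32, lam=0.5):
    B, N, D = tokens.shape
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    selected = []
    for b in range(B):
        chosen = [importance[b].argmax().item()]               # start from the top token
        for _ in range(k - 1):
            sim = feats[b] @ feats[b, chosen].T                 # (N, len(chosen))
            score = lam * importance[b] - (1 - lam) * sim.max(dim=1).values
            score[chosen] = float("-inf")                       # never re-pick a token
            chosen.append(score.argmax().item())
        selected.append(tokens[b, sorted(chosen)])
    return torch.stack(selected)                                # (B, k, D)

x = torch.randn(2, 196, 192)
imp = torch.rand(2, 196)
print(select_tokens_importance_diversity(x, imp, k=32).shape)   # torch.Size([2, 32, 192])
```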

Making Vision Transformers Efficient from A Token Sparsification View

changsn/STViT-R CVPR 2023

In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks.

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

megvii-research/tps-cvpr2023 CVPR 2023

Experiments on various transformers demonstrate the effectiveness of our method, and analysis experiments show its higher robustness to errors in the token pruning policy.