Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (70%)	Top 1 Accuracy	79.6	#16
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (70%)	GFLOPs	3.0	#23
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (50%)	Top 1 Accuracy	79.0	#31
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (50%)	GFLOPs	2.3	#5
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (60%)	Top 1 Accuracy	79.3	#25
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (60%)	GFLOPs	2.6	#12
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (20%)	Top 1 Accuracy	76.4	#41
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (20%)	GFLOPs	1.6	#1
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (30%)	Top 1 Accuracy	77.8	#40
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (30%)	GFLOPs	1.8	#2
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (40%)	Top 1 Accuracy	78.6	#34
Efficient ViTs	ImageNet-1K (with DeiT-S)	BAT (40%)	GFLOPs	2.0	#3
Efficient ViTs	ImageNet-1K (with DeiT-T)	BAT	Top 1 Accuracy	72.3	#4
Efficient ViTs	ImageNet-1K (with DeiT-T)	BAT	GFLOPs	0.8	#8
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	BAT	Top 1 Accuracy	83.1	#6
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	BAT	GFLOPs	4.7	#5

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

CVPR 2023 · Sifan Long, Zhen Zhao, Jimin Pi, Shengsheng Wang, Jingdong Wang

Vision transformers have achieved significant improvements on various vision tasks, but the quadratic interactions between tokens substantially reduce their computational efficiency. Many pruning methods have recently been proposed to remove redundant tokens for efficient vision transformers. However, existing studies mainly focus on token importance to preserve local attentive tokens and completely ignore global token diversity. In this paper, we emphasize the importance of diverse global semantics and propose an efficient token decoupling and merging method that jointly considers token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserving the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize token diversity. Despite its simplicity, our method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces FLOPs by 35% with only a 0.2% accuracy drop. Notably, by maintaining token diversity, our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
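
The decoupling and merging steps described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the official repository for that). It assumes tokens are plain lists of floats, that `cls_attn` holds one positive class-token attention score per token, and it swaps in a simple greedy cosine-similarity grouping as a stand-in for the paper's merging step. The matching of homogeneous attentive tokens is omitted. The names `decouple`, `merge_similar`, `prune_tokens`, `keep_ratio`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decouple(cls_attn, keep_ratio):
    """Split token indices into (attentive, inattentive) by class-token attention."""
    n_keep = max(1, round(len(cls_attn) * keep_ratio))
    order = sorted(range(len(cls_attn)), key=lambda i: cls_attn[i], reverse=True)
    return sorted(order[:n_keep]), sorted(order[n_keep:])


def merge_similar(tokens, weights, threshold):
    """Greedy stand-in for merging: fuse each token into the first group whose
    attention-weighted mean has cosine similarity >= threshold with it."""
    groups = []  # each group is [weighted_sum_vector, total_weight]
    for tok, w in zip(tokens, weights):
        w = max(w, 1e-12)  # assumption: weights are positive; guard against zero
        for g in groups:
            mean = [v / g[1] for v in g[0]]
            if cosine(tok, mean) >= threshold:
                g[0] = [s + w * t for s, t in zip(g[0], tok)]
                g[1] += w
                break
        else:
            groups.append([[w * t for t in tok], w])
    return [[v / total for v in vec] for vec, total in groups]


def prune_tokens(tokens, cls_attn, keep_ratio=0.5, threshold=0.9):
    """Keep the attentive tokens unchanged and append the merged inattentive ones."""
    attentive, inattentive = decouple(cls_attn, keep_ratio)
    kept = [tokens[i] for i in attentive]
    merged = merge_similar(
        [tokens[i] for i in inattentive],
        [cls_attn[i] for i in inattentive],
        threshold,
    )
    return kept + merged
```

In this sketch, the output contains fewer tokens than the input whenever at least two inattentive tokens fall into the same group, which is the source of the compute savings. Merging rather than discarding keeps a summary of the inattentive tokens, in the spirit of the diversity argument above.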

Code

BWLONG/BeyondAttentiveTokens (official)

Tasks

Computational Efficiency

Efficient ViTs

Datasets

ImageNet

Results from the Paper

Ranked #4 on Efficient ViTs on ImageNet-1K (with DeiT-T)

Methods

Pruning
