TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	Wave-ViT-L	Top 1 Accuracy	85.5%	# 212
Image Classification	ImageNet	Wave-ViT-L	Number of params	57.5M	# 761
Image Classification	ImageNet	Wave-ViT-L	GFLOPs	14.8	# 336
Image Classification	ImageNet	Wave-ViT-S	Top 1 Accuracy	83.9%	# 347
Image Classification	ImageNet	Wave-ViT-S	Number of params	22.7M	# 571
Image Classification	ImageNet	Wave-ViT-S	GFLOPs	4.7	# 220
Image Classification	ImageNet	Wave-ViT-B	Top 1 Accuracy	84.8%	# 270
Image Classification	ImageNet	Wave-ViT-B	Number of params	33.5M	# 655
Image Classification	ImageNet	Wave-ViT-B	GFLOPs	7.2	# 252

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/wave-vit-unifying-wavelet-and-transformers/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=wave-vit-unifying-wavelet-and-transformers)`

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

11 Jul 2022 · Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such an over-aggressive down-sampling design is not invertible and inevitably causes information loss, especially for high-frequency components in objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This proposal enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performance surpasses state-of-the-art ViT backbones with comparable FLOPs. Source code is available at https://github.com/YehLi/ImageNetModel.
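The contrast the abstract draws between average pooling and wavelet down-sampling can be illustrated with a minimal NumPy sketch. It assumes a single-level 2D Haar wavelet, which the abstract does not specify, and the function names (`haar_dwt2`, `haar_idwt2`, `avg_pool2`) are hypothetical helpers written for this illustration, not part of the authors' code. Both operations halve each spatial dimension, but the Haar transform keeps three extra detail subbands from which the input can be rebuilt, whereas average pooling keeps only a scaled copy of the low-frequency subband.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns four (H/2, W/2) subbands. The labels ll/lh/hl/hh are an
    assumed naming convention for this sketch.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency (scaled block average)
    lh = (a - b + c - d) / 2.0  # difference between columns
    hl = (a + b - c - d) / 2.0  # difference between rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the (H, W) array from the subbands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def avg_pool2(x):
    """2x2 average pooling with stride 2 (the lossy baseline)."""
    return (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 4.0

# Toy 8x8 "feature map" standing in for one channel of keys/values.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

# Wavelet path: down-sample, then invert.
subbands = haar_dwt2(x)
wavelet_error = np.abs(haar_idwt2(*subbands) - x).max()

# Pooling path: down-sample, then up-sample by nearest-neighbour repetition.
pooled = avg_pool2(x)
upsampled = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)
pooling_error = np.abs(upsampled - x).max()

print("max reconstruction error, Haar:", wavelet_error)
print("max reconstruction error, average pooling:", pooling_error)
```

In this sketch the pooled map equals `ll / 2`, so the information that pooling discards is exactly what the `lh`, `hl`, and `hh` subbands carry. How Wave-ViT feeds those subbands into self-attention is described in the paper and is not reproduced here.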

Code

yehli/imagenetmodel official

180

towhee-io/towhee

2,997

Tasks

Image Classification

Instance Segmentation

object-detection

Object Detection

Representation Learning

Semantic Segmentation

Datasets

ImageNet

MS COCO

Results from the Paper

Ranked #212 on Image Classification on ImageNet

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
