Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

8 Aug 2022 · Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models have not yet been sufficiently explored in remote sensing (RS). In this paper, we resort to plain vision transformers with about 100 million parameters, making the first attempt to propose large vision models tailored to RS tasks, and investigate how such large models perform. To handle the large image sizes and the arbitrarily oriented objects in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which significantly reduces the computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. Our models also achieve competitive performance on downstream classification and segmentation tasks compared with existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring.
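To make the mechanism concrete, below is a minimal PyTorch sketch of rotated varied-size window attention as the abstract describes it: queries keep regular, non-overlapping windows, while each window predicts a learnable scale, offset, and rotation that re-sample its key/value tokens, so attention sees adaptively shaped, rotated context at roughly window-attention cost. The class name, the five-parameter window transform (two scales, two offsets, one angle), and its zero-initialization are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RotatedVariedSizeWindowAttention(nn.Module):
    """Sketch of RVSA: regular query windows, with keys/values re-sampled
    from per-window scaled, shifted, and rotated regions."""

    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        assert dim % num_heads == 0
        self.heads, self.hd, self.ws = num_heads, dim // num_heads, window_size
        self.scale = self.hd ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One (sx, sy, ox, oy, theta) per window; zero-init (an assumption)
        # so the module starts as plain window attention.
        self.transform = nn.Conv2d(dim, 5, kernel_size=window_size, stride=window_size)
        nn.init.zeros_(self.transform.weight)
        nn.init.zeros_(self.transform.bias)

    def forward(self, x):
        # x: (B, H, W, C), H and W divisible by the window size.
        B, H, W, C = x.shape
        ws, nh, nw = self.ws, H // self.ws, W // self.ws
        nwin, P = nh * nw, ws * ws
        qkv = self.qkv(x).permute(0, 3, 1, 2)             # (B, 3C, H, W)
        q, k, v = qkv.chunk(3, dim=1)

        # Predict per-window transform parameters from window features.
        t = self.transform(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        sx, sy, ox, oy, theta = t.unbind(-1)              # each (B, nwin)

        # Regular sample offsets within a window, in window-local [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, ws, device=x.device),
            torch.linspace(-1, 1, ws, device=x.device), indexing="ij")
        rel = torch.stack([xs, ys], -1).reshape(1, 1, P, 2)

        # Window centers in image-level normalized coords for grid_sample.
        cy, cx = torch.meshgrid(
            (torch.arange(nh, device=x.device) + 0.5) * ws / H * 2 - 1,
            (torch.arange(nw, device=x.device) + 0.5) * ws / W * 2 - 1,
            indexing="ij")
        centers = torch.stack([cx, cy], -1).reshape(1, nwin, 1, 2)

        # Scale, rotate, then shift each window's sampling points.
        half = rel.new_tensor([ws / W, ws / H])           # half-extent of a window
        pts = rel * half * torch.stack([1 + sx, 1 + sy], -1).unsqueeze(2)
        cos, sin = theta.cos().unsqueeze(-1), theta.sin().unsqueeze(-1)
        pts = torch.stack([cos * pts[..., 0] - sin * pts[..., 1],
                           sin * pts[..., 0] + cos * pts[..., 1]], -1)
        grid = centers + pts + torch.stack([ox, oy], -1).unsqueeze(2) * half

        # Bilinearly sample keys/values at the transformed window locations.
        k = F.grid_sample(k, grid, align_corners=False)   # (B, C, nwin, P)
        v = F.grid_sample(v, grid, align_corners=False)
        q = (q.reshape(B, C, nh, ws, nw, ws).permute(0, 1, 2, 4, 3, 5)
              .reshape(B, C, nwin, P))                    # regular query windows

        def split_heads(z):                               # -> (B*nwin, heads, P, hd)
            return (z.reshape(B, self.heads, self.hd, nwin, P)
                     .permute(0, 3, 1, 4, 2).reshape(-1, self.heads, P, self.hd))

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        out = (q @ k.transpose(-2, -1) * self.scale).softmax(-1) @ v
        out = (out.reshape(B, nwin, self.heads, P, self.hd)
                  .permute(0, 1, 3, 2, 4).reshape(B, nh, nw, ws, ws, C)
                  .permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C))
        return self.proj(out)


# Quick shape check under the same assumptions:
attn = RotatedVariedSizeWindowAttention(dim=768, num_heads=12, window_size=8)
y = attn(torch.randn(1, 64, 64, 768))
print(y.shape)  # torch.Size([1, 64, 64, 768])
```

Because the transform head is zero-initialized, every window begins as a plain axis-aligned window, and the scale, offset, and rotation only deviate from the identity as they are learned; attention is computed within windows rather than over all tokens, which is where the reduced computational cost and memory footprint come from.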

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Aerial Scene Classification | AID (20% as trainset) | ViTAE-B + RVSA | Accuracy | 97.03 | #1 |
| Aerial Scene Classification | AID (20% as trainset) | ViT-B + RVSA | Accuracy | 96.92 | #2 |
| Aerial Scene Classification | AID (50% as trainset) | ViT-B + RVSA | Accuracy | 98.44 | #2 |
| Aerial Scene Classification | AID (50% as trainset) | ViTAE-B + RVSA | Accuracy | 98.50 | #1 |
| Object Detection In Aerial Images | DIOR-R | ViTAE-B + RVSA-ORCN | mAP | 71.05 | #5 |
| Object Detection In Aerial Images | DIOR-R | ViT-B + RVSA-ORCN | mAP | 70.85 | #6 |
| Object Detection In Aerial Images | DOTA | ViTAE-B + RVSA-ORCN | mAP | 81.24% | #8 |
| Object Detection In Aerial Images | DOTA | ViT-B + RVSA-ORCN | mAP | 81.01% | #9 |
| Semantic Segmentation | iSAID | ViTAE-B + RVSA-UperNet | mIoU | 64.49 | #14 |
| Semantic Segmentation | iSAID | ViT-B + RVSA-UperNet | mIoU | 63.85 | #17 |
| Semantic Segmentation | ISPRS Potsdam | ViTAE-B + RVSA-UperNet | Overall Accuracy | 91.22 | #11 |
| Semantic Segmentation | ISPRS Potsdam | ViT-B + RVSA-UperNet | Overall Accuracy | 90.77 | #15 |
| Semantic Segmentation | LoveDA | ViTAE-B + RVSA-UperNet | Category mIoU | 52.44 | #8 |
| Semantic Segmentation | LoveDA | ViT-B + RVSA-UperNet | Category mIoU | 51.95 | #11 |
| Aerial Scene Classification | NWPU (10% as trainset) | ViT-B + RVSA | Accuracy | 93.79 | #5 |
| Aerial Scene Classification | NWPU (10% as trainset) | ViTAE-B + RVSA | Accuracy | 93.93 | #2 |
| Aerial Scene Classification | NWPU (20% as trainset) | ViTAE-B + RVSA | Accuracy | 95.69 | #3 |
| Aerial Scene Classification | NWPU (20% as trainset) | ViT-B + RVSA | Accuracy | 95.49 | #6 |
| Aerial Scene Classification | UCM (50% as trainset) | ViT-B + RVSA | Accuracy | 99.70 | #1 |
| Aerial Scene Classification | UCM (50% as trainset) | ViTAE-B + RVSA | Accuracy | 99.56 | #2 |
