TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K	Focal-L (UperNet, ImageNet-22k pretrain)	Validation mIoU	55.40	# 41
Semantic Segmentation	ADE20K val	Focal-L (UperNet, ImageNet-22k pretrain)	mIoU	55.4	# 28
Instance Segmentation	COCO minival	Focal-L (HTC++, multi-scale)	mask AP	50.9	# 19
Object Detection	COCO minival	Focal-L (DyHead, multi-scale)	box AP	58.7	# 31
Object Detection	COCO minival	Focal-L (DyHead, multi-scale)	AP50	77.2	# 4
Object Detection	COCO minival	Focal-L (DyHead, multi-scale)	APL	73.4	# 4
Object Detection	COCO test-dev	Focal-L (DyHead, multi-scale)	box mAP	58.9	# 30
Instance Segmentation	COCO test-dev	Focal-L (HTC++, multi-scale)	mask AP	51.3	# 17
Instance Segmentation	COCO test-dev	Focal-L (HTC++, multi-scale)	AP50	75.4	# 5
Instance Segmentation	COCO test-dev	Focal-L (HTC++, multi-scale)	AP75	56.5	# 4
Instance Segmentation	COCO test-dev	Focal-L (HTC++, multi-scale)	APS	35.6	# 4
Instance Segmentation	COCO test-dev	Focal-L (HTC++, multi-scale)	APL	64.2	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/instance-segmentation-on-coco)](https://paperswithcode.com/sota/instance-segmentation-on-coco?p=focal-self-attention-for-local-global)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/instance-segmentation-on-coco-minival)](https://paperswithcode.com/sota/instance-segmentation-on-coco-minival?p=focal-self-attention-for-local-global)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=focal-self-attention-for-local-global)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/object-detection-on-coco)](https://paperswithcode.com/sota/object-detection-on-coco?p=focal-self-attention-for-local-global)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/object-detection-on-coco-minival)](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=focal-self-attention-for-local-global)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/focal-self-attention-for-local-global/semantic-segmentation-on-ade20k)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=focal-self-attention-for-local-global)`
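
The six badge lines above all follow one visible pattern: a paper slug plus a benchmark slug, placed into a shields.io endpoint URL and a leaderboard link. A minimal sketch of that pattern is given here. The function name `pwc_badge_markdown` and its parameter names are hypothetical, and the URL layout is inferred only from the lines above, not from any documented API.

```python
def pwc_badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Build badge markdown in the same shape as the lines listed above.

    Both slugs are assumed to be URL-safe already (lowercase, hyphenated),
    as they are in the listed badges.
    """
    # Image URL: shields.io endpoint wrapping the per-benchmark badge URL.
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    # Link target: the leaderboard page, with the paper passed as `p`.
    link_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


if __name__ == "__main__":
    print(pwc_badge_markdown(
        "focal-self-attention-for-local-global",
        "instance-segmentation-on-coco",
    ))
```

Called with the paper slug `focal-self-attention-for-local-global` and the benchmark slug `instance-segmentation-on-coco`, this reproduces the first badge line in the list.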

Focal Self-attention for Local-Global Interactions in Vision Transformers

1 Jul 2021 · Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is arguably the main source of this success. But it also brings challenges due to quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Using this new mechanism, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M parameters and a larger size of 89.8M parameters achieve 83.5% and 83.8% Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAP and 50.9/51.3 mask mAP on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, setting a new state of the art on three of the most challenging computer vision tasks.
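
The core idea in the abstract, fine-grained attention to nearby tokens and coarse-grained attention to distant ones, can be illustrated with a toy sketch. This is not the authors' implementation: it works on a 1D sequence rather than 2D windows, uses a single coarse level built by mean pooling, and uses identity projections for queries, keys, and values. The names `focal_attention_1d`, `radius`, and `pool_size` are assumptions made for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors, size):
    """Average consecutive groups of `size` vectors (last group may be shorter)."""
    dim = len(vectors[0])
    pooled = []
    for start in range(0, len(vectors), size):
        group = vectors[start:start + size]
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def focal_attention_1d(tokens, radius=1, pool_size=4):
    """Toy focal attention: fine local keys plus coarse pooled global keys.

    Assumption: Q, K, and V are the raw tokens (no learned projections).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    coarse = mean_pool(tokens, pool_size)  # coarse summaries of the whole sequence
    outputs = []
    for i, query in enumerate(tokens):
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        keys = tokens[lo:hi] + coarse  # fine neighbours + coarse far-away context
        weights = softmax([dot(query, k) * scale for k in keys])
        outputs.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return outputs


def keys_per_query(n_tokens, radius, pool_size):
    """Upper bound on keys each query sees, versus n_tokens for full attention."""
    return (2 * radius + 1) + math.ceil(n_tokens / pool_size)


if __name__ == "__main__":
    toy_tokens = [[float(i), 1.0] for i in range(16)]
    out = focal_attention_1d(toy_tokens, radius=1, pool_size=4)
    print(len(out), keys_per_query(16, radius=1, pool_size=4))
```

In this sketch each query compares against at most `2 * radius + 1` fine keys plus `ceil(n / pool_size)` coarse keys instead of all `n` tokens, which is where the saving over full self-attention comes from. The paper's mechanism is richer than this sketch.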

Code

microsoft/Focal-Transformer official

542

BR-IDL/PaddleViT

1,185

microsoft/esvit

403

Tasks

Image Classification

Instance Segmentation

Object Detection

Semantic Segmentation

Datasets

ImageNet

MS COCO

ADE20K

Results from the Paper

Ranked #17 on Instance Segmentation on COCO test-dev

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Focal Transformers • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
