P2T: Pyramid Pooling Transformer for Scene Understanding

22 Jun 2021  ·  Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng

Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in vision transformers is that the large sequence length of image tokens leads to high (quadratic) computational cost. A popular solution is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation appears less powerful. To this end, we note that pyramid pooling has proven effective in various vision tasks owing to its powerful ability to abstract context. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. With this pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority over previous CNN- and transformer-based networks on various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. The code will be released at https://github.com/yuhuan-wu/P2T.
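
The abstract describes folding pyramid pooling into MHSA so that keys and values are computed from a short, multi-scale pooled token sequence while queries keep full resolution. Below is a minimal PyTorch sketch of that idea; the pooling output sizes, layer names, and module structure are illustrative assumptions, not the released P2T code.

```python
# Sketch of pooling-based multi-head self-attention: keys/values come from
# pyramid-pooled tokens, so attention cost scales with N*P instead of N^2.
# Pool sizes (1, 2, 3, 6) are an illustrative assumption.
import torch
import torch.nn as nn


class PyramidPoolingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Adaptive average pooling at several output sizes forms the pyramid.
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in pool_sizes)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads)
        q = q.permute(0, 2, 1, 3)                            # (B, heads, N, C/h)

        # Pyramid pooling: pool the 2D feature map at multiple scales,
        # flatten each result, and concatenate into one short sequence.
        feat = x.permute(0, 2, 1).reshape(B, C, H, W)
        pooled = [p(feat).flatten(2) for p in self.pools]    # each (B, C, s*s)
        pooled = torch.cat(pooled, dim=2).permute(0, 2, 1)   # (B, P, C), P << N
        pooled = self.norm(pooled)

        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                     # each (B, heads, P, C/h)

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, heads, N, P)
        out = attn.softmax(dim=-1) @ v                       # (B, heads, N, C/h)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With the pooled sequence length P fixed to a few dozen tokens, the attention matrix shrinks from N x N to N x P, which is the source of the complexity reduction the abstract refers to, while the multi-scale pooling preserves contextual information that a single pooling step would discard.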


Results from the Paper


Ranked #5 on RGB Salient Object Detection on DUTS-TE (max F-measure metric)

Task                         | Dataset | Model     | Metric Name   | Metric Value | Global Rank
RGB Salient Object Detection | DUTS-TE | P2T-Small | MAE           | 0.029        | #7
RGB Salient Object Detection | DUTS-TE | P2T-Small | max F-measure | 0.912        | #5
RGB Salient Object Detection | DUTS-TE | P2T-Tiny  | MAE           | 0.033        | #9
RGB Salient Object Detection | DUTS-TE | P2T-Tiny  | max F-measure | 0.895        | #7
