UniFormer: Unifying Convolution and Self-attention for Visual Recognition

24 Jan 2022 · Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

It is a challenging task to learn discriminative representation from images and videos, due to the large local redundancy and complex global dependency in these visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, their limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Different from typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local token affinity in shallow layers and global token affinity in deep layers, allowing them to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a powerful new backbone and adopt it for various vision tasks, from the image to the video domain and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it achieves state-of-the-art performance on a broad range of downstream tasks, e.g., it obtains 82.9%/84.8% top-1 accuracy on Kinetics-400/600, 60.9%/71.2% top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.
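To make the core architectural idea concrete, below is a minimal PyTorch sketch of a UniFormer-style block: a relation aggregator that uses local token affinity (a depthwise convolution over a small neighborhood) in shallow layers and global token affinity (self-attention over all tokens) in deep layers. The class name, kernel sizes, normalization choices, and FFN expansion ratio are illustrative assumptions, not the official implementation; see the linked repository for the authors' exact code.

```python
# Minimal sketch of a UniFormer-style block, assuming PyTorch.
# All structural details (kernel sizes, norms, 4x FFN expansion) are
# illustrative assumptions, not the official implementation.
import torch
import torch.nn as nn


class UniFormerBlockSketch(nn.Module):
    """One block: dynamic position encoding + relation aggregator + FFN."""

    def __init__(self, dim, num_heads=8, local=True):
        super().__init__()
        self.local = local
        # Dynamic position encoding via a depthwise convolution.
        self.pos_embed = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # Shallow (local) blocks normalize the spatial layout; deep (global)
        # blocks normalize over channels of the flattened token sequence.
        self.norm1 = nn.BatchNorm2d(dim) if local else nn.LayerNorm(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        if local:
            # Local token affinity: aggregate only a small neighborhood,
            # cheaply suppressing local redundancy (convolution-like).
            self.aggregator = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        else:
            # Global token affinity: compare all tokens (self-attention).
            self.aggregator = nn.MultiheadAttention(
                dim, num_heads, batch_first=True
            )
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * 4, 1), nn.GELU(), nn.Conv2d(dim * 4, dim, 1)
        )

    def forward(self, x):  # x: (B, C, H, W)
        x = x + self.pos_embed(x)
        if self.local:
            x = x + self.aggregator(self.norm1(x))
        else:
            B, C, H, W = x.shape
            tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
            t = self.norm1(tokens)
            attn, _ = self.aggregator(t, t, t, need_weights=False)
            x = (tokens + attn).transpose(1, 2).reshape(B, C, H, W)
        return x + self.ffn(self.norm2(x))
```

A backbone in this style would stack `local=True` blocks in the early, high-resolution stages, where neighboring tokens are most redundant and full attention would be wasteful, and `local=False` blocks in the later, low-resolution stages, where long-range dependency matters most.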


Results from the Paper

| Task | Dataset | Model | Top-1 Accuracy | Params | GFLOPs |
|---|---|---|---|---|---|
| Image Classification | ImageNet | UniFormer-L (384 res) | 86.3% | 100M | 39.2 |
| Image Classification | ImageNet | UniFormer-L | 85.6% | 100M | 12.6 |
| Image Classification | ImageNet | UniFormer-S | 83.4% | 22M | 3.6 |
