Vision Transformers

Shuffle Transformer

Introduced by Huang et al. in Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer

The Shuffle Transformer block consists of the Shuffle Multi-Head Self-Attention module (ShuffleMHSA), the Neighbor-Window Connection module (NWC), and the MLP module. To introduce cross-window connections while preserving the efficient computation of non-overlapping windows, consecutive Shuffle Transformer blocks alternate between WMSA and Shuffle-WMSA: the first window-based transformer block uses the regular window partition strategy, and the second uses window-based self-attention with spatial shuffle. In addition, the Neighbor-Window Connection module (NWC) is added to each block to strengthen connections among neighboring windows. The proposed Shuffle Transformer block can therefore build rich cross-window connections and augment the representation. The consecutive Shuffle Transformer blocks are computed as:

$$ x^{l}=\mathbf{W M S A}\left(\mathbf{B N}\left(z^{l-1}\right)\right)+z^{l-1} $$

$$ y^{l}=\mathbf{N W C}\left(x^{l}\right)+x^{l} $$

$$ z^{l}=\mathbf{M L P}\left(\mathbf{B N}\left(y^{l}\right)\right)+y^{l} $$

$$ x^{l+1}=\mathbf{S h u f f l e - W M S A}\left(\mathbf{B N}\left(z^{l}\right)\right)+z^{l} $$

$$ y^{l+1}=\mathbf{N W C}\left(x^{l+1}\right)+x^{l+1} $$

$$ z^{l+1}=\mathbf{M L P}\left(\mathbf{B N}\left(y^{l+1}\right)\right)+y^{l+1} $$

where $x^l$, $y^l$ and $z^l$ denote the output features of the (Shuffle-)WMSA module, the Neighbor-Window Connection module and the MLP module for block $l$, respectively; WMSA and Shuffle-WMSA denote window-based multi-head self-attention without/with spatial shuffle, respectively.
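As a minimal sketch, the residual structure of the six equations above can be written directly as code, with the WMSA, Shuffle-WMSA, NWC, MLP, and BN modules passed in as callables (their internal implementations are not shown here):

```python
def shuffle_transformer_blocks(z, wmsa, shuffle_wmsa, nwc, mlp, bn):
    """Compute two consecutive Shuffle Transformer blocks (blocks l and l+1).

    Each step is a residual update matching the equations above.
    """
    # Block l: regular window-based self-attention.
    x = wmsa(bn(z)) + z          # x^l
    y = nwc(x) + x               # y^l
    z = mlp(bn(y)) + y           # z^l
    # Block l+1: window-based self-attention with spatial shuffle.
    x = shuffle_wmsa(bn(z)) + z  # x^{l+1}
    y = nwc(x) + x               # y^{l+1}
    z = mlp(bn(y)) + y           # z^{l+1}
    return z
```

With identity modules, each of the six residual steps simply doubles the input, which makes the composition easy to check.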

Source: Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
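For intuition, the spatial shuffle itself is a fixed permutation of token positions, analogous to the channel shuffle in ShuffleNet: viewing a sequence of tokens as a (window_size, n_windows) grid and transposing it places tokens from distant windows into the same new window. A minimal 1-D sketch in plain Python (the actual model shuffles 2-D window grids; the function name and 1-D simplification here are illustrative):

```python
def spatial_shuffle_1d(tokens, window_size):
    """Permute a 1-D token sequence so that each new window of size
    `window_size` gathers tokens that were `n_windows` positions apart
    before the shuffle (i.e. one token from each original window)."""
    n = len(tokens)
    assert n % window_size == 0, "sequence length must be divisible by window size"
    n_windows = n // window_size
    # Equivalent to reshaping to (window_size, n_windows), transposing,
    # and flattening back to length n.
    return [tokens[j * n_windows + i]
            for i in range(n_windows)
            for j in range(window_size)]

tokens = list(range(6))                    # positions 0..5, window_size = 2
shuffled = spatial_shuffle_1d(tokens, 2)   # -> [0, 3, 1, 4, 2, 5]
# Windows after the shuffle: [0, 3], [1, 4], [2, 5] -- each mixes
# positions drawn from different original windows.
```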
