Image Model Blocks

Patch Merger Module

Introduced by Renggli et al. in Learning to Merge Tokens in Vision Transformers

PatchMerger is a module for Vision Transformers that reduces the number of tokens/patches passed to subsequent transformer encoder blocks while maintaining performance and reducing compute. PatchMerger linearly transforms an input of shape N patches × D dimensions using a learnable weight matrix of shape M output patches × D, producing M scores for each of the N patches. A softmax is applied to the resulting M × N score matrix (over the N input patches for each output patch), and this matrix is multiplied with the original input to give an output of shape M × D.

Mathematically, $$Y = \text{softmax}(W X^{\top})\,X$$ where $X \in \mathbb{R}^{N \times D}$ is the input and $W \in \mathbb{R}^{M \times D}$ is the learnable weight matrix.
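The following is a minimal sketch of this operation, assuming PyTorch and batched inputs of shape (batch, N, D); the class and parameter names are illustrative and not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Merge N input patches into M output patches via learned scores."""
    def __init__(self, dim: int, num_output_patches: int):
        super().__init__()
        # Learnable weight matrix W of shape (M, D)
        self.weight = nn.Parameter(torch.randn(num_output_patches, dim) * dim ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, D)
        scores = torch.matmul(x, self.weight.t())       # (batch, N, M)
        attn = scores.transpose(1, 2).softmax(dim=-1)   # (batch, M, N), softmax over the N input patches
        return torch.matmul(attn, x)                    # (batch, M, D)

# Example usage: merge 196 patches of a ViT feature map down to 8 tokens.
x = torch.randn(2, 196, 768)
merger = PatchMerger(dim=768, num_output_patches=8)
print(merger(x).shape)  # torch.Size([2, 8, 768])
```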

Formula from: Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to Merge Tokens in Vision Transformers. arXiv preprint arXiv:2202.12015.

Source: Learning to Merge Tokens in Vision Transformers

