Image Model Blocks

Patch Merger Module

Introduced by Renggli et al. in Learning to Merge Tokens in Vision Transformers

PatchMerger is a module for Vision Transformers that reduces the number of tokens/patches passed to subsequent transformer encoder blocks while maintaining performance and reducing compute. PatchMerger linearly transforms an input of shape N patches × D dimensions using a learnable weight matrix of shape M output patches × D, producing M scores for each of the N patches. A softmax is applied to the resulting M × N score matrix (over the N input patches for each output patch), and this matrix is multiplied with the original input to give an output of shape M × D.

Mathematically, $$Y = \text{softmax}(W X^{\top})\,X$$ where $X \in \mathbb{R}^{N \times D}$ is the input and $W \in \mathbb{R}^{M \times D}$ is the learnable weight matrix.
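The following is a minimal sketch of this operation, assuming PyTorch and batched inputs of shape (batch, N, D); the class and parameter names are illustrative and not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Merge N input patches into M output patches via learned scores."""
    def __init__(self, dim: int, num_output_patches: int):
        super().__init__()
        # Learnable weight matrix W of shape (M, D)
        self.weight = nn.Parameter(torch.randn(num_output_patches, dim) * dim ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, D)
        scores = torch.matmul(x, self.weight.t())       # (batch, N, M)
        attn = scores.transpose(1, 2).softmax(dim=-1)   # (batch, M, N), softmax over the N input patches
        return torch.matmul(attn, x)                    # (batch, M, D)

# Example usage: merge 196 patches of a ViT feature map down to 8 tokens.
x = torch.randn(2, 196, 768)
merger = PatchMerger(dim=768, num_output_patches=8)
print(merger(x).shape)  # torch.Size([2, 8, 768])
```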

Formula from: Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to Merge Tokens in Vision Transformers. arXiv preprint arXiv:2202.12015.

Source: Learning to Merge Tokens in Vision Transformers

