Vision Transformers

ConViT is a vision transformer built on a gated positional self-attention (GPSA) module, a form of positional self-attention that can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers; each attention head is then free to escape locality by adjusting a gating parameter that regulates how much attention is paid to positional versus content information.

Source: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
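
The gating mechanism lends itself to a compact sketch. Below is a minimal, illustrative PyTorch implementation of a GPSA layer, assuming a square patch grid and a simple (Δx, Δy, Δx²+Δy²) relative position encoding; the class name `GPSA`, the `pos_proj` construction, and the hyperparameters are assumptions for illustration, not the authors' reference code.

```python
# Minimal GPSA sketch (assumed names and shapes, not the reference implementation).
import math
import torch
import torch.nn as nn


class GPSA(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qk = nn.Linear(dim, dim * 2, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Per-head gating parameter: sigmoid(gate) weighs position vs. content.
        self.gate = nn.Parameter(torch.ones(num_heads))
        # Maps each (dx, dy, dx^2 + dy^2) relative encoding to per-head logits.
        self.pos_proj = nn.Linear(3, num_heads)

    def rel_positions(self, n):
        # Relative position encodings for an n x n grid of patches.
        coords = torch.stack(
            torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij"),
            dim=-1,
        ).reshape(-1, 2).float()
        rel = coords[None, :, :] - coords[:, None, :]    # (N, N, 2)
        dist = (rel ** 2).sum(-1, keepdim=True)          # (N, N, 1)
        return torch.cat([rel, dist], dim=-1)            # (N, N, 3)

    def forward(self, x, grid_size):
        B, N, C = x.shape
        q, k = (
            self.qk(x)
            .reshape(B, N, 2, self.num_heads, self.head_dim)
            .permute(2, 0, 3, 1, 4)
        )
        # Content-based attention logits, as in standard self-attention.
        content = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # Position-based attention logits from relative encodings.
        pos = self.pos_proj(self.rel_positions(grid_size).to(x.device))
        pos = pos.permute(2, 0, 1).unsqueeze(0)          # (1, H, N, N)
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        # Gated mix: g weighs positional attention, (1 - g) weighs content.
        attn = (1.0 - g) * content.softmax(-1) + g * pos.softmax(-1)
        v = self.v(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: a 14 x 14 patch grid gives N = 196 tokens.
layer = GPSA(dim=64, num_heads=4)
y = layer(torch.randn(2, 196, 64), grid_size=14)
```

In the paper, the positional term is additionally initialized so that each head attends to a fixed local offset, reproducing a convolutional pattern; in this sketch the gates simply start at sigmoid(1) ≈ 0.73, which biases early training toward positional attention in the same spirit.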

Tasks


Task                                Papers   Share
Image Classification                     2   50.00%
Language Modelling                       1   25.00%
Fine-Grained Image Classification        1   25.00%
