ConViT is a type of vision transformer that uses a gated positional self-attention module (GPSA), a form of positional self-attention which can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers, then each attention head is given the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information.
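The gating idea can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: it assumes the gate is a sigmoid of a single learnable scalar per head, that the head's attention is a convex blend of a content softmax and a positional softmax, and that a 1D squared-distance penalty can stand in for the convolution-like positional scores. The names `locality_scores`, `gated_attention`, and `gate_logit` are hypothetical, and the content scores are made up.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def locality_scores(query_idx, n_patches, strength=1.0):
    # Assumption: a 1D stand-in for a local positional score,
    # penalizing squared distance from the query patch.
    return [-strength * (j - query_idx) ** 2 for j in range(n_patches)]

def gated_attention(content_scores, position_scores, gate_logit):
    # Assumption: the gate g = sigmoid(gate_logit) blends the two attention maps.
    # g near 1 favors position (local), g near 0 favors content.
    g = sigmoid(gate_logit)
    content = softmax(content_scores)
    position = softmax(position_scores)
    return [(1.0 - g) * c + g * p for c, p in zip(content, position)]

# Made-up content scores for one query attending over five patches.
content = [0.1, 2.0, 0.3, 0.0, 1.5]
position = locality_scores(query_idx=2, n_patches=5)

local_attn = gated_attention(content, position, gate_logit=4.0)   # gate mostly on position
free_attn = gated_attention(content, position, gate_logit=-4.0)   # gate mostly on content
```

With a large positive `gate_logit` the head attends mostly to the query's own neighborhood (patch 2 here); with a large negative one, the peak moves to whichever patch has the highest content score (patch 1 here), which is the sense in which a head can escape locality.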
Source: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Task | Papers | Share |
---|---|---|
Image Classification | 2 | 50.00% |
Language Modelling | 1 | 25.00% |
Fine-Grained Image Classification | 1 | 25.00% |
Component | Type |
---|---|
Dense Connections | Feedforward Networks |
GPSA | Attention Modules |
Residual Connection | Skip Connections |