ConViT

Introduced by d'Ascoli et al. in ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

ConViT is a vision transformer built on gated positional self-attention (GPSA), a form of positional self-attention that can be equipped with a “soft” convolutional inductive bias. The GPSA layers are initialized to mimic the locality of convolutional layers. Each attention head is then free to escape locality by adjusting a gating parameter that regulates how much attention it pays to position information versus content information.
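The gating idea can be illustrated with a minimal NumPy sketch for a single attention head. It is not the authors' implementation. It assumes the gate is a scalar passed through a sigmoid and used to blend two softmax attention maps, one computed from content scores and one from positional scores. The function names (`gated_attention`, `local_position_scores`) and the `strength` parameter are hypothetical, and the locality initialization here is simplified to a score that decays with squared distance from the query patch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(content_scores, position_scores, gate):
    """Blend content and positional attention maps for one head.

    content_scores, position_scores: (n, n) arrays of raw scores.
    gate: scalar gating parameter (assumed form); sigmoid(gate) is the
    weight given to the positional map, 1 - sigmoid(gate) to content.
    """
    g = sigmoid(gate)
    return (1.0 - g) * softmax(content_scores) + g * softmax(position_scores)


def local_position_scores(grid, strength=1.0):
    """Hypothetical locality initialization on a grid x grid patch layout.

    Scores fall off with squared distance between patches, so the
    positional softmax concentrates on each patch's neighbourhood.
    """
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return -strength * d2


# Usage: a 3x3 patch grid with random content scores.
rng = np.random.default_rng(0)
content = rng.normal(size=(9, 9))
position = local_position_scores(3, strength=2.0)
attn = gated_attention(content, position, gate=1.0)  # leans toward position
```

With a large positive gate the head behaves like a local, convolution-like filter. As the gate decreases, the head relies more on content and can attend beyond its neighbourhood.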

Source: ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Tasks

Task	Papers	Share
Image Classification	2	50.00%
Language Modelling	1	25.00%
Fine-Grained Image Classification	1	25.00%

Components

Component	Type
Dense Connections	Feedforward Networks
GPSA	Attention Modules
Residual Connection	Skip Connections

Categories

Vision Transformers

Image Models