Attention Modules

Multi-Head Linear Attention

Introduced by Wang et al. in Linformer: Self-Attention with Linear Complexity

Multi-Head Linear Attention is a type of linear multi-head self-attention module, proposed with the Linformer architecture. The main idea is to add two linear projection matrices $E_{i}, F_{i} \in \mathbb{R}^{n\times{k}}$ when computing the key and value. These projections act along the sequence-length dimension: we first project the original $\left(n \times d\right)$-dimensional key and value layers $KW_{i}^{K}$ and $VW_{i}^{V}$ into $\left(k\times{d}\right)$-dimensional projected key and value layers. We then compute an $\left(n\times{k}\right)$-dimensional context mapping matrix $\bar{P}$ using scaled dot-product attention:

$$ \bar{\text{head}_{i}} = \text{Attention}\left(QW^{Q}_{i}, E_{i}KW_{i}^{K}, F_{i}VW_{i}^{V}\right) = \underbrace{\text{softmax}\left(\frac{QW^{Q}_{i}\left(E_{i}KW_{i}^{K}\right)^{T}}{\sqrt{d_{k}}}\right)}_{\bar{P}:\,n\times{k}} \cdot F_{i}VW_{i}^{V} $$

Finally, we compute context embeddings for each head using $\bar{P} \cdot \left(F_{i}{V}W_{i}^{V}\right)$.
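The steps above can be made concrete with a short example. Below is a minimal PyTorch sketch of the idea, assuming learned per-head projection matrices applied along the sequence dimension (from $n$ down to $k$); the class name `LinearMultiHeadAttention` and the arguments `seq_len` and `proj_dim` are illustrative choices, not names from the paper or an official implementation.

```python
# Minimal sketch of Linformer-style multi-head linear attention (illustrative, not the official code).
import math
import torch
import torch.nn as nn


class LinearMultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, seq_len, proj_dim):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Standard W^Q, W^K, W^V, W^O projections.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        # Per-head E_i and F_i, projecting the sequence dimension n -> k.
        # (Random initialization here is an assumption for the sketch.)
        self.E = nn.Parameter(torch.randn(n_heads, proj_dim, seq_len) / math.sqrt(seq_len))
        self.F = nn.Parameter(torch.randn(n_heads, proj_dim, seq_len) / math.sqrt(seq_len))

    def forward(self, x):
        # x: (batch, n, d_model)
        b, n, _ = x.shape
        # Split into heads: (batch, heads, n, d_head)
        q = self.w_q(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        # Project key/value layers along the sequence axis: (batch, heads, k, d_head)
        k_proj = torch.einsum('hpn,bhnd->bhpd', self.E, k)
        v_proj = torch.einsum('hpn,bhnd->bhpd', self.F, v)
        # Context mapping matrix P_bar: (batch, heads, n, k), linear in sequence length n.
        scores = q @ k_proj.transpose(-2, -1) / math.sqrt(self.d_head)
        p_bar = scores.softmax(dim=-1)
        # Context embeddings per head: P_bar . (F_i V W_i^V), then merge heads.
        out = (p_bar @ v_proj).transpose(1, 2).reshape(b, n, -1)
        return self.w_o(out)


# Usage: sequence length n = 128 projected down to k = 32.
attn = LinearMultiHeadAttention(d_model=64, n_heads=8, seq_len=128, proj_dim=32)
y = attn(torch.randn(2, 128, 64))
print(y.shape)  # torch.Size([2, 128, 64])
```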

Source: Linformer: Self-Attention with Linear Complexity
