Attention Modules

Multi-DConv-Head Attention

Introduced by So et al. in Primer: Searching for Efficient Transformers for Language Modeling

Multi-DConv-Head Attention, or MDHA, is a type of Multi-Head Attention that utilizes depthwise convolutions after the multi-head projections. It is used in the Primer Transformer architecture.

Specifically, 3x1 depthwise convolutions are added after each of the multi-head projections for the query $Q$, key $K$ and value $V$ in self-attention. These depthwise convolutions operate over the spatial (sequence) dimension of each dense projection's output. Interestingly, this ordering of a pointwise projection followed by a depthwise convolution is the reverse of a typical separable convolution; the authors find the usual ordering to be less effective. They also find that wider depthwise convolutions and standard convolutions not only fail to improve performance, but in several cases hurt it. A minimal sketch of the layer is given below.
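
The following is a minimal PyTorch sketch of MDHA, assuming a decoder-style (causal) language-modeling setting as in Primer. The class name `MultiDConvHeadAttention`, the hyperparameters, and the causal left-padding are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiDConvHeadAttention(nn.Module):
    """Sketch of MDHA: multi-head self-attention with a causal 3x1 depthwise
    convolution applied over the sequence dimension of the Q, K, V projections."""

    def __init__(self, d_model: int, num_heads: int, kernel_size: int = 3):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.kernel_size = kernel_size

        # Dense (pointwise) projections, as in standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # Depthwise convolutions: groups == channels means each channel (and
        # therefore each head) gets its own independent 3x1 filter.
        def dconv():
            return nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

        self.q_dconv, self.k_dconv, self.v_dconv = dconv(), dconv(), dconv()

    def _apply_dconv(self, x, conv):
        # x: (batch, seq_len, d_model); convolve over the sequence dimension.
        x = x.transpose(1, 2)                    # (B, d_model, T)
        x = F.pad(x, (self.kernel_size - 1, 0))  # causal left-padding (assumed)
        return conv(x).transpose(1, 2)           # (B, T, d_model)

    def forward(self, x, attn_mask=None):
        # attn_mask: optional boolean mask, True where attention is disallowed.
        B, T, _ = x.shape
        q = self._apply_dconv(self.q_proj(x), self.q_dconv)
        k = self._apply_dconv(self.k_proj(x), self.k_dconv)
        v = self._apply_dconv(self.v_proj(x), self.v_dconv)

        # Split into heads: (B, num_heads, T, head_dim).
        def split(t):
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```

Because the convolution only sees the current and previous `kernel_size - 1` positions (via the left padding), the layer remains compatible with autoregressive decoding; the convolution itself does not leak future tokens.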

MDHA is similar to Convolutional Attention, which uses separable convolution instead of depthwise convolution and, unlike MDHA, does not apply the convolution operations per attention head.

Source: Primer: Searching for Efficient Transformers for Language Modeling
