Attention Modules

Multi-DConv-Head Attention

Introduced by So et al. in Primer: Searching for Efficient Transformers for Language Modeling

Multi-DConv-Head Attention, or MDHA, is a type of Multi-Head Attention that utilizes depthwise convolutions after the multi-head projections. It is used in the Primer Transformer architecture.

Specifically, 3x1 depthwise convolutions are added after each of the multi-head projections for the query $Q$, key $K$ and value $V$ in self-attention. These depthwise convolutions operate over the spatial (sequence) dimension of each dense projection's output. Interestingly, this ordering of a pointwise projection followed by a depthwise convolution is the reverse of a typical separable convolution; the authors find the usual ordering to be less effective. They also find that wider depthwise convolutions and standard convolutions not only fail to improve performance, but in several cases hurt it. A minimal sketch of the layer is given below.
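
The following is a minimal PyTorch sketch of MDHA, assuming a decoder-style (causal) language-modeling setting as in Primer. The class name `MultiDConvHeadAttention`, the hyperparameters, and the causal left-padding are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiDConvHeadAttention(nn.Module):
    """Sketch of MDHA: multi-head self-attention with a causal 3x1 depthwise
    convolution applied over the sequence dimension of the Q, K, V projections."""

    def __init__(self, d_model: int, num_heads: int, kernel_size: int = 3):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.kernel_size = kernel_size

        # Dense (pointwise) projections, as in standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

        # Depthwise convolutions: groups == channels means each channel (and
        # therefore each head) gets its own independent 3x1 filter.
        def dconv():
            return nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

        self.q_dconv, self.k_dconv, self.v_dconv = dconv(), dconv(), dconv()

    def _apply_dconv(self, x, conv):
        # x: (batch, seq_len, d_model); convolve over the sequence dimension.
        x = x.transpose(1, 2)                    # (B, d_model, T)
        x = F.pad(x, (self.kernel_size - 1, 0))  # causal left-padding (assumed)
        return conv(x).transpose(1, 2)           # (B, T, d_model)

    def forward(self, x, attn_mask=None):
        # attn_mask: optional boolean mask, True where attention is disallowed.
        B, T, _ = x.shape
        q = self._apply_dconv(self.q_proj(x), self.q_dconv)
        k = self._apply_dconv(self.k_proj(x), self.k_dconv)
        v = self._apply_dconv(self.v_proj(x), self.v_dconv)

        # Split into heads: (B, num_heads, T, head_dim).
        def split(t):
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```

Because the convolution only sees the current and previous `kernel_size - 1` positions (via the left padding), the layer remains compatible with autoregressive decoding; the convolution itself does not leak future tokens.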

MDHA is similar to Convolutional Attention, which uses separable convolution instead of depthwise convolution and, unlike MDHA, does not apply the convolution operations per attention head.

Source: Primer: Searching for Efficient Transformers for Language Modeling
