Attention Mechanisms

Multiplicative Attention

Introduced by Luong et al. in Effective Approaches to Attention-based Neural Machine Translation

Multiplicative Attention is an attention mechanism where the alignment score function is calculated as:

$$f_{att}\left(\mathbf{h}_{i}, \mathbf{s}_{j}\right) = \mathbf{h}_{i}^{T}\mathbf{W}_{a}\mathbf{s}_{j}$$

Here $\mathbf{h}_{i}$ denotes the hidden states of the encoder (source), $\mathbf{s}_{j}$ the hidden states of the decoder (target), and $\mathbf{W}_{a}$ a learned weight matrix. The function above is thus an alignment score function, and a matrix of alignment scores can be visualized to show the correlation between source and target words. Within a neural network, once we have the alignment scores, we compute the final attention weights by applying a softmax over them, ensuring the weights sum to 1.
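As a minimal PyTorch sketch (not the authors' code) of how the score function and the softmax fit together, assuming single-query attention and illustrative shapes and names (`d_enc`, `d_dec`, `W_a`, `multiplicative_attention`):

```python
import torch

def multiplicative_attention(h, s, W_a):
    """
    h:   encoder hidden states, shape (src_len, d_enc)
    s:   one decoder hidden state, shape (d_dec,)
    W_a: learned weight matrix, shape (d_enc, d_dec)
    Returns attention weights over source positions and the context vector.
    """
    # Alignment scores: f_att(h_i, s_j) = h_i^T W_a s_j for every source position i.
    scores = h @ W_a @ s                    # (src_len,)
    # Softmax turns the scores into weights that sum to 1.
    weights = torch.softmax(scores, dim=0)  # (src_len,)
    # Context vector: weighted sum of encoder states.
    context = weights @ h                   # (d_enc,)
    return weights, context

# Toy usage with random tensors.
src_len, d_enc, d_dec = 5, 8, 6
h = torch.randn(src_len, d_enc)
s = torch.randn(d_dec)
W_a = torch.randn(d_enc, d_dec)
weights, context = multiplicative_attention(h, s, W_a)
print(weights.sum())  # ~1.0
```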

Additive and multiplicative attention are similar in theoretical complexity, but multiplicative attention is faster and more space-efficient in practice because it can be implemented with highly optimized matrix multiplication. The two variants perform similarly for small dimensionality $d_{h}$ of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this gap is to scale $f_{att}\left(\mathbf{h}_{i}, \mathbf{s}_{j}\right)$ by $1/\sqrt{d_{h}}$, as in scaled dot-product attention.
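A sketch of that mitigation, assuming the same shapes and names as the snippet above; only the score computation changes:

```python
import math
import torch

def scaled_multiplicative_attention(h, s, W_a):
    # Divide the scores by sqrt(d_h) so the softmax inputs stay in a
    # reasonable range when the decoder dimensionality d_h is large.
    d_h = s.shape[-1]
    scores = (h @ W_a @ s) / math.sqrt(d_h)
    return torch.softmax(scores, dim=0)
```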

Source: Deep Learning for NLP Best Practices by Sebastian Ruder

Tasks

| Task | Papers | Share |
|---|---|---|
| Named Entity Recognition (NER) | 1 | 12.50% |
| NER | 1 | 12.50% |
| Speech Enhancement | 1 | 12.50% |
| Image-guided Story Ending Generation | 1 | 12.50% |
| Machine Translation | 1 | 12.50% |
| NMT | 1 | 12.50% |
| Sentence | 1 | 12.50% |
| Translation | 1 | 12.50% |

