Factorized Random Synthesized Attention, introduced with the Synthesizer architecture, is similar to factorized dense synthesized attention but for random synthesizers. Letting $R$ be a randomly initialized matrix, we factorize $R$ into low-rank matrices $R_{1}, R_{2} \in \mathbb{R}^{l \times k}$ in the attention function:
$$ Y = \text{Softmax}\left(R_{1}R_{2}^{T}\right)G\left(X\right) . $$
Here $G(\cdot)$ is a parameterized function that is equivalent to $V$ in Scaled Dot-Product Attention.
For each head, the factorization reduces the parameter cost from $l^{2}$ to $2lk$ where $k \ll l$ and hence helps prevent overfitting. In practice, a small value of $k = 8$ is used.
The basic idea of a Random Synthesizer is to not rely on pairwise token interactions or any information from individual tokens, but rather to learn a task-specific alignment that works well globally across many samples.
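The sketch below illustrates the idea in PyTorch for a single head. It is a minimal, assumed implementation, not the paper's reference code: the class and parameter names (`FactorizedRandomSynthesizedAttention`, `seq_len`, `d_model`, `k`) are chosen here for illustration, and $G(\cdot)$ is modeled as a simple linear value projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedRandomSynthesizedAttention(nn.Module):
    """Single-head sketch: attention weights come from a factorized random
    matrix R1 @ R2^T rather than from query-key dot products."""

    def __init__(self, seq_len: int, d_model: int, k: int = 8):
        super().__init__()
        # Low-rank factors R1, R2 of shape (l, k): 2*l*k parameters instead of l^2.
        self.R1 = nn.Parameter(torch.randn(seq_len, k) / k ** 0.5)
        self.R2 = nn.Parameter(torch.randn(seq_len, k) / k ** 0.5)
        # G(.) plays the role of the value projection V (assumed to be a linear map here).
        self.G = nn.Linear(d_model, d_model)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, seq_len, d_model); seq_len must match the fixed l used at init.
        attn = F.softmax(self.R1 @ self.R2.T, dim=-1)  # (l, l), independent of the input X
        return attn @ self.G(X)                        # Y = Softmax(R1 R2^T) G(X)


# Example usage (shapes only):
layer = FactorizedRandomSynthesizedAttention(seq_len=64, d_model=128, k=8)
Y = layer(torch.randn(2, 64, 128))  # -> (2, 64, 128)
```

Note that the attention map is computed entirely from the learned factors $R_{1}$ and $R_{2}$, so it is the same for every input, which is exactly the "global, task-specific alignment" behavior described above.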
Source: Synthesizer: Rethinking Self-Attention in Transformer Models
| Task | Papers | Share |
| --- | --- | --- |
| Abstractive Text Summarization | 1 | 11.11% |
| Dialogue Generation | 1 | 11.11% |
| Document Summarization | 1 | 11.11% |
| Language Modelling | 1 | 11.11% |
| Linguistic Acceptability | 1 | 11.11% |
| Machine Translation | 1 | 11.11% |
| Semantic Textual Similarity | 1 | 11.11% |
| Text Generation | 1 | 11.11% |
| Translation | 1 | 11.11% |