Attention Patterns

The original self-attention component in the Transformer architecture has $O\left(n^{2}\right)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs. Attention pattern methods aim to reduce this complexity by restricting each query to attend to only a subset of positions rather than the full sequence.
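As a minimal sketch (not from the original text), the difference can be illustrated by comparing the number of query-key pairs scored under a dense mask versus a fixed local (sliding-window) pattern, one common instance of such a restriction; the function names and window parameter below are illustrative.

```python
import numpy as np


def full_attention_mask(n: int) -> np.ndarray:
    """Dense mask: every query attends to every key, i.e. n^2 scored pairs."""
    return np.ones((n, n), dtype=bool)


def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Banded mask: query i attends only to keys within +/- w positions,
    so the number of scored pairs grows as O(n * w) instead of O(n^2)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w


n, w = 8, 2
print(full_attention_mask(n).sum())      # 64 = n^2 scored pairs
print(sliding_window_mask(n, w).sum())   # 34 ~ n * (2w + 1), minus edge effects
```

In practice the sparse pattern is applied inside the attention computation so that masked-out pairs are never materialized, which is what yields the memory savings rather than the mask alone.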