Attention Patterns

The original self-attention component in the Transformer architecture has $O\left(n^{2}\right)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs. Attention pattern methods aim to reduce this complexity by restricting each query to attend to only a subset of positions rather than the full sequence.
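As a minimal sketch (not from the original text), the difference can be illustrated by comparing the number of query-key pairs scored under a dense mask versus a fixed local (sliding-window) pattern, one common instance of such a restriction; the function names and window parameter below are illustrative.

```python
import numpy as np


def full_attention_mask(n: int) -> np.ndarray:
    """Dense mask: every query attends to every key, i.e. n^2 scored pairs."""
    return np.ones((n, n), dtype=bool)


def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Banded mask: query i attends only to keys within +/- w positions,
    so the number of scored pairs grows as O(n * w) instead of O(n^2)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w


n, w = 8, 2
print(full_attention_mask(n).sum())      # 64 = n^2 scored pairs
print(sliding_window_mask(n, w).sum())   # 34 ~ n * (2w + 1), minus edge effects
```

In practice the sparse pattern is applied inside the attention computation so that masked-out pairs are never materialized, which is what yields the memory savings rather than the mask alone.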