Global and Sliding Window Attention

Introduced by Beltagy et al. in Longformer: The Long-Document Transformer

Global and Sliding Window Attention is an attention pattern for attention-based models. It is motivated by the fact that the self-attention component of the original Transformer has $O\left(n^{2}\right)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs.

Since windowed and dilated attention patterns are not flexible enough to learn task-specific representations, the authors of the Longformer add “global attention” at a few pre-selected input locations. This attention operation is symmetric: a token with global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. The figure in the paper shows an example of sliding window attention combined with global attention at a few tokens at custom locations. For classification, for example, global attention is placed on the [CLS] token, while for question answering it is placed on all question tokens. A minimal sketch of the combined attention mask is given below.
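The following is a minimal NumPy sketch of the pattern, not the Longformer implementation: it builds a boolean mask combining a sliding window with symmetric global attention at chosen token indices and applies it in standard masked softmax attention. The window size, the choice of global indices, and all function names here are illustrative assumptions.

```python
# Illustrative sketch (not the Longformer code): sliding-window attention
# mask plus symmetric global attention at a few chosen token positions.
import numpy as np

def attention_mask(seq_len, window, global_idx):
    """Boolean mask: entry (i, j) is True if query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Sliding window: each token attends to neighbours within +/- window//2.
    mask = np.abs(i - j) <= window // 2
    # Global attention is symmetric: global tokens attend to everything,
    # and every token attends to the global tokens.
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted to the allowed pairs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed query-key pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 8 tokens, window of 3, global attention on token 0
# (e.g. a [CLS]-style token for classification).
rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
mask = attention_mask(n, window=3, global_idx=[0])
out = masked_attention(Q, K, V, mask)
print(mask.astype(int))
print(out.shape)  # (8, 16)
```

Note that this dense-mask sketch still materializes the full $n \times n$ score matrix, so it only illustrates the attention pattern; the actual Longformer implementation uses banded, chunked attention kernels so that time and memory scale linearly with sequence length.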

Source: Longformer: The Long-Document Transformer

Latest Papers

PAPER | AUTHORS | DATE
Longformer for MS MARCO Document Re-ranking Task | Ivan Sekulić, Amir Soleimani, Mohammad Aliannejadi, Fabio Crestani | 2020-09-20
Efficient Transformers: A Survey | Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler | 2020-09-14
Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity | Anant Khandelwal | 2020-07-15
Document Classification for COVID-19 Literature | Bernal Jiménez Gutiérrez, Juncheng Zeng, Dongdong Zhang, Ping Zhang, Yu Su | 2020-06-15
Longformer: The Long-Document Transformer | Iz Beltagy, Matthew E. Peters, Arman Cohan | 2020-04-10
