Dilated Sliding Window Attention

Introduced by Beltagy et al. in Longformer: The Long-Document Transformer

Dilated Sliding Window Attention is a sparse attention pattern for attention-based models, proposed as part of the Longformer architecture. It is motivated by the fact that full (non-sparse) self-attention in the original Transformer has $O\left(n^{2}\right)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs.

Compared with a plain Sliding Window Attention pattern, making the sliding window "dilated" further increases the receptive field without increasing computation. This is analogous to dilated CNNs: the window has gaps of size $d$ (the dilation) between attended positions. Assuming a fixed dilation $d$ and window size $w$ for all $l$ layers, the receptive field is $l \times d \times w$, which can reach tens of thousands of tokens even for small values of $d$.
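As a rough illustration of the pattern described above, the following minimal sketch builds a boolean attention mask for a dilated window and evaluates the receptive-field formula. The function names (`dilated_window_mask`, `attended`, `receptive_field`) are hypothetical, and the symmetric window with $w/2$ positions on each side is an assumption made for this sketch; it is not the Longformer implementation, which uses custom banded kernels rather than a dense mask.

```python
def dilated_window_mask(n, w, d):
    """Return an n x n boolean mask; mask[i][j] is True if query i may attend to key j.

    Assumption for this sketch: a symmetric window with w // 2 attended
    positions on each side of the query, spaced d tokens apart (d = 1 gives
    an ordinary sliding window).
    """
    half = w // 2
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for k in range(-half, half + 1):
            j = i + k * d          # dilation leaves gaps of size d between keys
            if 0 <= j < n:         # drop positions that fall outside the sequence
                mask[i][j] = True
    return mask


def attended(mask, i):
    """List the key positions that query i attends to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


def receptive_field(num_layers, w, d):
    """Receptive field l * d * w for a fixed w and d across all layers."""
    return num_layers * d * w


plain = dilated_window_mask(n=16, w=4, d=1)
dilated = dilated_window_mask(n=16, w=4, d=2)

print(attended(plain, 8))    # [6, 7, 8, 9, 10]
print(attended(dilated, 8))  # [4, 6, 8, 10, 12]

# Illustrative values only (not taken from the source): 12 layers, w = 512, d = 2.
print(receptive_field(12, 512, 2))  # 12288
```

Both masks give the query the same number of keys, so the per-token cost is unchanged, while the dilated window spans twice as many positions.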

Source: Longformer: The Long-Document Transformer

Tasks

Task	Papers	Share
Language Modelling	13	9.42%
Sentence	10	7.25%
Document Classification	10	7.25%
Question Answering	9	6.52%
Classification	6	4.35%
Text Classification	5	3.62%
Natural Language Inference	5	3.62%
Abstractive Text Summarization	5	3.62%
Text Summarization	4	2.90%

Categories

Attention Patterns