Transformers

The BP-Transformer (BPT) is a Transformer variant motivated by the need for a better balance between the capability and the computational complexity of self-attention. The architecture partitions the input sequence into multi-scale spans via binary partitioning (BP) and incorporates an inductive bias of attending to context from fine-grained to coarse-grained as the relative distance increases: the farther away the context is, the coarser its representation. BPT can be regarded as a graph neural network whose nodes are the multi-scale spans. A token node attends to smaller-scale spans for close context and to larger-scale spans for distant context, and the node representations are updated with Graph Self-Attention. A small sketch of the span construction follows below.

Source: BP-Transformer: Modelling Long-Range Context via Binary Partitioning
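A minimal sketch in plain Python of the binary-partitioning idea described above: the sequence is split recursively into spans, and a token attends to token-level spans for nearby context and to exponentially larger spans as the distance grows. The span alignment and the per-scale fan-out k used here are simplifying assumptions for illustration, not the paper's exact connectivity.

    def binary_partition(lo, hi, spans=None):
        # Collect all (lo, hi) spans of the binary partition tree over [lo, hi).
        if spans is None:
            spans = []
        spans.append((lo, hi))
        if hi - lo > 1:
            mid = (lo + hi) // 2
            binary_partition(lo, mid, spans)
            binary_partition(mid, hi, spans)
        return spans

    def attended_spans(pos, n, k=1):
        # Spans a token at `pos` attends to: k token-level spans on each side,
        # then k spans of size 2, 4, ... further out (fine to coarse).
        out = [(pos, pos + 1)]           # the token itself
        size = 1
        left, right = pos, pos + 1       # boundaries of context already covered
        while left > 0 or right < n:
            for _ in range(k):
                if left > 0:             # next span of `size` to the left
                    out.append((max(left - size, 0), left))
                    left = max(left - size, 0)
                if right < n:            # next span of `size` to the right
                    out.append((right, min(right + size, n)))
                    right = min(right + size, n)
            size *= 2                    # coarser spans as distance grows
        return out

    if __name__ == "__main__":
        print(binary_partition(0, 8))            # all nodes of the partition tree
        print(attended_spans(pos=5, n=16, k=1))  # fine-to-coarse neighbourhood of one token

In a full model, each span node would carry a vector representation and the edges returned by attended_spans would define the sparse attention pattern used by Graph Self-Attention; the number of attended nodes grows only logarithmically with sequence length.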

Tasks


Task Papers Share
Language Modelling 1 20.00%
Machine Translation 1 20.00%
Sentiment Analysis 1 20.00%
Text Classification 1 20.00%
Translation 1 20.00%