Routing Transformer

Introduced by Roy et al. in Efficient Content-Based Sparse Attention with Routing Transformers

The Routing Transformer is a Transformer that endows self-attention with a sparse routing module based on online k-means. Each attention module considers a clustering of the space: the current timestep attends only to context belonging to the same cluster. In other words, the query at the current timestep is routed to a limited number of context elements through its cluster assignment.
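The routing idea described above can be illustrated with a minimal, single-head sketch in plain Python. It is not the authors' implementation. It assumes fixed centroids (the online k-means update is omitted), assigns each vector to the centroid with the largest dot product, and applies a causal mask. The names `nearest_centroid` and `routing_attention` are hypothetical and chosen only for this example.

```python
import math


def nearest_centroid(vector, centroids):
    """Return the index of the centroid with the largest dot product with `vector`."""
    scores = [sum(v * c for v, c in zip(vector, centroid)) for centroid in centroids]
    return max(range(len(centroids)), key=lambda idx: scores[idx])


def routing_attention(queries, keys, values, centroids):
    """Causal attention in which position i attends only to positions j <= i
    whose key is assigned to the same cluster as query i."""
    dim = len(queries[0])
    value_dim = len(values[0])
    query_clusters = [nearest_centroid(q, centroids) for q in queries]
    key_clusters = [nearest_centroid(k, centroids) for k in keys]

    outputs = []
    for i, query in enumerate(queries):
        # Routed context: current or earlier positions in the query's cluster.
        routed = [j for j in range(i + 1) if key_clusters[j] == query_clusters[i]]
        if not routed:
            # No key shares the query's cluster: emit zeros (a choice made for this sketch).
            outputs.append([0.0] * value_dim)
            continue
        # Scaled dot-product scores over the routed positions only.
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in routed
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, routed)) / total
            for d in range(value_dim)
        ])
    return outputs, query_clusters


# Toy usage: two centroids, four timesteps, queries and keys shared (an assumption).
centroids = [[1.0, 0.0], [0.0, 1.0]]
x = [[2.0, 0.1], [0.1, 2.0], [1.5, 0.2], [0.3, 1.0]]
out, clusters = routing_attention(x, x, x, centroids)
print(clusters)  # [0, 1, 0, 1]
```

In this toy run, timestep 2 attends only to timesteps 0 and 2, which share its cluster, so each query scores a subset of the context rather than every earlier position.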

Source: Efficient Content-Based Sparse Attention with Routing Transformers

Tasks

Task	Papers	Share
Zero-Shot Learning	1	12.50%
Long Form Question Answering	1	12.50%
Open-Domain Dialog	1	12.50%
Open-Domain Question Answering	1	12.50%
Question Answering	1	12.50%
Text Generation	1	12.50%
Image Generation	1	12.50%
Language Modelling	1	12.50%

Components

Component	Type
Adam	Stochastic Optimization
Dense Connections	Feedforward Networks
Dropout	Regularization
Layer Normalization	Normalization
Multi-Head Attention	Attention Modules
ReLU	Activation Functions
Residual Connection	Skip Connections
Routing Attention	Attention Patterns
Scaled Dot-Product Attention	Attention Mechanisms
Softmax	Output Functions

Categories

Transformers

Autoregressive Transformers