GLU Variants Improve Transformer

12 Feb 2020 · Noam Shazeer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
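To make the construction concrete, the sketch below shows a GLU-style feed-forward sublayer of the kind the abstract describes: the component-wise product of two linear projections, with one projection passed through a gating function (sigmoid for the original GLU, or another nonlinearity such as ReLU, GELU, or Swish for the variants). This is a minimal illustration in PyTorch; the class name `GLUFeedForward` and the specific hyperparameters are assumptions for the example, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward sublayer with a GLU-style gate.

    Computes activation(x W) * (x V), followed by an output projection,
    where `activation` is sigmoid for the original GLU, identity for a
    bilinear variant, ReLU for ReGLU, GELU for GEGLU, or SiLU/Swish for
    SwiGLU.
    """

    def __init__(self, d_model: int, d_ff: int, activation=torch.sigmoid):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)    # linear branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output projection
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.activation(self.w(x)) * self.v(x))


# Example: a GEGLU-style feed-forward block (GELU as the gating function).
ffn = GLUFeedForward(d_model=512, d_ff=2048, activation=F.gelu)
y = ffn(torch.randn(4, 16, 512))  # (batch, sequence length, d_model)
```

Swapping the `activation` argument switches between the variants; the gated branch and the linear branch otherwise share the same structure.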


