Parallel Layers

Introduced by Chowdhery et al. in PaLM: Scaling Language Modeling with Pathways

• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as:
y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))

Whereas the parallel formulation can be written as:
y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))

The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at the 8B scale but no quality degradation at the 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.

Source: PaLM: Scaling Language Modeling with Pathways
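Below is a minimal PyTorch sketch that implements the two formulas above literally, to make the difference in dataflow concrete. The module and dimension names (`Block`, `d_model`, `d_ff`) are illustrative assumptions, not PaLM's actual implementation; in particular, the fusion of the attention and MLP input projections that gives the speedup is not performed here.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Shared sub-modules; the two forward methods implement the
    serialized and parallel formulations exactly as written above."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def attention(self, h: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(h, h, h, need_weights=False)
        return out

    def forward_serialized(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
        return x + self.mlp(self.ln2(x + self.attention(self.ln1(x))))

    def forward_parallel(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
        # A single LayerNorm output feeds both branches, which is what allows
        # the MLP and attention input matrix multiplications to be fused in a
        # real implementation (not done in this sketch).
        h = self.ln1(x)
        return x + self.mlp(h) + self.attention(h)

# Usage: both formulations map a (batch, sequence, d_model) tensor to the same shape.
x = torch.randn(2, 16, 512)
blk = Block()
y_serial = blk.forward_serialized(x)
y_parallel = blk.forward_parallel(x)
```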
