Parallel Layers

Introduced by Chowdhery et al. in PaLM: Scaling Language Modeling with Pathways

• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as:
y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))

Whereas the parallel formulation can be written as:
y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))

The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at the 8B scale but no quality degradation at the 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.

Source: PaLM: Scaling Language Modeling with Pathways
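Below is a minimal PyTorch sketch that implements the two formulas above literally, to make the difference in dataflow concrete. The module and dimension names (`Block`, `d_model`, `d_ff`) are illustrative assumptions, not PaLM's actual implementation; in particular, the fusion of the attention and MLP input projections that gives the speedup is not performed here.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Shared sub-modules; the two forward methods implement the
    serialized and parallel formulations exactly as written above."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def attention(self, h: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(h, h, h, need_weights=False)
        return out

    def forward_serialized(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
        return x + self.mlp(self.ln2(x + self.attention(self.ln1(x))))

    def forward_parallel(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
        # A single LayerNorm output feeds both branches, which is what allows
        # the MLP and attention input matrix multiplications to be fused in a
        # real implementation (not done in this sketch).
        h = self.ln1(x)
        return x + self.mlp(h) + self.attention(h)

# Usage: both formulations map a (batch, sequence, d_model) tensor to the same shape.
x = torch.randn(2, 16, 512)
blk = Block()
y_serial = blk.forward_serialized(x)
y_parallel = blk.forward_parallel(x)
```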
