Layer Normalization

Introduced by Ba et al. in Layer Normalization

Unlike batch normalization, Layer Normalization directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for RNNs and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with Transformer models.

We compute the layer normalization statistics over all the hidden units in the same layer as follows:

$$ \mu^{l} = \frac{1}{H}\sum^{H}_{i=1}a_{i}^{l} $$

$$ \sigma^{l} = \sqrt{\frac{1}{H}\sum^{H}_{i=1}\left(a_{i}^{l}-\mu^{l}\right)^{2}} $$

where $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.

Source: Layer Normalization

Read Paper See Code

Papers

Paper	Code	Results	Date	Stars

Tasks

Task	Papers	Share
Language Modelling	52	6.63%
Retrieval	38	4.85%
Question Answering	32	4.08%
Semantic Segmentation	29	3.70%
Large Language Model	27	3.44%
Object Detection	13	1.66%
Image Segmentation	13	1.66%
Sentence	13	1.66%
In-Context Learning	13	1.66%

Usage Over Time

This feature is experimental; we are continuously improving our matching algorithm.

Components

Component	Type	Add Remove
🤖 No Components Found	You can add them if they exist; e.g. Mask R-CNN uses RoIAlign

Categories

Add Remove

Normalization