Stochastic Optimization

Adafactor

Introduced by Shazeer and Stern in Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. It achieves this by maintaining a factored representation of the squared-gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, Adafactor reconstructs at each training step a low-rank approximation of the exponentially smoothed accumulator that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirement from $O(nm)$ to $O(n + m)$.
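
The sketch below illustrates the factored accumulator on a single matrix gradient. It is a minimal NumPy illustration of the idea rather than a reference implementation; the function and variable names are chosen here for clarity and do not come from the paper's code.

```python
import numpy as np

def factored_second_moment(G, R, C, beta2, eps1=1e-30):
    # G: current n x m gradient; R, C: running row/column statistics
    # (shapes (n,) and (m,)) -- these vectors are all that must be stored,
    # giving O(n + m) memory instead of O(n m).
    sq = np.square(G) + eps1                         # regularized squared gradient
    R = beta2 * R + (1.0 - beta2) * sq.sum(axis=1)   # moving average of row sums
    C = beta2 * C + (1.0 - beta2) * sq.sum(axis=0)   # moving average of column sums
    V_hat = np.outer(R, C) / R.sum()                 # rank-1 reconstruction of the n x m accumulator
    return R, C, V_hat
```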

Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_t\}_{t=1}^{T}$, the authors define it in terms of relative step sizes $\{\rho_t\}_{t=1}^{T}$, which are multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$. This lower bound allows zero-initialized parameters to escape 0.
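
As a rough sketch of how a relative step size $\rho_t$ is turned into an absolute one (illustrative names, assuming a NumPy array for the parameter):

```python
import numpy as np

def absolute_step_size(X, rho_t, eps2=1e-3):
    # Parameter scale = root-mean-square of the entries of X,
    # lower-bounded by eps2 so zero-initialized parameters can still move.
    scale = max(eps2, np.sqrt(np.mean(np.square(X))))
    return rho_t * scale  # alpha_t, the absolute step size
```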

Proposed hyperparameters are: $\epsilon_{1} = 10^{-30}$, $\epsilon_{2} = 10^{-3}$, $d=1$, $\rho_{t} = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$, $\hat{\beta}_{2_{t}} = 1 - t^{-0.8}$.
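
Putting the pieces together, one training step for a matrix-valued parameter with these hyperparameters looks roughly as follows. This is a hedged sketch (no first-moment accumulator, matching the paper's default setting); `adafactor_step` is a hypothetical helper written for illustration, not the API of any particular library. The clipping line shows where the threshold $d$ enters.

```python
import numpy as np

def adafactor_step(X, G, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    # One update of an n x m parameter X given gradient G at step t >= 1,
    # using the proposed schedules for rho_t and beta2_t.
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                    # relative step size
    beta2_t = 1.0 - t ** (-0.8)                            # accumulator decay
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t  # absolute step size

    sq = np.square(G) + eps1
    R = beta2_t * R + (1.0 - beta2_t) * sq.sum(axis=1)     # row statistics
    C = beta2_t * C + (1.0 - beta2_t) * sq.sum(axis=0)     # column statistics
    V_hat = np.outer(R, C) / R.sum()                       # factored second-moment estimate

    U = G / np.sqrt(V_hat)                                 # preconditioned update
    U /= max(1.0, np.sqrt(np.mean(U ** 2)) / d)            # update clipping at threshold d
    return X - alpha_t * U, R, C
```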

Source: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost


Tasks


Task Papers Share
Language Modelling 96 9.53%
Question Answering 63 6.26%
Sentence 45 4.47%
Text Generation 44 4.37%
Retrieval 31 3.08%
Translation 30 2.98%
Machine Translation 24 2.38%
Natural Language Understanding 20 1.99%
Semantic Parsing 19 1.89%
