Adafactor

Introduced by Shazeer and Stern in Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. It achieves this by maintaining a factored representation of the squared-gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, it reconstructs a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n \times m$ matrix, this reduces the memory requirement from $O(nm)$ to $O(n + m)$.
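As a rough sketch of this factorization (assuming NumPy and illustrative variable names not taken from any reference implementation), the following keeps only per-row and per-column moving averages for a matrix-valued gradient and reconstructs the full accumulator from their outer product:

```python
import numpy as np

def factored_second_moment(R, C, grad, beta2_hat, eps1=1e-30):
    """One step of a factored squared-gradient accumulator (sketch).

    R holds running row sums (shape [n]); C holds running column sums
    (shape [m]) for an n x m parameter, so memory is O(n + m) instead of O(nm).
    """
    sq = grad ** 2 + eps1                                     # regularized squared gradient
    R = beta2_hat * R + (1.0 - beta2_hat) * sq.sum(axis=1)    # row-sum moving average
    C = beta2_hat * C + (1.0 - beta2_hat) * sq.sum(axis=0)    # column-sum moving average
    # Rank-1 reconstruction of the n x m accumulator; sum(R) == sum(C)
    # up to floating-point error, so either can serve as the normalizer.
    V_hat = np.outer(R, C) / R.sum()
    return R, C, V_hat
```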

Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_t\}_{t=1}^T$, the authors define it in terms of relative step sizes $\{\rho_t\}_{t=1}^T$, which are multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_2$; this lower bound allows zero-initialized parameters to escape 0.
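A minimal sketch of this rule, assuming (as in the paper) that the absolute step size is the relative step size times the parameter scale; the helper names here are illustrative:

```python
import numpy as np

def rms(x):
    """Root-mean-square of a tensor's components."""
    return np.sqrt(np.mean(x ** 2))

def absolute_step_size(param, rho_t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(param)) * rho_t.

    The eps2 floor keeps the step size non-zero for zero-initialized
    parameters, letting them move away from 0.
    """
    return max(eps2, rms(param)) * rho_t
```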

Proposed hyperparameters are: $\epsilon_{1} = 10^{-30}$, $\epsilon_{2} = 10^{-3}$, $d = 1$, $\rho_{t} = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right)$, $\hat{\beta}_{2_{t}} = 1 - t^{-0.8}$, where $d$ is the update-clipping threshold.
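Putting these pieces together, the sketch below performs one Adafactor-style update for a single matrix parameter with the hyperparameters above (no momentum; a simplified reading of the paper's matrix-case algorithm rather than a drop-in optimizer):

```python
import numpy as np

def adafactor_matrix_step(X, R, C, grad, t, eps1=1e-30, eps2=1e-3, d=1.0):
    """One update of an n x m parameter X at step t (t starts at 1). Sketch only."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                     # relative step size
    beta2_hat = 1.0 - t ** (-0.8)                           # second-moment decay
    alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t   # scale by RMS of the parameters

    # Factored second-moment accumulator (row and column sums only).
    sq = grad ** 2 + eps1
    R = beta2_hat * R + (1.0 - beta2_hat) * sq.sum(axis=1)
    C = beta2_hat * C + (1.0 - beta2_hat) * sq.sum(axis=0)
    V_hat = np.outer(R, C) / R.sum()

    U = grad / np.sqrt(V_hat)                               # adaptive update direction
    U = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)          # clip update RMS at threshold d
    return X - alpha_t * U, R, C
```

With this schedule, $\hat{\beta}_{2_{t}}$ approaches 1 over training, so the accumulator averages over an increasingly long window, while the relative step size decays as $1/\sqrt{t}$ once that falls below $10^{-2}$.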

Source: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Latest Papers

Scientific Claim Verification with VERT5ERINI
Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin
2020-10-22
mT5: A massively multilingual pre-trained text-to-text transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
2020-10-22
Parameter Norm Growth During Training of Transformers
William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith
2020-10-19
Chatbot Interaction with Artificial Intelligence: Human Data Augmentation with T5 and Language Transformer Ensemble for Text Classification
Jordan J. Bird, Anikó Ekárt, Diego R. Faria
2020-10-12
TextSETTR: Label-Free Text Style Extraction and Tunable Targeted Restyling
Parker Riley, Noah Constant, Mandy Guo, Girish Kumar, David Uthus, Zarana Parekh
2020-10-08
Converting the Point of View of Messages Spoken to Virtual Assistants
Isabelle G. Lee, Vera Zu, Sai Srujana Buddi, Dennis Liang, Jack G. M. FitzGerald
2020-10-06
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Pascale Fung
2020-09-25
Robustification of Segmentation Models Against Adversarial Perturbations In Medical Imaging
Hanwool Park, Amirhossein Bayat, Mohammad Sabokrou, Jan S. Kirschke, Bjoern H. Menze
2020-09-23
UCD-CS at W-NUT 2020 Shared Task-3: A Text to Text Approach for COVID-19 Event Extraction on Social Media
Congcong Wang, David Lillis
2020-09-21
Efficient Transformers: A Survey
Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
2020-09-14
PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo
2020-08-20
Lite Training Strategies for Portuguese-English and English-Portuguese Translation
Alexandre Lopes, Rodrigo Nogueira, Roberto Lotufo, Helio Pedrini
2020-08-20
Investigating Pretrained Language Models for Graph-to-Text Generation
Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, Iryna Gurevych
2020-07-16
HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections
Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
2020-07-12
Normalizador Neural de Datas e Endereços (Neural Normalizer for Dates and Addresses)
Gustavo Plensack, Paulo Finardi
2020-06-27
Text-to-Text Pre-Training for Data-to-Text Tasks
Mihir Kale
2020-05-21
$R^3$: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge
Tuhin Chakrabarty, Debanjan Ghosh, Smaranda Muresan, Nanyun Peng
2020-04-28
Evaluating Machines by their Real-World Language Use
Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi
2020-04-07
TTTTTackling WinoGrande Schemas
Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
2020-03-18
Neural Machine Translation with Joint Representation
Yanyang Li, Qiang Wang, Tong Xiao, Tongran Liu, Jingbo Zhu
2020-02-16
Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
2020-01-13
Make Lead Bias in Your Favor: Zero-shot Abstractive News Summarization
Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, Xuedong Huang
2019-12-25
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
2019-10-23
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Noam Shazeer, Mitchell Stern
2018-04-11
