LAMB is a layerwise adaptive large batch optimization technique. It provides a strategy for adapting the learning rate in large batch settings. LAMB uses Adam as the base algorithm and forms the update as:

$$r_{t} = \frac{m_{t}}{\sqrt{v_{t}} + \epsilon}$$ $$x_{t+1}^{\left(i\right)} = x_{t}^{\left(i\right)} - \eta_{t}\frac{\phi\left(\| x_{t}^{\left(i\right)} \|\right)}{\| r_{t}^{\left(i\right)} + \lambda x_{t}^{\left(i\right)} \|}\left(r_{t}^{\left(i\right)}+\lambda{x_{t}^{\left(i\right)}}\right) $$

Unlike LARS, the adaptivity of LAMB is two-fold: (i) per-dimension normalization with respect to the square root of the second moment, as in Adam, and (ii) layerwise normalization obtained through the layerwise trust ratio.

Source: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
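The two levels of adaptivity can be made concrete with a minimal NumPy sketch of the per-layer update above. This is an illustration only: the function name `lamb_step`, the default hyperparameters, the identity choice of the scaling function $\phi$, and the small constant added to the trust-ratio denominator are assumptions for readability, not values prescribed by the paper.

```python
import numpy as np

def lamb_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, wd=0.01, phi=lambda z: z):
    """One LAMB update for a single layer's parameter tensor x.

    Sketch under illustrative assumptions; hyperparameter defaults and
    phi = identity are placeholders.
    """
    # Adam-style first and second moment estimates with bias correction
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # (i) per-dimension normalization: r_t = m_t / (sqrt(v_t) + eps)
    r = m_hat / (np.sqrt(v_hat) + eps)

    # Decoupled weight decay, then (ii) the layerwise trust ratio
    update = r + wd * x
    trust_ratio = phi(np.linalg.norm(x)) / (np.linalg.norm(update) + eps)

    x_new = x - lr * trust_ratio * update
    return x_new, m, v
```

In a full optimizer this step would be applied separately to each layer's weights and biases, each with its own moment buffers, which is what yields the layerwise adaptivity described above.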

Latest Papers

A Transformer Based Pitch Sequence Autoencoder with MIDI Augmentation
Mingshuo Ding, Yinghao Ma
2020-10-15
An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models
Zihan Zhao, Yuncong Liu, Lu Chen, Qi Liu, Rao Ma, Kai Yu
2020-10-14
Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition
Yun He, Ziwei Zhu, Yin Zhang, Qin Chen, James Caverlee
2020-10-08
Pretrained Language Model Embryology: The Birth of ALBERT
David C. Chiang, Sung-Feng Huang, Hung-Yi Lee
2020-10-06
On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers
Marius Mosbach, Anna Khokhlova, Michael A. Hedderich, Dietrich Klakow
2020-10-06
BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context
Jean-Philippe Corbeil, Hadi Abdi Ghadivel
2020-09-25
BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition
Usman Naseem, Matloob Khushi, Vinay Reddy, Sakthivel Rajendran, Imran Razzak, Jinman Kim
2020-09-19
Learning Universal Representations from Word to Sentence
Yian Li, Hai Zhao
2020-09-10
Comparative Study of Language Models on Cross-Domain Data with Model Agnostic Explainability
Mayank Chhipa, Hrushikesh Mahesh Vazurkar, Abhijeet Kumar, Mridul Mishra
2020-09-09
ERNIE at SemEval-2020 Task 10: Learning Word Emphasis Selection by Pre-trained Language Model
Zhengjie Huang, Shikun Feng, Weiyue Su, Xuyi Chen, Shuohuan Wang, Jiaxiang Liu, Xuan Ouyang, Yu Sun
2020-09-08
UPB at SemEval-2020 Task 8: Joint Textual and Visual Modeling in a Multi-Task Learning Architecture for Memotion Analysis
George-Alexandru Vlad, George-Eduard Zaharia, Dumitru-Clementin Cercel, Costin-Gabriel Chiru, Stefan Trausan-Matu
2020-09-06
Variants of BERT, Random Forests and SVM approach for Multimodal Emotion-Target Sub-challenge
Hoang Manh Hung, Hyung-Jeong Yang, Soo-Hyung Kim, Guee-Sang Lee
2020-07-28
Deep Learning Brasil -- NLP at SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets
Manoel Veríssimo dos Santos Neto, Ayrton Denner da Silva Amaral, Nádia Félix Felipe da Silva, Anderson da Silva Soares
2020-07-28
Mono vs Multilingual Transformer-based Models: a Comparison across Several Language Tasks
Diego de Vargas Feijo, Viviane Pereira Moreira
2020-07-19
LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation using Pretraining Language Model
Shilei Liu, Yu Guo, Bochao Li, Feiliang Ren
2020-07-06
A Transformer Approach to Contextual Sarcasm Detection in Twitter
Hunter Gregory, Steven Li, Pouya Mohammadi, Natalie Tarn, Rachel Draelos, Cynthia Rudin
2020-07-01
Deep Investing in Kyle's Single Period Model
Paul Friedrich, Josef Teichmann
2020-06-24
Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes
Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li
2020-06-24
Adaptive Learning Rates with Maximum Variation Averaging
Chen Zhu, Yu Cheng, Zhe Gan, Furong Huang, Jingjing Liu, Tom Goldstein
2020-06-21
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow
2020-06-08
BERT Loses Patience: Fast and Robust Inference with Early Exit
Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei
2020-06-07
Scaling Distributed Training with Adaptive Summation
Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum
2020-06-04
BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text
Tanvi Dadu, Kartikey Pant, Radhika Mamidi
2020-06-01
Language Representation Models for Fine-Grained Sentiment Classification
Brian Cheang, Bailey Wei, David Kogan, Howey Qiu, Masud Ahmed
2020-05-27
Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation
Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Shang-Wen Li, Hung-yi Lee
2020-05-18
ImpactCite: An XLNet-based method for Citation Impact Analysis
Dominique Mercier, Syed Tahseen Raza Rizvi, Vikas Rajashekar, Andreas Dengel, Sheraz Ahmed
2020-05-05
TAVAT: Token-Aware Virtual Adversarial Training for Language Understanding
Linyang Li, Xipeng Qiu
2020-04-30
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, Xiangzhan Yu
2020-04-27
UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Gregor Wiedemann, Seid Muhie Yimam, Chris Biemann
2020-04-23
Investigating the Effectiveness of Representations Based on Pretrained Transformer-based Language Models in Active Learning for Labelling Text Datasets
Jinghui Lu, Brian MacNamee
2020-04-21
Gestalt: a Stacking Ensemble for SQuAD2.0
Mohamed El-Geish
2020-04-02
Deep Entity Matching with Pre-Trained Language Models
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan
2020-04-01
Retrospective Reader for Machine Reading Comprehension
Zhuosheng Zhang, Junjie Yang, Hai Zhao
2020-01-27
PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, Ashish Verma
2020-01-24
Perceiving the arrow of time in autoregressive motion
Kristof Meding, Dominik Janzing, Bernhard Schölkopf, Felix A. Wichmann
2019-12-01
Single Headed Attention RNN: Stop Thinking With Your Head
Stephen Merity
2019-11-26
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
2019-09-26
NEZHA: Neural Contextualized Representation for Chinese Language Understanding
Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, Qun Liu
2019-08-31
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh
2019-04-01

Components

COMPONENT: Adam
TYPE: Stochastic Optimization

Categories