TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	ResNet-50	Top 1 Accuracy	72.1%	# 927
Language Modelling	WikiText-103	Transformer (Adaptive inputs)	Validation perplexity	19.5	# 18
Machine Translation	WMT2016 English-German	Transformer	BLEU score	26.7	# 6


On the adequacy of untuned warmup for adaptive optimization

9 Oct 2019 · Jerry Ma, Denis Yarats

Adaptive optimization algorithms such as Adam are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, recent work proposes automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.
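The default recommended at the end of the abstract can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the step counter is assumed to start at 1, and the warmup period $2 / (1 - \beta_2)$ is assumed to be rounded to the nearest integer. The warmup factor is taken to be $\min(1, t / \tau)$ with $\tau = 2 / (1 - \beta_2)$, and it multiplies the base learning rate.

```python
def untuned_linear_warmup_period(beta2: float) -> int:
    """Warmup length 2 / (1 - beta2), rounded to the nearest integer (an assumption)."""
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must lie in [0, 1)")
    return round(2.0 / (1.0 - beta2))


def linear_warmup_factor(step: int, warmup_period: int) -> float:
    """Learning-rate multiplier at a 1-indexed step: ramps linearly, then stays at 1."""
    if step < 1:
        raise ValueError("step is assumed to be 1-indexed")
    return min(1.0, step / warmup_period)


def warmed_up_lr(base_lr: float, step: int, beta2: float = 0.999) -> float:
    """Base learning rate scaled by the untuned linear warmup factor."""
    period = untuned_linear_warmup_period(beta2)
    return base_lr * linear_warmup_factor(step, period)


if __name__ == "__main__":
    # With Adam's common default beta2 = 0.999, the warmup spans 2000 iterations.
    for step in (1, 500, 1000, 2000, 5000):
        print(step, warmed_up_lr(1e-3, step))
```

With Adam's common default of $\beta_2 = 0.999$, this gives a 2000-iteration warmup, so no schedule hyperparameter has to be tuned separately from the optimizer's own settings.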

Code

Tony-Y/pytorch_warmup

Tasks

Image Classification

Language Modelling

Machine Translation

Datasets

ImageNet

WikiText-2

WikiText-103

WMT 2016

Results from the Paper

Ranked #6 on Machine Translation on WMT2016 English-German

Methods

1x1 Convolution • Absolute Position Encodings • Adam • Average Pooling • Batch Normalization • Bottleneck Residual Block • BPE • Convolution • Dense Connections • Dropout • Global Average Pooling • Kaiming Initialization • Label Smoothing • Layer Normalization • Linear Layer • Linear Warmup • Max Pooling • Multi-Head Attention • Position-Wise Feed-Forward Layer • RAdam • ReLU • Residual Block • Residual Connection • ResNet • Scaled Dot-Product Attention • Softmax • Transformer
