AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. In Adam, $L_{2}$ regularization is usually implemented with the following modification, where $w_{t}$ is the weight decay rate at time $t$:
$$ g_{t} = \nabla{f\left(\theta_{t}\right)} + w_{t}\theta_{t}$$
while AdamW instead removes the weight decay term from the gradient and applies it directly in the parameter update:
$$ \theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot\hat{m}_{t} + w_{t, i}\theta_{t, i}\right), \quad \forall{t}$$
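The decoupled update above can be sketched as a single step of a minimal AdamW implementation in NumPy. The function name, hyperparameter defaults, and state-passing convention are illustrative, not taken from any particular library:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update (illustrative sketch).

    Weight decay is applied directly to the parameters in the update,
    rather than being added to the gradient as in L2-regularized Adam.
    """
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the w * theta term sits outside the
    # adaptive 1/sqrt(v_hat) scaling applied to the gradient moment.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

Note that because the decay term is outside the adaptive scaling, parameters with large gradient variance are still decayed at the full rate, which is the behavioral difference from $L_{2}$-regularized Adam.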
Source: Decoupled Weight Decay Regularization

PAPER | DATE
Longformer for MS MARCO Document Re-ranking Task | 2020-09-20
Efficient Transformers: A Survey | 2020-09-14
Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity | 2020-07-15
Document Classification for COVID-19 Literature | 2020-06-15
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | 2020-06-01
Longformer: The Long-Document Transformer | 2020-04-10
Automated Pavement Crack Segmentation Using U-Net-based Convolutional Neural Network | 2020-01-07
Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | 2019-05-27
A unified theory of adaptive stochastic gradient descent as Bayesian filtering | 2019-05-01
Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods | 2018-07-19
Decoupled Weight Decay Regularization | 2017-11-14
TASK | PAPERS | SHARE
Document Classification | 1 | 9.09%
Language Modelling | 1 | 9.09%
Feature Engineering | 1 | 9.09%
Semantic Segmentation | 1 | 9.09%
Image Classification | 1 | 9.09%
Multi-Task Learning | 1 | 9.09%
Rumour Detection | 1 | 9.09%
Stance Detection | 1 | 9.09%
Twitter Event Detection | 1 | 9.09%