
Gradient Clipping

One difficulty that arises in optimizing deep neural networks is that large parameter gradients can lead an SGD update to move the parameters far into a region where the loss is much higher, effectively undoing much of the work that was needed to reach the current solution.

Gradient Clipping limits the magnitude of the gradients so that optimization behaves more reasonably near sharp areas of the loss surface. It can be performed in a number of ways. One option is to clip the parameter gradient element-wise before a parameter update. Another option is to clip the norm $||\textbf{g}||$ of the gradient $\textbf{g}$ before a parameter update:

$$\text{if } ||\textbf{g}|| > v \text{ then } \textbf{g} \leftarrow \frac{\textbf{g}v}{||\textbf{g}||}$$

where $v$ is a norm threshold.
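As a minimal sketch of the two variants (NumPy-based, with illustrative function names not taken from the source):

```python
import numpy as np

def clip_by_value(grad: np.ndarray, v: float) -> np.ndarray:
    """Element-wise clipping: force each gradient component into [-v, v]."""
    return np.clip(grad, -v, v)

def clip_by_norm(grad: np.ndarray, v: float) -> np.ndarray:
    """Norm clipping: if ||g|| > v, rescale g to g * v / ||g||,
    capping its length while preserving its direction."""
    norm = np.linalg.norm(grad)
    if norm > v:
        grad = grad * (v / norm)
    return grad
```

Note that norm clipping only shrinks the gradient's magnitude and keeps its direction, whereas element-wise clipping can change the direction of the update. Deep learning frameworks provide both variants; in PyTorch, for example, `torch.nn.utils.clip_grad_value_` and `torch.nn.utils.clip_grad_norm_`.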

Source: Deep Learning, Goodfellow et al.


Latest Papers

Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness
Vien V. Mai, Mikael Johansson
2021-02-12

High-Performance Large-Scale Image Recognition Without Normalization
Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan
2021-02-11

Robustness Threats of Differential Privacy
Nurislam Tursynbek, Aleksandr Petiushko, Ivan Oseledets
2020-12-14

GOAT: GPU Outsourcing of Deep Learning Training With Asynchronous Probabilistic Integrity Verification Inside Trusted Execution Environment
Aref Asvadishirehjini, Murat Kantarcioglu, Bradley Malin
2020-10-17

Facilitate the Parametric Dimension Reduction by Gradient Clipping
Chien-Hsun Lai, Yu-Shuen Wang
2020-09-30

Scaling up Differentially Private Deep Learning with Fast Per-Example Gradient Clipping
Jaewoo Lee, Daniel Kifer
2020-09-07

Training Deep Neural Networks Without Batch Normalization
Divya Gaur, Joachim Folz, Andreas Dengel
2020-08-18

AutoClip: Adaptive Gradient Clipping for Source Separation Networks
Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux
2020-07-25

Understanding Gradient Clipping in Private SGD: A Geometric Perspective
Xiangyi Chen, Zhiwei Steven Wu, Mingyi Hong
2020-06-27

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
Pengcheng Yin, Graham Neubig, Wen-tau Yih, Sebastian Riedel
2020-05-17

Differentially Private Generation of Small Images
Justus T. C. Schwabedal, Pascal Michel, Mario S. Riontino
2020-05-02

Can gradient clipping mitigate label noise?
Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
2020-05-01

Understanding Generalization in Recurrent Neural Networks
Zhuozhuo Tu, Fengxiang He, Dacheng Tao
2020-05-01

Removing Disparate Impact of Differentially Private Stochastic Gradient Descent on Model Accuracy
Depeng Xu, Wei Du, Xintao Wu
2020-03-08

A Self-Tuning Actor-Critic Algorithm
Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh
2020-02-28

Towards Unified INT8 Training for Convolutional Neural Network
Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan
2019-12-29

Why are Adaptive Methods Good for Attention Models?
Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, Suvrit Sra
2019-12-06

IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks
Michael Luo, Jiahao Yao, Richard Liaw, Eric Liang, Ion Stoica
2019-11-30

Compressive Transformers for Long-Range Sequence Modelling
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap
2019-11-13

TorchBeast: A PyTorch Platform for Distributed RL
Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, Edward Grefenstette
2019-10-08

CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher
2019-09-11

Why gradient clipping accelerates training: A theoretical justification for adaptivity
Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie
2019-05-28

Differential Privacy Has Disparate Impact on Model Accuracy
Eugene Bagdasaryan, Vitaly Shmatikov
2019-05-28

Towards Combining On-Off-Policy Methods for Real-World Applications
Kai-Chun Hu, Chen-Huan Pi, Ting Han Wei, I-Chen Wu, Stone Cheng, Yi-Wei Dai, Wei-Yuan Ye
2019-04-24

Feature Intertwiner for Object Detection
Hongyang Li, Bo Dai, Shaoshuai Shi, Wanli Ouyang, Xiaogang Wang
2019-03-28

Classification of Medication-Related Tweets Using Stacked Bidirectional LSTMs with Context-Aware Attention
Orest Xherija
2018-10-01

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu
2018-02-05

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller
2017-10-20

Riemannian approach to batch normalization
Minhyung Cho, Jaehyung Lee
2017-09-27

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li
2017-05-22

Language Modeling with Gated Convolutional Networks
Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier
2016-12-23

Improving Neural Language Models with a Continuous Cache
Edouard Grave, Armand Joulin, Nicolas Usunier
2016-12-13

A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition
Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, Hermann Ney
2016-06-22

Rethinking the Inception Architecture for Computer Vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna
2015-12-02
