A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

ICLR 2021 · Zeke Xie, Issei Sato, Masashi Sugiyama

Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks in practice. SGD is known to find flat minima, i.e., minima with a large neighboring region in parameter space in which the weight vectors all have similarly small error. However, it has been mathematically unclear how deep learning selects a flat minimum among the many available minima. To answer this question quantitatively, we develop a density diffusion theory (DDT) that reveals how minima selection depends quantitatively on minima sharpness, gradient noise, and hyperparameters. We verify the interesting fact that the stochastic gradient noise covariance is approximately proportional to the Hessian and inversely proportional to the batch size near minima. We prove that, thanks to stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, whereas Gradient Descent with injected white noise favors flat minima only polynomially more than sharp minima. We also prove that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima, where the exponent is governed by the ratio of batch size to learning rate, and thus cannot find flat minima efficiently in a realistic computational time.
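
To make the claimed exponential dependence concrete, below is a minimal 1D toy sketch (not the paper's experiments or its density diffusion derivation): noisy gradient descent in a quadratic well whose gradient-noise variance is taken proportional to the curvature and inversely proportional to the batch size, mimicking the Hessian-proportional noise covariance described above. All function names, curvature values, and hyperparameters here are illustrative assumptions, chosen only so that the contrast is visible in a short run.

import numpy as np

rng = np.random.default_rng(0)

def mean_escape_steps(curvature, lr=0.05, batch_size=4, barrier=0.2,
                      n_runs=200, max_steps=50_000):
    """Average number of SGD steps before the iterate leaves a quadratic well.

    Local loss model: L(x) = 0.5 * curvature * x^2; an 'escape' is recorded
    once L(x) exceeds `barrier`. The per-step gradient-noise variance is set
    to curvature / batch_size, mimicking a noise covariance proportional to
    the Hessian and inversely proportional to the batch size.
    Runs that never escape are counted at `max_steps` (a lower bound).
    """
    x = np.zeros(n_runs)
    escaped_at = np.full(n_runs, max_steps)
    still_in = np.ones(n_runs, dtype=bool)
    noise_std = np.sqrt(curvature / batch_size)
    for step in range(1, max_steps + 1):
        noise = rng.normal(0.0, noise_std, size=n_runs)
        # Update only the chains that have not yet escaped.
        x = np.where(still_in, x - lr * (curvature * x + noise), x)
        just_escaped = still_in & (0.5 * curvature * x**2 > barrier)
        escaped_at[just_escaped] = step
        still_in &= ~just_escaped
        if not still_in.any():
            break
    return escaped_at.mean()

print("sharp minimum (H=20), B=4:  ", mean_escape_steps(20.0, batch_size=4))
print("flat minimum  (H=1),  B=4:  ", mean_escape_steps(1.0, batch_size=4))
print("sharp minimum (H=20), B=128:", mean_escape_steps(20.0, batch_size=128))

Qualitatively, the sharp well (H=20) is left within a handful of steps, while the flat well (H=1) and the large-batch run both hit the step budget without escaping, reflecting the abstract's claims that SGD escapes sharp minima far faster than flat ones and that large-batch training suppresses escape altogether.
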
