A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

ICLR 2021 · Zeke Xie, Issei Sato, Masashi Sugiyama

Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks in practice. SGD is known to find flat minima, i.e., minima with a large neighboring region in parameter space in which the weight vectors all have similarly small error. However, it has been mathematically unclear how deep learning selects a flat minimum among the many available minima. To answer this question quantitatively, we develop a density diffusion theory (DDT) that reveals how minima selection depends quantitatively on minima sharpness, gradient noise, and hyperparameters. We verify the interesting fact that the stochastic gradient noise covariance is approximately proportional to the Hessian and inversely proportional to the batch size near minima. We prove that, thanks to stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, whereas Gradient Descent with injected white noise favors flat minima only polynomially more than sharp minima. We also prove that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima, where the exponent is governed by the ratio of batch size to learning rate, and thus cannot find flat minima efficiently in a realistic computational time.
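
To make the claimed exponential dependence concrete, below is a minimal 1D toy sketch (not the paper's experiments or its density diffusion derivation): noisy gradient descent in a quadratic well whose gradient-noise variance is taken proportional to the curvature and inversely proportional to the batch size, mimicking the Hessian-proportional noise covariance described above. All function names, curvature values, and hyperparameters here are illustrative assumptions, chosen only so that the contrast is visible in a short run.

import numpy as np

rng = np.random.default_rng(0)

def mean_escape_steps(curvature, lr=0.05, batch_size=4, barrier=0.2,
                      n_runs=200, max_steps=50_000):
    """Average number of SGD steps before the iterate leaves a quadratic well.

    Local loss model: L(x) = 0.5 * curvature * x^2; an 'escape' is recorded
    once L(x) exceeds `barrier`. The per-step gradient-noise variance is set
    to curvature / batch_size, mimicking a noise covariance proportional to
    the Hessian and inversely proportional to the batch size.
    Runs that never escape are counted at `max_steps` (a lower bound).
    """
    x = np.zeros(n_runs)
    escaped_at = np.full(n_runs, max_steps)
    still_in = np.ones(n_runs, dtype=bool)
    noise_std = np.sqrt(curvature / batch_size)
    for step in range(1, max_steps + 1):
        noise = rng.normal(0.0, noise_std, size=n_runs)
        # Update only the chains that have not yet escaped.
        x = np.where(still_in, x - lr * (curvature * x + noise), x)
        just_escaped = still_in & (0.5 * curvature * x**2 > barrier)
        escaped_at[just_escaped] = step
        still_in &= ~just_escaped
        if not still_in.any():
            break
    return escaped_at.mean()

print("sharp minimum (H=20), B=4:  ", mean_escape_steps(20.0, batch_size=4))
print("flat minimum  (H=1),  B=4:  ", mean_escape_steps(1.0, batch_size=4))
print("sharp minimum (H=20), B=128:", mean_escape_steps(20.0, batch_size=128))

Qualitatively, the sharp well (H=20) is left within a handful of steps, while the flat well (H=1) and the large-batch run both hit the step budget without escaping, reflecting the abstract's claims that SGD escapes sharp minima far faster than flat ones and that large-batch training suppresses escape altogether.
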
