Adaptive Gradient Methods Can Be Provably Faster than SGD with Random Shuffling
Adaptive gradient methods have been shown to outperform SGD in many neural network training tasks. However, this acceleration effect is yet to be explained in the non-convex setting, since the best known convergence rate of adaptive gradient methods in the literature is worse than that of SGD. In this paper, we prove that adaptive gradient methods exhibit an $\tilde{O}(T^{-1/2})$ convergence rate for finding first-order stationary points under some mild assumptions, which improves the previous best convergence results of adaptive gradient methods and SGD by factors of $O(T^{-1/4})$ and $O(T^{-1/6})$, respectively. In particular, we study two variants of AdaGrad with random shuffling and identify a novel consistency condition supported by empirical results. Our analysis suggests that the combination of random shuffling and adaptive learning rates gives rise to better convergence.
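To make the central ingredients concrete, the following is a minimal sketch of an AdaGrad-style method with random shuffling: each epoch visits every sample exactly once in a fresh random order, and steps use per-coordinate adaptive learning rates. This is an illustrative toy (a consistent least-squares problem with assumed hyperparameters), not the paper's exact variants or analyzed setting.

```python
import numpy as np

# Illustrative toy problem (assumption): consistent least-squares,
# f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2, minimized at x_star.
rng = np.random.default_rng(0)
n, d = 64, 5
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star

def grad_i(x, i):
    # Gradient of the i-th component loss 0.5 * (a_i . x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
G = np.zeros(d)          # AdaGrad accumulator: running sum of squared gradients
eta, eps = 0.5, 1e-8     # assumed step size and numerical-stability constant

for epoch in range(200):
    for i in rng.permutation(n):           # random shuffling: one full pass per epoch
        g = grad_i(x, i)
        G += g * g
        x -= eta * g / (np.sqrt(G) + eps)  # per-coordinate adaptive step

rel_res = float(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
print(rel_res)  # small relative residual indicates convergence
```

Because the per-coordinate steps shrink as the squared-gradient accumulator grows, the shuffled passes drive the iterate toward a stationary point without any hand-tuned learning-rate schedule, which is the mechanism the paper's analysis studies.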