Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

20 Nov 2019 · Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

When scaling distributed training, communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm for smooth but non-convex problems. Empirical results show that the proposed algorithm significantly reduces communication overhead, which in turn reduces training time by up to 30% on the 1B Word benchmark dataset.
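
As a rough illustration of the general idea (adaptive local SGD steps with only periodic synchronization), the following NumPy sketch simulates several workers that each run AdaGrad-style updates on their own data shard and average their parameters and accumulators only every H steps. The worker count, synchronization period H, learning rate, and least-squares objective are placeholder assumptions for illustration only; this is not the paper's exact Local AdaAlter algorithm.

```python
import numpy as np

# Illustrative sketch: local adaptive SGD with periodic averaging.
# Workers synchronize only every H steps, so communication happens
# 1/H times as often as in fully synchronous SGD.

rng = np.random.default_rng(0)

num_workers = 4      # number of simulated workers (assumed)
dim = 10             # model dimension (assumed)
H = 8                # local steps between synchronizations (assumed)
lr = 0.1             # base learning rate (assumed)
eps = 1e-8           # stability term in the adaptive denominator

# Synthetic least-squares problem, one data shard per worker.
true_w = rng.normal(size=dim)
shards = []
for _ in range(num_workers):
    X = rng.normal(size=(256, dim))
    y = X @ true_w + 0.01 * rng.normal(size=256)
    shards.append((X, y))

# Per-worker model replicas and AdaGrad-style accumulators.
w = [np.zeros(dim) for _ in range(num_workers)]
acc = [np.zeros(dim) for _ in range(num_workers)]

for step in range(1, 201):
    for k in range(num_workers):
        X, y = shards[k]
        idx = rng.integers(0, len(y), size=32)          # mini-batch indices
        grad = X[idx].T @ (X[idx] @ w[k] - y[idx]) / len(idx)
        acc[k] += grad ** 2                              # accumulate squared gradients
        w[k] -= lr * grad / (np.sqrt(acc[k]) + eps)      # adaptive local update

    if step % H == 0:                                    # periodic synchronization
        w_avg = np.mean(w, axis=0)
        acc_avg = np.mean(acc, axis=0)
        w = [w_avg.copy() for _ in range(num_workers)]
        acc = [acc_avg.copy() for _ in range(num_workers)]

print("distance to true_w:", np.linalg.norm(np.mean(w, axis=0) - true_w))
```

Synchronizing every H steps sends 1/H as many messages as fully synchronous SGD, which is where the communication savings described in the abstract come from.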
