no code implementations • NeurIPS 2021 • Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis
While it is possible to obtain a linear reduction in the variance by averaging all the stochastic gradients at every step, this requires a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism.
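The abstract contrasts averaging every stochastic gradient with communicating only occasionally. Below is a minimal Local SGD sketch on a toy quadratic problem; the per-worker objectives, step size, noise level, and synchronization interval are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy Local SGD sketch: each of n workers holds f_i(z) = 0.5*||z - c_i||^2 and
# runs H local SGD steps between averaging (communication) rounds, instead of
# averaging stochastic gradients at every single step.
rng = np.random.default_rng(0)
n, d, T, H, lr = 8, 5, 400, 20, 0.1      # workers, dimension, steps, sync interval, step size
centers = rng.normal(size=(n, d))        # c_i; the minimizer of the average objective is centers.mean(0)
z = np.zeros((n, d))                     # each worker's local iterate

for t in range(1, T + 1):
    noise = 0.1 * rng.normal(size=(n, d))
    grads = (z - centers) + noise        # stochastic gradient of f_i at the local iterate
    z -= lr * grads                      # every worker takes a local step
    if t % H == 0:                       # communicate only once every H steps
        z[:] = z.mean(axis=0)            # server averages the local iterates

print("communication rounds:", T // H)
print("averaged iterate:", np.round(z.mean(axis=0), 3))
print("true minimizer:  ", np.round(centers.mean(axis=0), 3))
```

With these toy settings the workers communicate T/H = 20 times instead of 400, while the averaged iterate still approaches the minimizer of the sum.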
no code implementations • 3 Jun 2020 • Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis
While the initial analysis of Local SGD showed it needs $\Omega(\sqrt{T})$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(nT)$, this has been successively improved in a string of papers, with the state-of-the-art requiring $\Omega\left( n \,\mathrm{poly}(\log T) \right)$ communications.
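As a rough illustration of the two bounds quoted above, the snippet below evaluates $\sqrt{T}$ against $n \log^2 T$ for a few values of $T$; the constants and the choice of $\log^2 T$ as the polynomial in $\log T$ are arbitrary assumptions, so only the growth trend in $T$ is meaningful, not the absolute numbers.

```python
import math

# Back-of-the-envelope look at how the two communication bounds grow with T.
# All constants, and the choice of log^2 T as the "polynomial in log T", are
# arbitrary illustrative assumptions; only the growth trend in T is meaningful.
def rounds_sqrt(T):
    return math.sqrt(T)                  # Omega(sqrt(T)) communication rounds

def rounds_polylog(T, n, power=2):
    return n * math.log(T) ** power      # Omega(n * poly(log T)) communication rounds

n = 16
for exp in (4, 6, 8, 10):
    T = 10 ** exp
    print(f"T=1e{exp:<2}  sqrt(T) ~ {rounds_sqrt(T):>9.0f}   n*log^2(T) ~ {rounds_polylog(T, n):>7.0f}")
```

The $\sqrt{T}$ count keeps growing with the number of local steps, while the $n\,\mathrm{poly}(\log T)$ count grows only logarithmically in $T$.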
1 code implementation • 9 Nov 2018 • Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis
We consider the standard model of distributed optimization of a sum of functions $F(\mathbf{z}) = \sum_{i=1}^n f_i(\mathbf{z})$, where node $i$ in a network holds the function $f_i(\mathbf{z})$.
Optimization and Control
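The model above, $F(\mathbf{z}) = \sum_{i=1}^n f_i(\mathbf{z})$ with $f_i$ held privately at node $i$, can be illustrated with a toy sketch of consensus-based distributed gradient descent on a ring network. This is a generic decentralized method for the same model, not the asynchronous gradient-push algorithm the paper actually studies; the local objectives, mixing matrix, and step sizes are made up for the example.

```python
import numpy as np

# Toy sketch of the model F(z) = sum_i f_i(z), where node i privately holds
# f_i(z) = 0.5*||A_i z - b_i||^2, solved with consensus-based distributed
# gradient descent on a ring network. (Illustrative only; not the gradient-push
# algorithm analyzed in the paper.)
rng = np.random.default_rng(1)
n, d = 6, 3
A = rng.normal(size=(n, 4, d))           # node i's private data A_i (4 x d)
b = rng.normal(size=(n, 4))              # node i's private data b_i

def grad_fi(i, z_i):
    return A[i].T @ (A[i] @ z_i - b[i])  # gradient of f_i at node i's iterate

# Doubly stochastic mixing matrix for a ring: each node averages with itself
# and its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

z = np.zeros((n, d))                     # z[i] is node i's local copy of the decision variable
for t in range(1, 2001):
    z = W @ z                            # consensus step: mix with ring neighbors
    step = 0.1 / np.sqrt(t)              # diminishing step size
    z = z - step * np.stack([grad_fi(i, z[i]) for i in range(n)])

print("node iterates (should nearly agree and approximate the minimizer of F):")
print(np.round(z, 3))
```

Each node only ever evaluates its own $f_i$ and exchanges iterates with its immediate neighbors, which is the defining feature of this distributed setup.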