Delay-Tolerant Local SGD for Efficient Distributed Training
Heavy communication for model synchronization is a major bottleneck when scaling distributed deep neural network training to many workers. Moreover, model synchronization can suffer from long delays in scenarios such as federated learning and geo-distributed training. It is therefore crucial that distributed training methods be both \textit{delay-tolerant} and \textit{communication-efficient}. However, existing works cannot address communication delay and bandwidth constraints simultaneously. To address this important and challenging problem, we propose a novel training framework, OLCO\textsubscript{3}, which achieves delay tolerance with a low communication budget by using stale information. OLCO\textsubscript{3} introduces novel staleness compensation and compression compensation to combat the influence of staleness and compression error. Theoretical analysis shows that OLCO\textsubscript{3} achieves the same sub-linear convergence rate as vanilla synchronous stochastic gradient descent (SGD). Extensive experiments on deep learning tasks verify the effectiveness of OLCO\textsubscript{3} and its advantages over existing works.
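To make the idea of compensating for compression error concrete, the following is a minimal sketch of generic error feedback with top-k sparsification. It is not the OLCO\textsubscript{3} algorithm, whose compensation rules are not given in this abstract; the function names `top_k_compress` and `compress_with_error_feedback`, the choice of top-k as the compressor, and the toy update vectors are all assumptions made for illustration.

```python
from typing import List, Tuple


def top_k_compress(vec: List[float], k: int) -> List[float]:
    """Hypothetical compressor: keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def compress_with_error_feedback(
    update: List[float], residual: List[float], k: int
) -> Tuple[List[float], List[float]]:
    """Add the leftover error from earlier rounds before compressing.

    Returns the compressed message to send and the new residual, i.e.
    whatever the compressor dropped this round.
    """
    corrected = [u + r for u, r in zip(update, residual)]
    sent = top_k_compress(corrected, k)
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual


# Toy run (assumed values): two local updates, one entry sent per round.
updates = [[0.5, -2.0, 0.25, 1.0], [0.5, 0.25, 0.25, -0.5]]
residual = [0.0, 0.0, 0.0, 0.0]
total_sent = [0.0, 0.0, 0.0, 0.0]
for u in updates:
    sent, residual = compress_with_error_feedback(u, residual, k=1)
    total_sent = [t + s for t, s in zip(total_sent, sent)]

# Nothing is lost: what was sent plus what is still held back
# equals the sum of all uncompressed updates.
true_sum = [sum(col) for col in zip(*updates)]
recovered = [t + r for t, r in zip(total_sent, residual)]
print(recovered == true_sum)
```

In this sketch the dropped entries are carried forward rather than discarded, so the compression error is delayed instead of accumulated. A delay-tolerant method would additionally have to account for messages that arrive several rounds late, which this sketch does not model.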