Accelerating DNN Training through Selective Localized Learning

1 Jan 2021  ·  Sarada Krithivasan, Sanchari Sen, Swagath Venkataramani, Anand Raghunathan

Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. We propose LoCal+SGD, a new algorithmic approach to accelerate DNN training by selectively combining localized or Hebbian learning within a Stochastic Gradient Descent (SGD) based training framework. Back-propagation is a computationally expensive process that requires 2 Generalized Matrix Multiply (GEMM) operations to compute the error and weight gradients for each layer. We alleviate this by selectively updating some layers' weights using localized learning rules that require only 1 GEMM operation per layer. Further, since the weight update is performed during the forward pass itself, the layer activations for the mini-batch do not need to be stored until the backward pass, resulting in a reduced memory footprint. Localized updates can substantially boost training speed, but need to be used selectively and judiciously in order to preserve accuracy and convergence. We address this challenge through the design of a Learning Mode Selection Algorithm, where all layers start with SGD, and as epochs progress, layers gradually transition to localized learning. Specifically, for each epoch, the algorithm identifies a Localized→SGD transition layer, which delineates the network into two regions. Layers before the transition layer use localized updates, while the transition layer and later layers use gradient-based updates. The trend in the weight updates made to the transition layer across epochs is used to determine how the boundary between SGD and localized updates is shifted in future epochs. We also propose a low-cost weak supervision mechanism that controls the learning rate of localized updates based on the overall training loss. We applied LoCal+SGD to 8 image recognition CNNs (including ResNet50 and MobileNetV2) across 3 datasets (CIFAR-10, CIFAR-100 and ImageNet). Our measurements on an Nvidia GTX 1080 Ti GPU demonstrate up to 1.5× improvement in end-to-end training time with ∼0.5% loss in Top-1 classification accuracy.
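
The per-layer cost argument in the abstract can be illustrated with a minimal sketch. The code below is not from the paper: the layer sizes, the learning rate eta, the function names, and the plain Hebbian rule W += eta * (x.T @ y) are illustrative assumptions. It contrasts a localized update applied during the forward pass of a fully connected layer, which needs one extra GEMM and lets the input activations be discarded immediately, with an SGD/back-propagation update for the same layer, which needs two GEMMs and requires the input activations to be kept until the backward pass.

    import numpy as np

    rng = np.random.default_rng(0)
    B, D_in, D_out = 64, 512, 256              # mini-batch size, layer fan-in / fan-out (illustrative)
    W = 0.01 * rng.standard_normal((D_in, D_out))
    eta = 1e-3                                 # localized-update learning rate (illustrative)

    def localized_forward_update(x, W, eta):
        # Forward pass plus a Hebbian-style update: one extra GEMM (x.T @ y),
        # and x does not need to be stored for a backward pass.
        y = x @ W                              # forward GEMM (needed by any training method)
        W += (eta / B) * (x.T @ y)             # single extra GEMM: pre/post activation correlation
        return y

    def sgd_backward_update(x, W, grad_y, eta):
        # Back-propagation update for the same layer: two GEMMs, and x must have
        # been kept in memory since the forward pass.
        grad_x = grad_y @ W.T                  # GEMM 1: propagate the error to the previous layer
        W -= (eta / B) * (x.T @ grad_y)        # GEMM 2: weight gradient
        return grad_x

    x = rng.standard_normal((B, D_in))
    y = localized_forward_update(x, W, eta)    # weights updated during the forward pass itself

In the localized case the mini-batch input x is consumed as soon as the layer's forward pass finishes, which is the source of the reduced memory footprint described in the abstract; the weak supervision mechanism would, per the abstract, additionally scale eta based on the overall training loss.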
