Parameter Continuation Methods for the Optimization of Deep Neural Networks

There are many extant methods for approximating the solutions of the non-convex optimization problems arising in deep neural networks, including stochastic gradient descent, RMSProp, AdaGrad, and Adam. In this paper, we propose a novel training strategy for deep neural networks based on ideas from numerical parameter continuation methods. Parameter continuation methods have a long history in many application domains, such as bifurcation analysis and the study of systems of differential equations. However, as far as we are aware, such powerful methods have seen relatively limited use in the optimization of deep neural networks. Specifically, we derive a homotopy formulation of standard activation functions. This derivation allows one to decompose the optimization of a deep neural network into a sequence of optimization problems, each of which is equipped with a good initial guess based upon the solution of the previous problem. The whole process is initiated by a closed-form solution to the first of these problems, provided by the homotopy formulation. Intuitively, there is a deep connection between our homotopy techniques and many ideas used in transfer and curriculum learning. However, our proposed methods leverage decades of theoretical and computational work on parameter continuation and can be viewed as an initial bridge between those techniques and deep neural networks. In particular, we propose a method that we call Natural Parameter Adaption Continuation with Secant approximation (NPACS). This method provides an effective optimization technique that uses standard algorithms such as Adam in a novel way to achieve faster and more stable convergence. We demonstrate the effectiveness of our method on standard benchmark problems, computing local minima more quickly and with lower training and test loss than current state-of-the-art techniques in a majority of cases.
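
To make the continuation idea concrete, the following is a minimal sketch, not the authors' implementation, of how a homotopy-parametrized activation and a natural-parameter continuation loop with a secant predictor might be combined with Adam in PyTorch. The names `HomotopyTanh` and `npacs_train`, the linear-to-tanh homotopy, the lambda schedule, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HomotopyTanh(nn.Module):
    """Homotopy activation h(x; lam) = (1 - lam) * x + lam * tanh(x).

    At lam = 0 the activation is the identity (the network is linear);
    at lam = 1 it recovers the standard tanh nonlinearity.
    """

    def __init__(self, lam: float = 0.0):
        super().__init__()
        self.lam = lam

    def forward(self, x):
        return (1.0 - self.lam) * x + self.lam * torch.tanh(x)


def npacs_train(model, loss_fn, data, targets,
                lams=(0.0, 0.25, 0.5, 0.75, 1.0),
                inner_steps=200, lr=1e-3):
    """Natural-parameter continuation in lam with a secant predictor.

    Each subproblem is warm-started from the previous solution and then
    corrected with a few Adam steps.
    """
    prev_solution = None
    curr_solution = None
    for lam in lams:
        # Move the homotopy parameter to its next value.
        for module in model.modules():
            if isinstance(module, HomotopyTanh):
                module.lam = lam

        # Secant predictor: extrapolate from the two most recent solutions
        # (assumes a uniform lam schedule for simplicity).
        if prev_solution is not None:
            predictor = 2.0 * curr_solution - prev_solution
            torch.nn.utils.vector_to_parameters(predictor, model.parameters())

        # Corrector: a handful of Adam steps on the lam-perturbed subproblem,
        # warm-started from the previous (possibly extrapolated) solution.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(inner_steps):
            optimizer.zero_grad()
            loss = loss_fn(model(data), targets)
            loss.backward()
            optimizer.step()

        prev_solution = curr_solution
        curr_solution = torch.nn.utils.parameters_to_vector(
            model.parameters()).detach().clone()
    return model


# Hypothetical usage on random data.
model = nn.Sequential(nn.Linear(10, 32), HomotopyTanh(), nn.Linear(32, 1))
X, y = torch.randn(256, 10), torch.randn(256, 1)
npacs_train(model, nn.MSELoss(), X, y)
```

In this sketch, setting lam = 0 makes every activation the identity, so the whole network collapses to a single linear map and an ordinary least-squares fit could supply the closed-form starting point mentioned in the abstract; subsequent subproblems then inherit their initial guesses from the preceding solutions.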
