Non-greedy Gradient-based Hyperparameter Optimization Over Long Horizons

28 Sep 2020  ·  Paul Micaelli, Amos Storkey

Gradient-based meta-learning has earned widespread popularity in few-shot learning, but remains broadly impractical for tasks with long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn meta-parameters online, but this introduces greediness, which comes with a significant performance drop. In this work, we enable non-greedy meta-learning of hyperparameters over long horizons by sharing hyperparameters that are contiguous in time, and by using the sign of hypergradients rather than their magnitude to indicate convergence. We implement this with forward-mode differentiation, which we extend to the popular momentum-based SGD optimizer. We demonstrate that the hyperparameters of this optimizer can be learned non-greedily without gradient degradation over $\sim 10^4$ inner gradient steps, while requiring only $\sim 10$ outer gradient steps. On CIFAR-10, we outperform greedy and random search methods for the same computational budget by nearly $10\%$. Code will be available upon publication.
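The abstract outlines the core mechanics: forward-mode differentiation through a momentum-SGD inner loop, and sign-based outer updates of the hyperparameters. As a rough illustration only, and not the authors' implementation, the sketch below carries the tangent of the parameters with respect to a single learning rate forward through a toy inner loop in JAX; the toy losses, function names, outer step size, and the choice of a single learning rate shared across all inner steps (rather than the paper's time-contiguous hyperparameter sharing) are all assumptions made for this example.

```python
# Hedged sketch (not the authors' code): forward-mode hypergradient of a
# validation loss w.r.t. the learning rate of momentum SGD, in JAX.
# Toy losses and all names are illustrative assumptions.
import jax
import jax.numpy as jnp

def train_loss(theta):
    # Toy quadratic training objective (stand-in for a network loss).
    return 0.5 * jnp.sum((theta - 2.0) ** 2)

def val_loss(theta):
    # Toy validation objective with a different optimum.
    return 0.5 * jnp.sum((theta - 1.5) ** 2)

grad_train = jax.grad(train_loss)
grad_val = jax.grad(val_loss)

def hvp(theta, vec):
    # Hessian-vector product of the training loss (forward-over-reverse).
    return jax.jvp(grad_train, (theta,), (vec,))[1]

def inner_loop(theta0, eta, mu, steps):
    """Run momentum SGD while carrying the tangents Z = d(theta)/d(eta) and
    W = d(velocity)/d(eta) forward in time (forward-mode differentiation),
    so memory does not grow with the number of inner steps."""
    theta, v = theta0, jnp.zeros_like(theta0)
    Z, W = jnp.zeros_like(theta0), jnp.zeros_like(theta0)
    for _ in range(steps):
        g = grad_train(theta)
        W = mu * W + hvp(theta, Z)   # tangent of the velocity update
        v = mu * v + g               # velocity update
        Z = Z - eta * W - v          # tangent of the parameter update
        theta = theta - eta * v      # parameter update
    return theta, Z

theta0 = jnp.array([0.0, 0.0])
eta, mu = 0.05, 0.9
for outer_step in range(10):
    theta_T, Z_T = inner_loop(theta0, eta, mu, steps=200)
    hypergrad = grad_val(theta_T) @ Z_T   # chain rule: dL_val / d(eta)
    # Sign-based outer update, in the spirit of the abstract's
    # "use the sign of hypergradients rather than their magnitude".
    eta = eta - 0.01 * jnp.sign(hypergrad)
    print(outer_step, float(val_loss(theta_T)), float(eta))
```

Because only the tangents of the parameters and velocity are propagated, the memory cost is constant in the number of inner steps, which is what makes the long horizons in the abstract feasible in this setting.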
