no code implementations • 27 Feb 2024 • Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto
In this work, we find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian (i.e. the sharpness) is largely independent of the width and depth of the network for a sustained period of training.
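The quantity tracked here, the sharpness, is the top eigenvalue of the loss Hessian with respect to the parameters. A minimal sketch of how one might estimate it via power iteration on Hessian-vector products is shown below; the model, data, and iteration count are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: estimate the sharpness (largest Hessian eigenvalue) of a training
# loss via power iteration on Hessian-vector products (double backprop).
import torch
import torch.nn as nn

def sharpness(model, loss_fn, x, y, iters=20):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Start from a random unit vector in parameter space.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: differentiate the gradient against v.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * u).sum() for h, u in zip(hv, v))  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eig.item()

# Toy usage on a small MLP with random data (purely illustrative).
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(32, 10), torch.randn(32, 1)
print(sharpness(model, nn.MSELoss(), x, y))
```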
1 code implementation • 3 Oct 2023 • Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
We answer this question in the affirmative by giving a particular construction of a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth.
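For intuition, the kind of architecture described, a deep stack of linear layers each followed by batch normalization, can be written down directly. The sketch below is an illustrative assumption about such a network (widths, depth, and the toy loss are made up for the example); it is not the paper's exact construction or its gradient bound.

```python
# Sketch: a deep MLP with linear (identity) activations, each linear layer
# followed by batch normalization, to illustrate the architecture family.
import torch
import torch.nn as nn

def linear_bn_mlp(width=128, depth=50):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    return nn.Sequential(*layers)

model = linear_bn_mlp()
x = torch.randn(64, 128, requires_grad=True)
loss = model(x).pow(2).mean()  # toy quadratic loss on the outputs
loss.backward()
# Inspect the input-gradient norm after backprop through 50 layers.
print(x.grad.norm().item())
```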