no code implementations • 20 Feb 2024 • Dongyang Fan, Bettina Messmer, Martin Jaggi
In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoE) models on validation performance, uncovering distinct influences at token and sequence levels.
2 code implementations • 26 May 2023 • Atli Kosson, Bettina Messmer, Martin Jaggi
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation.