Bias Decay Matters: Improving Large Batch Optimization with Connectivity Sharpness
As deep learning becomes increasingly computationally intensive, data parallelism has become essential for the efficient training of high-performance models. Accordingly, recent studies have explored methods for increasing the batch size used in training. Much of this work focuses on the learning rate, which determines the noise scale of parameter updates~\citep{goyal2017accurate, you2017large, You2020Large}, and has found that a high learning rate is essential for preserving both generalization performance and the flatness of the local minimizers~\citep{Jastrzebski2020The, cohen2021gradient, lewkowycz2020large}. However, to close the performance gap that still remains in large batch optimization, we study a method that directly controls the flatness of local minima. Toward this end, we define a new sharpness measure called \textit{Connectivity sharpness}, which is reparameterization-invariant and structurally separable. Armed with this measure, we experimentally find that the standard \textit{no bias decay heuristic}~\citep{goyal2017accurate, he2019bag}, which recommends leaving the bias parameters and the $\gamma$ and $\beta$ parameters of BN layers unregularized during training, is a crucial cause of performance degradation in large batch optimization. To mitigate this issue, we propose simple bias decay methods, including a novel adaptive one, and find that this simple remedy closes a large portion of the performance gap in large batch optimization.
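To make the heuristic under discussion concrete, here is a minimal sketch of how the conventional no-bias-decay grouping is typically implemented in PyTorch-style training code. This is an illustration of the standard convention, not the paper's adaptive method; the function name `split_param_groups` and the stand-in parameter objects in the demo are hypothetical. With `bias_decay=0.0` it reproduces the heuristic (biases and BN $\gamma$/$\beta$ undecayed); setting `bias_decay > 0` applies decay to those parameters as well, in the spirit of the simple remedy the abstract proposes.

```python
from types import SimpleNamespace


def split_param_groups(named_params, weight_decay=1e-4, bias_decay=0.0):
    """Split (name, tensor) pairs into two optimizer parameter groups.

    Pass model.named_parameters() from a PyTorch model. Biases and
    BatchNorm gamma/beta are 1-D tensors (BN's "weight" attribute is
    gamma), so they are routed to the second group, whose decay is
    controlled separately by bias_decay.
    """
    decay, no_decay = [], []
    for name, p in named_params:
        if getattr(p, "ndim", 0) <= 1 or name.endswith(".bias"):
            no_decay.append(p)   # biases, BN gamma/beta
        else:
            decay.append(p)      # conv/linear weight matrices
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": bias_decay},
    ]


# Demo with lightweight stand-ins for tensors (only .ndim is inspected);
# in real code these would come from model.named_parameters().
fake_params = [
    ("conv.weight", SimpleNamespace(ndim=4)),
    ("conv.bias", SimpleNamespace(ndim=1)),
    ("bn.weight", SimpleNamespace(ndim=1)),   # gamma
    ("bn.bias", SimpleNamespace(ndim=1)),     # beta
]
groups = split_param_groups(fake_params, weight_decay=1e-4, bias_decay=5e-4)
```

The resulting list of group dicts can be passed directly to an optimizer constructor, e.g. `torch.optim.SGD(groups, lr=0.1, momentum=0.9)`, since PyTorch optimizers accept per-group `weight_decay` overrides.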