Mish: A Self Regularized Non-Monotonic Activation Function

BMVC 2020 · Diganta Misra

We propose $\textit{Mish}$, a novel self-regularized non-monotonic activation function which can be mathematically defined as $f(x) = x\tanh(\mathrm{softplus}(x))$. As activation functions play a crucial role in the performance and training dynamics of neural networks, we validate Mish experimentally on several well-known benchmarks against the best combinations of architectures and activation functions. We also observe that data augmentation techniques have a favorable effect on benchmarks like ImageNet-1k and MS-COCO across multiple architectures. For example, Mish outperformed Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone by 2.1\% average precision ($AP_{50}^{val}$) on MS-COCO object detection, and outperformed ReLU on ResNet-50 by $\approx$1\% Top-1 accuracy on ImageNet-1k, while keeping all other network parameters and hyperparameters constant. Furthermore, we explore the mathematical formulation of Mish in relation to the Swish family of functions and propose an intuitive understanding of how its first-derivative behavior may act as a regularizer that aids the optimization of deep neural networks. Code is publicly available at https://github.com/digantamisra98/Mish.
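For readers who want to try the activation directly, the following is a minimal sketch of Mish in PyTorch, written straight from the definition $f(x) = x\tanh(\mathrm{softplus}(x))$ above. It is not the reference implementation from the linked repository, and the `Mish` module name and the drop-in replacement of `nn.ReLU` suggested in the comments are only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish as defined in the abstract: f(x) = x * tanh(softplus(x)),
    # where softplus(x) = ln(1 + exp(x)) (F.softplus is numerically stable).
    return x * torch.tanh(F.softplus(x))


class Mish(nn.Module):
    """Module wrapper so Mish can stand in for activations such as nn.ReLU
    inside an existing architecture (illustrative sketch only)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return mish(x)


if __name__ == "__main__":
    # Quick sanity check: Mish(0) = 0, negative inputs are bounded below,
    # and large positive inputs pass through nearly unchanged.
    x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(Mish()(x))
```

Newer PyTorch releases also ship a built-in `torch.nn.Mish`, which can be used in place of a hand-rolled module like the one above.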


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Image Classification | CIFAR-10 | ResNet 9 + Mish | Percentage correct | 94.05 | #148 |
| Image Classification | CIFAR-10 | ResNet v2-20 (Mish activation) | Percentage correct | 92.02 | #168 |
| Image Classification | CIFAR-100 | ResNet v2-110 (Mish activation) | Percentage correct | 74.41 | #148 |
| Image Classification | ImageNet | CSPResNeXt-50 + Mish | Top 1 Accuracy | 79.8% | #676 |
