Towards Effective 2-bit Quantization: Pareto-optimal Bit Allocation for Deep CNNs Compression

25 Sep 2019 · Zhe Wang, Jie Lin, Mohamed M. Sabry Aly, Sean I Young, Vijay Chandrasekhar, Bernd Girod ·

State-of-the-art quantization methods can compress deep neural networks down to 4 bits without losing accuracy. However, when it comes to 2 bits, the performance drop is still noticeable. One problem in these methods is that they assign equal bit rate to quantize weights and activations in all layers, which is not reasonable in the case of high rate compression (such as 2-bit quantization), as some of layers in deep neural networks are sensitive to quantization and performing coarse quantization on these layers can hurt the accuracy. In this paper, we address an important problem of how to optimize the bit allocation of weights and activations for deep CNNs compression. We first explore the additivity of output error caused by quantization and find that additivity property holds for deep neural networks which are continuously differentiable in the layers. Based on this observation, we formulate the optimal bit allocation problem of weights and activations in a joint framework and propose a very efficient method to solve the optimization problem via Lagrangian Formulation. Our method obtains excellent results on deep neural networks. It can compress deep CNN ResNet-50 down to 2 bits with only 0.7% accuracy loss. To the best our knowledge, this is the first paper that reports 2-bit results on deep CNNs without hurting the accuracy.

PDF Abstract