Patch-wise Mixed-Precision Quantization of Vision Transformer

11 May 2023 · Junrui Xiao, Zhikai Li, Lianwei Yang, Qingyi Gu

As emerging hardware begins to support mixed bit-width arithmetic, mixed-precision quantization is widely used to reduce the complexity of neural networks. However, Vision Transformers (ViTs) rely on complex self-attention computation to learn powerful feature representations, so mixed-precision quantization of ViTs remains challenging. In this paper, we propose a novel patch-wise mixed-precision quantization (PMQ) method for efficient inference of ViTs. Specifically, we design a lightweight global metric, which is faster to compute than those of existing methods, to measure the sensitivity of each component in ViTs to quantization errors. Moreover, we introduce a Pareto frontier approach to automatically allocate the optimal bit-precision according to the sensitivity. To further reduce the computational complexity of self-attention in the inference stage, we propose a patch-wise module to reallocate the bit-width of patches in each layer. Extensive experiments on the ImageNet dataset show that our method greatly reduces the search cost and facilitates the application of mixed-precision quantization to ViTs.
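To make the Pareto frontier idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each layer already has a sensitivity score per candidate bit-width (the abstract does not define the metric, so the numbers here are made up), that sensitivities add across layers, and that model cost is the total number of weight bits. The names `pareto_frontier`, `allocate_bits`, `sensitivity`, and `params` are hypothetical. The patch-wise reallocation step is not covered.

```python
from itertools import product


def pareto_frontier(points):
    """Keep (cost, sensitivity, config) points that no other point dominates.

    Lower is better for both cost and sensitivity.
    """
    frontier = []
    best_sens = float("inf")
    # Scan in order of increasing cost; a point survives only if it
    # strictly improves on the best sensitivity seen at lower or equal cost.
    for cost, sens, cfg in sorted(points, key=lambda p: (p[0], p[1])):
        if sens < best_sens:
            frontier.append((cost, sens, cfg))
            best_sens = sens
    return frontier


def allocate_bits(sensitivity, params, bit_choices, budget_bits):
    """Pick the frontier configuration with the lowest sensitivity within budget.

    sensitivity[i][b] is the assumed sensitivity of layer i at b bits;
    params[i] is the number of weights in layer i.
    Exhaustive enumeration is only viable for a handful of layers.
    """
    points = []
    for cfg in product(bit_choices, repeat=len(params)):
        cost = sum(n * b for n, b in zip(params, cfg))
        sens = sum(sensitivity[i][b] for i, b in enumerate(cfg))
        points.append((cost, sens, cfg))
    feasible = [p for p in pareto_frontier(points) if p[0] <= budget_bits]
    return min(feasible, key=lambda p: p[1]) if feasible else None


# Toy, invented inputs: three layers, 4-bit or 8-bit weights.
params = [100, 200, 100]
sensitivity = [
    {4: 0.9, 8: 0.1},   # layer 0 is hurt most by 4-bit weights
    {4: 0.2, 8: 0.05},  # layer 1 tolerates 4-bit weights well
    {4: 0.5, 8: 0.1},
]

best = allocate_bits(sensitivity, params, bit_choices=(4, 8), budget_bits=2400)
print(best[2])  # bit-widths per layer: (8, 4, 8)
```

With these toy numbers the budget is spent on the two sensitive layers, while the large but tolerant middle layer stays at 4 bits. A real search over a ViT would need something cheaper than full enumeration, since the number of configurations grows exponentially with depth.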
