Q8BERT: Quantized 8Bit BERT

14 Oct 2019 · Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement on many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT-2 and Megatron suggests a trend toward ever larger pre-trained Transformer models. However, deploying these large models in production environments is a complex task requiring a large amount of compute, memory, and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the resulting quantized model can accelerate inference when run on hardware that supports 8-bit integer operations.
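The key mechanism is quantization-aware training: during fine-tuning, weights and activations pass through a simulated ("fake") 8-bit quantization step in the forward pass so the model learns to tolerate rounding error, while gradients bypass the non-differentiable rounding via a straight-through estimator. The following is a minimal PyTorch sketch of symmetric linear fake quantization in this spirit; the function name and details are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch of symmetric linear 8-bit fake quantization for
# quantization-aware training. Names and details are assumptions, not the
# authors' code.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize x to signed num_bits integers, then dequantize back to float.

    Applied in the forward pass during fine-tuning so the network adapts to
    8-bit rounding; gradients flow through unchanged via a straight-through
    estimator.
    """
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8 bits
    scale = x.abs().max().clamp(min=1e-8) / qmax        # symmetric per-tensor scale
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    x_q = q * scale
    # Straight-through estimator: forward value is x_q, backward gradient is 1.
    return x + (x_q - x).detach()

# Usage: fake-quantize a weight matrix inside a training step.
w = torch.randn(4, 4, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()   # gradients reach w despite the rounding in the forward pass
```

At inference time, a model trained this way can be exported with true 8-bit integer weights and activations, which is where the memory compression and potential speedup on integer-optimized hardware come from.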


Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | Q8BERT (Zafrir et al., 2019) | Accuracy | 65.0 | #24 |
| Semantic Textual Similarity | MRPC | Q8BERT (Zafrir et al., 2019) | Accuracy | 89.7 | #17 |
| Natural Language Inference | MultiNLI | Q8BERT (Zafrir et al., 2019) | Matched | 85.6 | #27 |
| Natural Language Inference | QNLI | Q8BERT (Zafrir et al., 2019) | Accuracy | 93.0 | #22 |
| Natural Language Inference | RTE | Q8BERT (Zafrir et al., 2019) | Accuracy | 84.8 | #26 |
| Sentiment Analysis | SST-2 Binary classification | Q8BERT (Zafrir et al., 2019) | Accuracy | 94.7 | #31 |
| Semantic Textual Similarity | STS Benchmark | Q8BERT (Zafrir et al., 2019) | Pearson Correlation | 0.911 | #13 |
