Towards Fully 8-bit Integer Inference for the Transformer Model

17 Sep 2020 · Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, Jingbo Zhu

8-bit integer inference, as a promising direction in reducing both the latency and storage of deep neural networks, has made great progress recently. However, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in Transformer), and make heavy use of quantization and de-quantization...
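To illustrate the quantization/de-quantization overhead the abstract refers to, here is a minimal sketch of standard uniform symmetric int8 quantization (not the paper's specific scheme): every crossing between floating-point and integer domains requires a scale computation, a round-and-clip pass, and a reverse mapping, which is exactly the bookkeeping that mixed float/int pipelines pay repeatedly.

```python
import numpy as np

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Map float32 values to the int8 range [-128, 127] (symmetric, zero_point = 0)."""
    q = np.round(x / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * scale

# Hypothetical activations; scale chosen so the max magnitude maps to 127.
x = np.array([0.1, -0.5, 0.9], dtype=np.float32)
scale = float(np.max(np.abs(x))) / 127.0

q = quantize(x, scale)        # int8 tensor usable by integer kernels
x_hat = dequantize(q, scale)  # back to float for ops kept in fp32 (e.g., Softmax)
```

The round trip introduces at most half a quantization step of error per element, but each float-only function kept in the pipeline forces another quantize/de-quantize pair around it.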

