Quantization

1046 papers with code • 10 benchmarks • 18 datasets

Quantization is a promising technique for reducing the computation cost of neural network training: it replaces high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).

Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
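As a concrete illustration of the idea (not tied to any particular paper listed below), the minimal sketch that follows symmetrically quantizes a float32 tensor to int8 with a single scale and dequantizes it back; all function and variable names are illustrative.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float32 values to int8 codes."""
    scale = max(np.abs(w).max(), 1e-8) / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```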

Most implemented papers

Q8BERT: Quantized 8Bit BERT

NervanaSystems/nlp-architect 14 Oct 2019

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks.

ConveRT: Efficient and Accurate Conversational Representations from Transformers

golsun/dialogrpt Findings of the Association for Computational Linguistics 2020

General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train.

Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond

KaidiXu/auto_LiRPA NeurIPS 2020

Linear relaxation based perturbation analysis (LiRPA) for neural networks, which computes provable linear bounds of output neurons given a certain amount of input perturbation, has become a core component in robustness verification and certified defense.
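As a rough illustration of what such bounds look like, the sketch below propagates elementwise interval bounds through a single linear + ReLU layer under an L-infinity input perturbation. This is a crude special case of the linear relaxation the paper generalizes and does not use the auto_LiRPA API; all names are illustrative.

```python
import torch

def interval_bounds_linear_relu(W, b, x, eps):
    """Propagate elementwise lower/upper bounds through relu(W @ x + b)
    for any input within an L-infinity ball of radius eps around x."""
    center = W @ x + b                           # output for the nominal input
    radius = W.abs() @ torch.full_like(x, eps)   # worst-case deviation per output neuron
    lb, ub = center - radius, center + radius
    return torch.relu(lb), torch.relu(ub)        # ReLU is monotone, so bounds carry through

W, b = torch.randn(3, 5), torch.randn(3)
x = torch.randn(5)
lb, ub = interval_bounds_linear_relu(W, b, x, eps=0.1)
print(lb, ub)
```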

TernaryBERT: Distillation-aware Ultra-low Bit BERT

huawei-noah/Pretrained-Language-Model EMNLP 2020

Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are expensive in both computation and memory, hindering their deployment on resource-constrained devices.

Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance

jingtaozhan/JPQ 2 Aug 2021

Compared with previous dense retrieval (DR) models that use brute-force search, JPQ almost matches the best retrieval performance while compressing the index size by 30x.
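For context, the sketch below shows the basic product-quantization encoding that such index compression builds on: each vector is split into subvectors and each subvector is replaced by the index of its nearest sub-codebook centroid. It does not reproduce JPQ's joint training of the encoder and codebooks; all names and sizes are illustrative.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode a vector with product quantization.

    codebooks: array of shape (M, K, d_sub) with K centroids per subspace."""
    M, K, d_sub = codebooks.shape
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * d_sub:(m + 1) * d_sub]
        dists = np.linalg.norm(codebooks[m] - sub, axis=1)
        codes[m] = np.argmin(dists)               # nearest centroid in this subspace
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from its PQ codes."""
    return np.concatenate([codebooks[m, c] for m, c in enumerate(codes)])

rng = np.random.default_rng(0)
d, M, K = 128, 16, 256
codebooks = rng.standard_normal((M, K, d // M)).astype(np.float32)
x = rng.standard_normal(d).astype(np.float32)
codes = pq_encode(x, codebooks)                   # 16 bytes instead of 512 bytes of float32
x_hat = pq_decode(codes, codebooks)
```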

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

vllm-project/vllm 1 Jun 2023

We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights.
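A hedged sketch of the underlying idea follows: weights are scaled per input channel according to observed activation magnitudes before quantization, so salient channels suffer relatively less rounding error, and the scale is undone after dequantization. The scaling rule, bit width, and all names here are illustrative, not the repository's actual search procedure.

```python
import numpy as np

def activation_aware_quantize(W, act_mag, alpha=0.5, n_bits=4):
    """Sketch of activation-aware per-channel scaling before weight quantization.

    W: (out_features, in_features) weight matrix.
    act_mag: per-input-channel mean |activation|, used as a saliency proxy."""
    s = np.power(act_mag + 1e-8, alpha)           # per-input-channel scale (illustrative rule)
    W_scaled = W * s                              # boost salient channels before rounding
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(W_scaled).max(axis=1, keepdims=True) / qmax   # per-output-row step size
    W_q = np.clip(np.round(W_scaled / step), -qmax, qmax)
    W_hat = W_q * step / s                        # dequantize and undo the channel scale
    return W_hat, s

W = np.random.randn(8, 16).astype(np.float32)
act_mag = np.abs(np.random.randn(16)).astype(np.float32)
W_hat, s = activation_aware_quantize(W, act_mag)
print("mean abs error:", np.abs(W - W_hat).mean())
```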

End-to-end Learning of Deep Visual Representations for Image Retrieval

almazan/deep-image-retrieval 25 Oct 2016

Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it.

Trained Ternary Quantization

tensorpack/tensorpack 4 Dec 2016

To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values.
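The sketch below shows threshold-based ternarization in its simplest form; TTQ additionally learns the positive and negative scaling factors by backpropagation, which is omitted here, and the threshold fraction is an illustrative choice.

```python
import numpy as np

def ternarize(w, t=0.05):
    """Map full-precision weights to {-w_n, 0, +w_p} using a magnitude threshold.

    TTQ learns w_p and w_n during training; here they are simply set to the
    mean magnitude of the weights they replace (an illustrative shortcut)."""
    delta = t * np.abs(w).max()                   # threshold relative to the largest weight
    pos, neg = w > delta, w < -delta
    w_p = np.abs(w[pos]).mean() if pos.any() else 0.0
    w_n = np.abs(w[neg]).mean() if neg.any() else 0.0
    return np.where(pos, w_p, np.where(neg, -w_n, 0.0)).astype(np.float32)

w = np.random.randn(256).astype(np.float32)
w_t = ternarize(w)
print(np.unique(w_t))                             # at most three distinct values
```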

Quantizing deep convolutional networks for efficient inference: A whitepaper

KwangHoonAn/Quantizations 21 Jun 2018

Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision post-training produces classification accuracies within 2% of floating-point networks for a wide variety of CNN architectures.
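The sketch below contrasts the two granularities described above, assuming symmetric quantization: one scale per output channel for weights versus a single scale for an entire activation tensor. Names and shapes are illustrative.

```python
import numpy as np

def quantize_weights_per_channel(W, n_bits=8):
    """Symmetric quantization with one scale per output channel (matrix row)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_activations_per_tensor(x, n_bits=8):
    """Symmetric quantization with a single scale shared by the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.abs(x).max(), 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(64, 128).astype(np.float32)   # weights of a hypothetical layer
x = np.random.randn(32, 128).astype(np.float32)   # a batch of activations
W_q, w_scales = quantize_weights_per_channel(W)
x_q, x_scale = quantize_activations_per_tensor(x)
```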

Fast Adjustable Threshold For Uniform Neural Network Quantization (Winning solution of LPIRC-II)

NervanaSystems/distiller 19 Dec 2018

Quantization can be performed without fine-tuning using a calibration procedure (calculating the parameters necessary for quantization), or the network can be trained with quantization from scratch.
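The sketch below illustrates the calibration idea in its simplest form: pick an activation clipping threshold from a small calibration set, then quantize with it. The percentile rule is an illustrative stand-in for the paper's threshold-adjustment procedure, and all names are hypothetical.

```python
import numpy as np

def calibrate_threshold(calib_activations, percentile=99.9):
    """Pick an activation clipping threshold from a small calibration set.

    The paper further adjusts thresholds to reduce quantization error;
    this sketch simply takes a high percentile of observed magnitudes."""
    flat = np.concatenate([a.ravel() for a in calib_activations])
    return float(np.percentile(np.abs(flat), percentile))

def quantize_activations_uint8(x, threshold):
    """Uniform 8-bit quantization of non-negative activations clipped to [0, threshold]."""
    scale = threshold / 255.0
    return np.clip(np.round(x / scale), 0, 255).astype(np.uint8), scale

calib = [np.abs(np.random.randn(16, 64)).astype(np.float32) for _ in range(8)]  # e.g. post-ReLU
threshold = calibrate_threshold(calib)
x_q, scale = quantize_activations_uint8(calib[0], threshold)
```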