Quantization
1046 papers with code • 10 benchmarks • 18 datasets
Quantization is a promising technique for reducing the computation cost of neural network training, replacing high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).
Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
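In its simplest (uniform affine) form, quantization maps a float32 tensor onto int8 codes through a scale and zero-point. The sketch below, with hypothetical helper names, illustrates that general idea rather than any specific paper's scheme:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine quantization of a float32 array to int8 (a sketch)."""
    # The scale maps the observed float range onto the 256 int8 levels.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = np.round(-128 - x_min / scale).astype(np.int32)
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_int8(x)
print("max abs error:", np.abs(dequantize(q, s, z) - x).max())
```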
Libraries
Use these libraries to find Quantization models and implementations.
Most implemented papers
Q8BERT: Quantized 8Bit BERT
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement on many Natural Language Processing (NLP) tasks.
ConveRT: Efficient and Accurate Conversational Representations from Transformers
General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train.
Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond
Linear relaxation based perturbation analysis (LiRPA) for neural networks, which computes provable linear bounds of output neurons given a certain amount of input perturbation, has become a core component in robustness verification and certified defense.
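LiRPA itself propagates linear bounds through a whole network; as a much simpler illustration of the same goal, the interval-arithmetic sketch below computes provable (though looser) output bounds for a single linear layer under a bounded input perturbation:

```python
import numpy as np

def interval_bounds(W, b, lower_in, upper_in):
    """Provable output bounds of a linear layer y = W @ x + b when every
    input coordinate lies in [lower_in, upper_in]. This is plain interval
    arithmetic, a looser relaxation than LiRPA's linear bounds."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    lower = W_pos @ lower_in + W_neg @ upper_in + b
    upper = W_pos @ upper_in + W_neg @ lower_in + b
    return lower, upper

W = np.random.randn(3, 4)
b = np.zeros(3)
x = np.random.randn(4)
eps = 0.1  # L-infinity perturbation budget on the input
lo, hi = interval_bounds(W, b, x - eps, x + eps)
assert np.all(lo <= W @ x + b) and np.all(W @ x + b <= hi)
```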
TernaryBERT: Distillation-aware Ultra-low Bit BERT
Transformer-based pre-trained models like BERT have achieved remarkable performance on many natural language processing tasks. However, these models are expensive in both computation and memory, hindering their deployment on resource-constrained devices.
Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance
Compared with previous dense retrieval (DR) models that use brute-force search, JPQ almost matches the best retrieval performance while compressing the index size by 30x.
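JPQ's joint training of the encoder and the index is more involved, but the product quantization component that yields the compression can be sketched with plain per-subspace k-means; all names below are illustrative:

```python
import numpy as np

def train_pq(X, M=4, K=256, iters=10, seed=0):
    """Minimal product quantization: one k-means codebook per subspace."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    d = D // M  # subvector dimension
    codebooks = np.empty((M, K, d), dtype=np.float32)
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        centers = sub[rng.choice(N, K, replace=False)]
        for _ in range(iters):  # plain Lloyd iterations
            assign = ((sub[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            for k in range(K):
                pts = sub[assign == k]
                if len(pts):
                    centers[k] = pts.mean(0)
        codebooks[m] = centers
    return codebooks

def encode_pq(X, codebooks):
    """Compress each vector to M one-byte codes (here 4 bytes per vector)."""
    M, K, d = codebooks.shape
    codes = np.empty((X.shape[0], M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        codes[:, m] = ((sub[:, None, :] - codebooks[m][None]) ** 2).sum(-1).argmin(1)
    return codes

X = np.random.randn(2000, 64).astype(np.float32)
cb = train_pq(X)
codes = encode_pq(X, cb)  # 64 float32 (256 bytes) -> 4 bytes: 64x compression
```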
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations rather than the weights.
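A rough sketch of that activation-aware idea, not the paper's exact algorithm, is to grid-search a per-input-channel scale derived from average activation magnitude and keep whichever scale minimizes the quantized layer's output error:

```python
import numpy as np

def quantize_sym(w, n_bits=4):
    """Symmetric uniform quantization with one scale per output row."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + 1e-8
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def awq_style_scale(W, X, grid=20):
    """Grid-search a per-input-channel scale s = act_mag ** alpha that
    minimizes the layer's output error after 4-bit weight quantization.
    A sketch of the activation-aware idea, not the paper's full method."""
    act_mag = np.abs(X).mean(axis=0) + 1e-8  # per-channel activation salience
    best_err, best_s = np.inf, np.ones_like(act_mag)
    for alpha in np.linspace(0.0, 1.0, grid):
        s = act_mag ** alpha
        # Scaling W up by s (and folding 1/s back afterwards) leaves the
        # float output unchanged but shrinks the relative quantization
        # error on channels with large activations.
        Wq = quantize_sym(W * s) / s
        err = ((X @ W.T - X @ Wq.T) ** 2).mean()
        if err < best_err:
            best_err, best_s = err, s
    return best_s

X = np.random.randn(128, 64).astype(np.float32)  # sample activations
W = np.random.randn(32, 64).astype(np.float32)   # linear layer weight
s = awq_style_scale(W, X)
```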
End-to-end Learning of Deep Visual Representations for Image Retrieval
We build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it.
Trained Ternary Quantization
We propose Trained Ternary Quantization (TTQ), a method that reduces the precision of neural network weights to ternary values.
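The ternarization step can be sketched as thresholding weights into {-Wn, 0, +Wp}; in TTQ the two scaling factors are learned during training, whereas the sketch below fixes them as constants:

```python
import numpy as np

def ternarize(w, t=0.05, wp=1.0, wn=1.0):
    """Threshold full-precision weights into the three values {-wn, 0, +wp}.
    TTQ learns wp and wn by backpropagation; here they are fixed constants
    and the threshold is a fraction t of max|w|, so this is only a sketch."""
    thresh = t * np.abs(w).max()
    tern = np.zeros_like(w)
    tern[w > thresh] = wp
    tern[w < -thresh] = -wn
    return tern

w = np.random.randn(256, 256).astype(np.float32)
wt = ternarize(w)
# Each weight now needs ~2 bits instead of 32, and multiplications
# reduce to sign selection plus a single scale per sign.
print("nonzero fraction:", (wt != 0).mean())
```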
Quantizing deep convolutional networks for efficient inference: A whitepaper
Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision, applied post-training, produces classification accuracies within 2% of floating-point networks for a wide variety of CNN architectures.
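A minimal sketch of that recipe, with illustrative helper names: one symmetric scale per output channel for the weights, and a single scale for the whole activation tensor:

```python
import numpy as np

def quant_weights_per_channel(W, n_bits=8):
    """Symmetric quantization with one scale per output channel (axis 0)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W.reshape(W.shape[0], -1)).max(axis=1) / qmax + 1e-8
    scale = scale.reshape((-1,) + (1,) * (W.ndim - 1))
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quant_acts_per_layer(x, n_bits=8):
    """Symmetric quantization with a single scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax + 1e-8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(16, 3, 3, 3).astype(np.float32)   # conv weight (out, in, kh, kw)
x = np.random.randn(8, 3, 32, 32).astype(np.float32)  # activation tensor
Wq, w_scales = quant_weights_per_channel(W)
xq, a_scale = quant_acts_per_layer(x)
```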
Fast Adjustable Threshold For Uniform Neural Network Quantization (Winning solution of LPIRC-II)
Quantization can be performed without fine-tuning via a calibration procedure (computing the parameters needed for quantization), or the network can be trained with quantization from scratch.
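A calibration pass of this kind can be sketched as running a few batches through the network and recording the activation range per layer; layer_fn and the batch format below are placeholders for illustration:

```python
import numpy as np

def calibrate_threshold(layer_fn, calib_batches):
    """Estimate a clipping threshold for one layer from calibration data,
    with no fine-tuning: run a few batches and track max |activation|."""
    t = 0.0
    for batch in calib_batches:
        acts = layer_fn(batch)
        t = max(t, float(np.abs(acts).max()))
    return t

def quantize_uniform(x, t, n_bits=8):
    """Uniform symmetric quantization using the calibrated threshold t."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

batches = [np.random.randn(32, 16).astype(np.float32) for _ in range(8)]
t = calibrate_threshold(lambda b: b, batches)  # identity "layer" for the demo
x_q = quantize_uniform(batches[0], t)
```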