Quantization
1046 papers with code • 10 benchmarks • 18 datasets
Quantization is a promising technique for reducing the computation cost of neural network training, replacing high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).
Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
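In its simplest (uniform affine) form, quantization maps a float32 tensor onto int8 codes through a scale and zero-point. The sketch below, with hypothetical helper names, illustrates that general idea rather than any specific paper's scheme:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine quantization of a float32 array to int8 (a sketch)."""
    # The scale maps the observed float range onto the 256 int8 levels.
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = np.round(-128 - x_min / scale).astype(np.int32)
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_int8(x)
print("max abs error:", np.abs(dequantize(q, s, z) - x).max())
```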
Libraries
Use these libraries to find Quantization models and implementations.
Most implemented papers
Q8BERT: Quantized 8Bit BERT
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement on many Natural Language Processing (NLP) tasks.
ConveRT: Efficient and Accurate Conversational Representations from Transformers
General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train.
Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond
Linear relaxation based perturbation analysis (LiRPA) for neural networks, which computes provable linear bounds of output neurons given a certain amount of input perturbation, has become a core component in robustness verification and certified defense.
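LiRPA itself propagates linear bounds through a whole network; as a much simpler illustration of the same goal, the interval-arithmetic sketch below computes provable (though looser) output bounds for a single linear layer under a bounded input perturbation:

```python
import numpy as np

def interval_bounds(W, b, lower_in, upper_in):
    """Provable output bounds of a linear layer y = W @ x + b when every
    input coordinate lies in [lower_in, upper_in]. This is plain interval
    arithmetic, a looser relaxation than LiRPA's linear bounds."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    lower = W_pos @ lower_in + W_neg @ upper_in + b
    upper = W_pos @ upper_in + W_neg @ lower_in + b
    return lower, upper

W = np.random.randn(3, 4)
b = np.zeros(3)
x = np.random.randn(4)
eps = 0.1  # L-infinity perturbation budget on the input
lo, hi = interval_bounds(W, b, x - eps, x + eps)
assert np.all(lo <= W @ x + b) and np.all(W @ x + b <= hi)
```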
TernaryBERT: Distillation-aware Ultra-low Bit BERT
Transformer-based pre-trained models like BERT have achieved remarkable performance on many natural language processing tasks. However, these models are expensive in both computation and memory, hindering their deployment on resource-constrained devices.
Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance
Compared with previous dense retrieval (DR) models that use brute-force search, JPQ almost matches the best retrieval performance while compressing the index size by 30x.
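JPQ's joint training of the encoder and the index is more involved, but the product quantization component that yields the compression can be sketched with plain per-subspace k-means; all names below are illustrative:

```python
import numpy as np

def train_pq(X, M=4, K=256, iters=10, seed=0):
    """Minimal product quantization: one k-means codebook per subspace."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    d = D // M  # subvector dimension
    codebooks = np.empty((M, K, d), dtype=np.float32)
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        centers = sub[rng.choice(N, K, replace=False)]
        for _ in range(iters):  # plain Lloyd iterations
            assign = ((sub[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
            for k in range(K):
                pts = sub[assign == k]
                if len(pts):
                    centers[k] = pts.mean(0)
        codebooks[m] = centers
    return codebooks

def encode_pq(X, codebooks):
    """Compress each vector to M one-byte codes (here 4 bytes per vector)."""
    M, K, d = codebooks.shape
    codes = np.empty((X.shape[0], M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        codes[:, m] = ((sub[:, None, :] - codebooks[m][None]) ** 2).sum(-1).argmin(1)
    return codes

X = np.random.randn(2000, 64).astype(np.float32)
cb = train_pq(X)
codes = encode_pq(X, cb)  # 64 float32 (256 bytes) -> 4 bytes: 64x compression
```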
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations rather than the weights.
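A rough sketch of that activation-aware idea, not the paper's exact algorithm, is to grid-search a per-input-channel scale derived from average activation magnitude and keep whichever scale minimizes the quantized layer's output error:

```python
import numpy as np

def quantize_sym(w, n_bits=4):
    """Symmetric uniform quantization with one scale per output row."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + 1e-8
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def awq_style_scale(W, X, grid=20):
    """Grid-search a per-input-channel scale s = act_mag ** alpha that
    minimizes the layer's output error after 4-bit weight quantization.
    A sketch of the activation-aware idea, not the paper's full method."""
    act_mag = np.abs(X).mean(axis=0) + 1e-8  # per-channel activation salience
    best_err, best_s = np.inf, np.ones_like(act_mag)
    for alpha in np.linspace(0.0, 1.0, grid):
        s = act_mag ** alpha
        # Scaling W up by s (and folding 1/s back afterwards) leaves the
        # float output unchanged but shrinks the relative quantization
        # error on channels with large activations.
        Wq = quantize_sym(W * s) / s
        err = ((X @ W.T - X @ Wq.T) ** 2).mean()
        if err < best_err:
            best_err, best_s = err, s
    return best_s

X = np.random.randn(128, 64).astype(np.float32)  # sample activations
W = np.random.randn(32, 64).astype(np.float32)   # linear layer weight
s = awq_style_scale(W, X)
```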
End-to-end Learning of Deep Visual Representations for Image Retrieval
We build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it.
Trained Ternary Quantization
We propose Trained Ternary Quantization (TTQ), a method that reduces the precision of neural network weights to ternary values.
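The ternarization step can be sketched as thresholding weights into {-Wn, 0, +Wp}; in TTQ the two scaling factors are learned during training, whereas the sketch below fixes them as constants:

```python
import numpy as np

def ternarize(w, t=0.05, wp=1.0, wn=1.0):
    """Threshold full-precision weights into the three values {-wn, 0, +wp}.
    TTQ learns wp and wn by backpropagation; here they are fixed constants
    and the threshold is a fraction t of max|w|, so this is only a sketch."""
    thresh = t * np.abs(w).max()
    tern = np.zeros_like(w)
    tern[w > thresh] = wp
    tern[w < -thresh] = -wn
    return tern

w = np.random.randn(256, 256).astype(np.float32)
wt = ternarize(w)
# Each weight now needs ~2 bits instead of 32, and multiplications
# reduce to sign selection plus a single scale per sign.
print("nonzero fraction:", (wt != 0).mean())
```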
Quantizing deep convolutional networks for efficient inference: A whitepaper
Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision, applied post-training, produces classification accuracies within 2% of floating-point networks for a wide variety of CNN architectures.
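A minimal sketch of that recipe, with illustrative helper names: one symmetric scale per output channel for the weights, and a single scale for the whole activation tensor:

```python
import numpy as np

def quant_weights_per_channel(W, n_bits=8):
    """Symmetric quantization with one scale per output channel (axis 0)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W.reshape(W.shape[0], -1)).max(axis=1) / qmax + 1e-8
    scale = scale.reshape((-1,) + (1,) * (W.ndim - 1))
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quant_acts_per_layer(x, n_bits=8):
    """Symmetric quantization with a single scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax + 1e-8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(16, 3, 3, 3).astype(np.float32)   # conv weight (out, in, kh, kw)
x = np.random.randn(8, 3, 32, 32).astype(np.float32)  # activation tensor
Wq, w_scales = quant_weights_per_channel(W)
xq, a_scale = quant_acts_per_layer(x)
```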
Fast Adjustable Threshold For Uniform Neural Network Quantization (Winning solution of LPIRC-II)
Quantization can be performed without fine-tuning via a calibration procedure (computing the parameters needed for quantization), or the network can be trained with quantization from scratch.
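A calibration pass of this kind can be sketched as running a few batches through the network and recording the activation range per layer; layer_fn and the batch format below are placeholders for illustration:

```python
import numpy as np

def calibrate_threshold(layer_fn, calib_batches):
    """Estimate a clipping threshold for one layer from calibration data,
    with no fine-tuning: run a few batches and track max |activation|."""
    t = 0.0
    for batch in calib_batches:
        acts = layer_fn(batch)
        t = max(t, float(np.abs(acts).max()))
    return t

def quantize_uniform(x, t, n_bits=8):
    """Uniform symmetric quantization using the calibrated threshold t."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

batches = [np.random.randn(32, 16).astype(np.float32) for _ in range(8)]
t = calibrate_threshold(lambda b: b, batches)  # identity "layer" for the demo
x_q = quantize_uniform(batches[0], t)
```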