Quantization
1039 papers with code • 10 benchmarks • 18 datasets
Quantization is a promising technique for reducing the computational cost of neural network training: it replaces high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16).
Source: Adaptive Precision Training: Quantify Back Propagation in Neural Networks with Fixed-point Numbers
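To make the idea concrete, below is a minimal NumPy sketch of symmetric per-tensor int8 quantization; the function names and the per-tensor scheme are illustrative assumptions, not the method of the cited paper.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float32 array to int8."""
    scale = max(float(np.max(np.abs(x))) / 127.0, 1e-12)   # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)               # toy weight tensor
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                  # error is bounded by about scale/2
```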
Latest papers
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
This exploration can reveal new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation commonly suffered in LLM compression.
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts
Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by using independently configurable attention-layer LoRA adapters, supporting LoRA and its variants for constructing experts, and applying an auxiliary load-balancing loss to address the router's imbalance problem.
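The sentence above mentions two generic ingredients: LoRA adapters acting as experts behind a router, and an auxiliary load-balancing loss. A minimal NumPy sketch of those two pieces follows, assuming a Switch-Transformer-style balance loss and made-up dimensions; it does not reproduce MixLoRA's actual architecture or its independently configurable attention-layer adapters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts, top_k = 64, 8, 4, 2             # hidden size, LoRA rank, experts, routed experts

W0 = rng.normal(size=(d, d)) * 0.02              # frozen base weight
A = rng.normal(size=(n_experts, r, d)) * 0.02    # LoRA "down" matrices, one per expert
B = np.zeros((n_experts, d, r))                  # LoRA "up" matrices (zero-initialized)
Wg = rng.normal(size=(d, n_experts)) * 0.02      # router weights

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_lora_forward(x):
    """x: (batch, d). Route each token to top-k LoRA experts on top of the frozen W0."""
    gate = softmax(x @ Wg)                        # (batch, n_experts) routing probabilities
    topk = np.argsort(-gate, axis=1)[:, :top_k]   # indices of the selected experts
    y = x @ W0.T
    for b in range(x.shape[0]):
        for e in topk[b]:
            y[b] += gate[b, e] * (B[e] @ (A[e] @ x[b]))
    # Auxiliary load-balance loss (Switch-Transformer style): penalize uneven expert usage.
    frac_tokens = np.bincount(topk.ravel(), minlength=n_experts) / topk.size
    frac_prob = gate.mean(axis=0)
    aux_loss = n_experts * float(np.dot(frac_tokens, frac_prob))
    return y, aux_loss

y, aux = moe_lora_forward(rng.normal(size=(3, d)))
print(y.shape, aux)
```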
MAexp: A Generic Platform for RL-based Multi-Agent Exploration
The sim-to-real gap poses a significant challenge in RL-based multi-agent exploration due to scene quantization and action discretization.
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points
However, existing quantization schemes suffer significant accuracy degradation at very low bit-widths or incur additional computational overhead when deployed, making them difficult to apply to large-scale industrial applications.
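For reference, a minimal sketch of the underlying decomposition is shown below: a weight tensor is split into an integer grid and a floating-point scale/zero-point, as in standard min/max uniform quantization. decoupleQ's actual optimization of the integer and floating-point parts is not reproduced; function names are illustrative.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, bits: int = 2):
    """Decompose w into an integer grid q (2**bits levels) and floating-point (scale, zero).

    Reconstruction: w_hat = scale * (q - zero). This is plain min/max uniform
    quantization, not decoupleQ's optimization procedure.
    """
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, levels).astype(np.int64)
    return q, scale, zero

w = np.random.randn(256).astype(np.float32)
q, scale, zero = uniform_quantize(w, bits=2)
w_hat = scale * (q - zero)
print("levels used:", np.unique(q), "MSE:", float(np.mean((w - w_hat) ** 2)))
```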
Variational quantization for state space models
The main challenge is to model a rich variety of time series, leverage any available external signals and provide sharp predictions with statistical guarantees.
Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators
Energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors, including the weight quantization strategy (i.e., data types and bit-widths) and the mapping (i.e., placement and scheduling of DNN elementary operations on the hardware units of the accelerator).
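The bit-width side of that design space is easy to illustrate with a back-of-the-envelope weight-memory calculation; the layer shape below is a made-up example, and the mapping (placement/scheduling) factors are not modeled.

```python
# Weight memory footprint of a conv layer for different bit-width choices.
out_ch, in_ch, k = 64, 64, 3
n_weights = out_ch * in_ch * k * k

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {n_weights * bits / 8 / 1024:.1f} KiB")
```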
David and Goliath: An Empirical Evaluation of Attacks and Defenses for QNNs at the Deep Edge
To fill this gap, we empirically evaluate the effectiveness of attacks and defenses from (full-precision) ANNs on (constrained) QNNs.
BinaryDM: Towards Accurate Binarization of Diffusion Model
With the advancement of diffusion models (DMs) and their substantially increased computational requirements, quantization emerges as a practical solution for obtaining compact and efficient low-bit DMs.
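As a point of reference, the classic XNOR-Net-style weight binarization (the sign of the weights plus a per-channel scale) can be sketched as follows; BinaryDM's diffusion-specific techniques are not reproduced here.

```python
import numpy as np

def binarize(w: np.ndarray):
    """1-bit weights: sign(w) with a per-output-channel scale alpha = mean(|w|)."""
    alpha = np.mean(np.abs(w), axis=1, keepdims=True)    # (out_channels, 1)
    b = np.where(w >= 0, 1.0, -1.0)
    return b, alpha

w = np.random.randn(8, 32).astype(np.float32)            # (out_channels, in_features)
b, alpha = binarize(w)
w_hat = alpha * b                                        # dequantized approximation
print("unique values:", np.unique(b), "MSE:", float(np.mean((w - w_hat) ** 2)))
```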
Have You Merged My Model? On The Robustness of Large Language Model IP Protection Methods Against Model Merging
Model merging is a promising lightweight model empowerment technique that does not rely on expensive computing devices (e.g., GPUs) or require the collection of specific training data.
Weakly Supervised Deep Hyperspherical Quantization for Image Retrieval
Deep quantization methods have shown high efficiency on large-scale image retrieval.
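A generic sketch of the underlying idea, quantizing L2-normalized embeddings to their most similar codewords by cosine similarity, is shown below; the codebook here is random, whereas the cited method learns it with weak supervision.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_codes = 16, 8

# Codebook of unit-norm codewords on the hypersphere (randomly initialized here).
C = rng.normal(size=(n_codes, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)

def quantize_on_sphere(x: np.ndarray) -> np.ndarray:
    """L2-normalize embeddings and assign each to its most similar codeword."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return np.argmax(x @ C.T, axis=1)            # cosine similarity of unit vectors

codes = quantize_on_sphere(rng.normal(size=(5, d)))
print(codes)                                     # each embedding stored as a small integer code
```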