Sungmin Woo
June 10, 2025
Hello, this is Sungmin Woo from the Business Development Team at ENERZAi. As I briefly mentioned in the previous post, AI models are rapidly making their way into a wide range of devices — cars, smartphones, robots, and home appliances. Accordingly, interest is growing in the techniques required to run high-performance AI models on-device at the edge.
Since edge devices have far fewer hardware resources than servers, we need model-compression techniques that make large, high-performance AI models lighter and faster without sacrificing accuracy. In this post, we’ll dive into the most prominent of these techniques: Quantization.
What Is Quantization?
In deep learning, quantization reduces memory usage and computation by representing a model’s weights or activations with a lower bit-width. Most ML frameworks — PyTorch, TensorFlow, and others — store numbers as 32-bit floating-point values (FP32). While FP32 offers a wide dynamic range and high precision, it also consumes significant memory and increases compute demand. To cut training time and cost, many teams now train in FP16 or use mixed-precision training (FP16 + FP32). For more details on floating-point formats, please refer to the link below.
Optimium 101 (4) - Mixed Precision Inference
Quantization often means converting FP32 values to INT8. In theory this alone cuts memory by 4x, though the real-world gain is usually smaller because you may quantize only the weights and must store extra values such as scale factors and clipping ranges. Beyond memory savings, quantization delivers major boosts in inference speed and energy efficiency — especially because most NPUs are designed for INT8 or lower. Research into INT4 and INT2 quantization is now active, driven by the need to deploy large language models (LLMs) efficiently.

Source: Advances in the Neural Network Quantization: A Comprehensive Review
The trade-off: a narrower numerical range inevitably introduces some accuracy loss. Countless quantization methods aim to minimize that loss while reaping the speed-and-size benefits.
Quantization Methodologies
Choosing how many bits to quantize to is only half the story. Selecting the right method for a given model architecture and data distribution is just as important. Below is a concise overview of the major approaches.
Uniform vs Non-uniform

Source: Integer Quantization for Deep Learning Inference
Uniform Quantization maps real numbers to integers at a fixed step size. A single scale factor is all that’s needed, making implementation straightforward on common CPU/GPU targets. It’s also called linear quantization.
Non-uniform Quantization adjusts the step size based on the data distribution, allowing higher precision where values cluster densely. A classic example is log-scale quantization, ideal when most values lie near zero. The downside: you need LUTs or custom logic, which complicates deployment.

Source: Weighted-Entropy-based Quantization for Deep Neural Network
Symmetric vs Asymmetric
Uniform quantization typically uses an affine mapping:
x_q = round(x/S + Z)
x_q: the quantized integer corresponding to the real number x
S: scale factor (e.g., mapping real [2, 6] → int [0, 255] gives S = (6 − 2)/255 ≈ 0.0157)
Z: zero-point, the integer that represents real zero
If both the real and integer ranges are centered on zero (e.g., real [−5, 5] → int [−127, 127]), we get symmetric quantization (Z = 0). Otherwise, the scheme is asymmetric. Symmetric quantization is simpler and faster but slightly less flexible than asymmetric.

Source: A Survey of Quantization Methods for Efficient Neural Network Inference
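To make the mapping concrete, here is a minimal NumPy sketch of asymmetric (affine) quantization to unsigned 8-bit integers and back; the function names and the [2, 6] range are only illustrative.

```python
import numpy as np

def affine_quantize(x, x_min, x_max, n_bits=8):
    """Map real values in [x_min, x_max] to unsigned integers with n_bits."""
    q_max = 2 ** n_bits - 1                 # 255 for 8 bits
    S = (x_max - x_min) / q_max             # scale: real units per integer step
    Z = round(-x_min / S)                   # zero-point: the integer that represents real 0
    x_q = np.clip(np.round(x / S + Z), 0, q_max).astype(np.uint8)
    return x_q, S, Z

def affine_dequantize(x_q, S, Z):
    """Approximate reconstruction of the real values."""
    return S * (x_q.astype(np.float32) - Z)

x = np.array([2.0, 3.5, 6.0], dtype=np.float32)
x_q, S, Z = affine_quantize(x, x_min=2.0, x_max=6.0)  # S ≈ 0.0157, as in the example above
x_hat = affine_dequantize(x_q, S, Z)                  # close to x, up to rounding error
```

With a zero-centered range such as [−5, 5] mapped to [−127, 127], Z becomes 0 and the same code reduces to the symmetric case.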
Granularity
Granularity refers to how finely we assign quantization parameters:
Per-tensor: one scale/zero-point for the entire tensor
Per-channel (or per-group): different parameters for each channel or group
Finer granularity preserves accuracy better but reduces the memory and latency savings, because more quantization parameters must be stored and computed.

Source: Hugging Face
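As a rough illustration of the trade-off (the shapes and values below are made up), the following sketch computes symmetric INT8 scales for a small weight matrix once per tensor and once per output channel and compares the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three output channels (rows) with very different magnitudes
W = np.stack([
    0.01 * rng.normal(size=4),
    1.00 * rng.normal(size=4),
    0.10 * rng.normal(size=4),
]).astype(np.float32)

# Per-tensor: a single symmetric INT8 scale for the whole matrix
scale_tensor = np.abs(W).max() / 127.0
W_q_tensor = np.round(W / scale_tensor).astype(np.int8)

# Per-channel: one scale per output channel (row)
scale_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q_channel = np.round(W / scale_channel).astype(np.int8)

# The small-magnitude channel keeps far more resolution with per-channel scales
err_tensor = np.abs(W - W_q_tensor * scale_tensor).mean()
err_channel = np.abs(W - W_q_channel * scale_channel).mean()
print(err_tensor, err_channel)  # the per-channel error is clearly smaller here
```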
PTQ vs QAT
Another important factor to consider is whether to account for quantization during the training stage. Quantizing a pre-trained model without any additional training is called Post-Training Quantization (PTQ), whereas incorporating the errors introduced by quantization into the training process itself is called Quantization-Aware Training (QAT).
PTQ requires no extra training, so quantization is quick and efficient. Depending on when the quantization parameters are determined, PTQ is further divided into dynamic quantization and static quantization. The process of determining the clipping range of the incoming real-valued data in order to set the scale factor is called calibration. For weights, calibration can be done once at quantization time without issue, but activation values vary with the input data, making it difficult to derive an accurate range before runtime.
In dynamic quantization, the quantization parameters for activations are computed at runtime from the actual input data. This preserves accuracy well when activations vary widely with the input, but the extra computation needed to obtain the quantization parameters can slow down inference.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
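For reference, PyTorch exposes dynamic quantization as a one-line call; the tiny model below is only a placeholder to show the usage, not a tuned recipe.

```python
import torch
import torch.nn as nn

# A placeholder FP32 model; in practice this would be your trained network
model_fp32 = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Weights of the listed module types are converted to INT8 ahead of time;
# activation quantization parameters are computed on the fly from each input at runtime.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    y = model_int8(x)
```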
In static quantization, a calibration dataset whose distribution resembles the inputs the model will see at inference time is passed through the model beforehand to determine activation ranges, and the quantization parameters derived from those ranges are then fixed. This adds no computational overhead during inference, but it requires a separate calibration dataset, and if the activation distribution is mis-estimated, accuracy can drop sharply.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
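As a sketch of that flow using PyTorch's eager-mode API (the model, the random calibration batches, and the "fbgemm" backend choice are placeholders), static quantization boils down to prepare, calibrate, convert:

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Quant/DeQuant stubs mark where tensors enter and leave the INT8 domain
        self.quant = torch.ao.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")  # x86 backend

# 1) Insert observers that will record activation ranges
prepared = torch.ao.quantization.prepare(model)

# 2) Calibration: run representative (placeholder) data through the model; no labels needed
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(8, 128))

# 3) Convert: freeze the observed ranges into fixed scales/zero-points and swap in INT8 kernels
model_int8 = torch.ao.quantization.convert(prepared)
```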
💡 Calibration techniques for determining the clipping range
1. Min-Max: Set the range using the smallest and largest values in the input data. Easiest to implement, but sensitive to outliers
2. Percentile: Exclude a certain percentage of the highest and lowest values in the input data (e.g., use only the 0.1%–99.9% range)
3. Entropy (KL-divergence): Set the range to minimize the difference (KL divergence) between FP32 values and their quantized counterparts
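A rough sketch of how the first two rules could be computed from collected activation samples (plain NumPy, illustrative names; the entropy method needs a histogram search over candidate ranges and is omitted here):

```python
import numpy as np

def minmax_range(samples):
    """Clipping range from the raw min and max (sensitive to outliers)."""
    return float(samples.min()), float(samples.max())

def percentile_range(samples, lower=0.1, upper=99.9):
    """Clipping range that drops the most extreme 0.1% on each side."""
    lo, hi = np.percentile(samples, [lower, upper])
    return float(lo), float(hi)

# Activations with a few large outliers
acts = np.concatenate([np.random.normal(0, 1, 100_000), [35.0, -40.0]])

print(minmax_range(acts))      # ≈ (-40.0, 35.0): the outliers blow up the range
print(percentile_range(acts))  # ≈ (-3.1, 3.1): resolution is spent on the bulk of the data
```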
QAT uses a method called fake quantization to simulate the errors introduced by quantization during the training phase of the AI model.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Fake quantization virtually quantizes the weights and activations so that the resulting error is reflected in the loss function; because training minimizes this error, accuracy is better preserved than with PTQ. Note that this is purely "virtual" quantization: during training, weights and activations stay in FP32 (each quantize step is immediately followed by a dequantize step), so gradients can still flow through back-propagation, typically via a straight-through estimator, and the FP32 weights keep getting updated.
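The core trick can be written in a few lines of PyTorch: quantize and immediately dequantize in the forward pass, and pass gradients straight through the rounding step in the backward pass. This is a generic straight-through-estimator formulation, not any particular framework's built-in fake-quant module, and the class name is illustrative.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated (fake) INT8 quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize to the symmetric INT8 grid, then immediately dequantize back to FP32,
        # so the rest of the network sees FP32 values that carry the quantization error.
        x_q = torch.clamp(torch.round(x / scale), -127, 127)
        return x_q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # round() has zero gradient almost everywhere, so pretend it is the identity:
        # gradients flow through unchanged and the FP32 weights keep getting updated.
        return grad_output, None

w = torch.randn(256, 256, requires_grad=True)
scale = w.detach().abs().max() / 127.0
w_fake_quant = FakeQuantSTE.apply(w, scale)  # used in place of w in the forward pass during QAT
loss = w_fake_quant.sum()
loss.backward()                              # w.grad is all ones thanks to the STE
```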
Another characteristic of QAT is that, when searching for the point that minimizes the loss function, training tends to converge to wide, flat minima.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
If training is conducted without considering quantization, the weights are likely to end up in a narrow (sharp) minimum, where small weight changes cause large changes in the loss. If quantization is then applied, shifting the weights sharply increases the loss and produces a large quantization error. In contrast, if a wide minimum is found, the loss increases only slightly when the weights are perturbed, minimizing the quantization error.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
Therefore, to minimize accuracy loss during quantization, QAT is generally more advantageous than PTQ, and this advantage becomes even greater at lower bit-widths. However, QAT has the drawback of being very complex, time-consuming, and computationally intensive.
Emerging trend: Low-bit Quantization
Traditionally, quantization meant converting to INT8, but as large language models (LLMs) — which require massive memory and compute — proliferate, demand for even lower-bit quantization (INT4, INT2, and beyond) is growing. As precision drops, quantization error rises steeply, so QAT is the suitable approach for preserving model performance; yet, because the training difficulty is very high, active research is also under way on PTQ methods that minimize accuracy loss at 4 bits and below.
GPTQ
GPTQ is one of the most widely used methods for 4-bit quantization. Fundamentally, it is a layer-wise method: quantization is carried out independently for each layer.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
In GPTQ, the inverse Hessian is used to evaluate the importance of each weight (how much that weight affects model performance). The Hessian, obtained by taking the second derivative of the loss function, represents the curvature with respect to each weight. A smaller inverse-Hessian value indicates a weight that is more critical to model performance.

Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization
A key point in GPTQ is that when a specific weight is quantized, the method updates other weight values while considering how the resulting error affects them. Concretely, the quantization error generated for each weight is redistributed to other weights, with the error weighted by the inverse-Hessian value to reflect weight importance. In other words, important weights are adjusted only slightly, while less important weights are adjusted more, thereby minimizing the overall quantization error of the entire layer.
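A heavily simplified, single-row sketch of that update rule is shown below (toy data, no blocking or Cholesky tricks, and a single 4-bit scale per row); real GPTQ implementations are considerably more involved, but the error-redistribution step is the same idea.

```python
import numpy as np

def quantize_value(w, scale):
    """Round to the symmetric INT4 grid and dequantize."""
    return np.clip(np.round(w / scale), -7, 7) * scale

def gptq_like_row(w, H_inv, scale):
    """Quantize one weight row left to right, spreading each error over the
    not-yet-quantized weights in proportion to the inverse Hessian."""
    w = w.copy()
    w_q = np.zeros_like(w)
    for i in range(len(w)):
        w_q[i] = quantize_value(w[i], scale)
        err = (w[i] - w_q[i]) / H_inv[i, i]   # importance-normalized error
        w[i + 1:] -= err * H_inv[i, i + 1:]   # compensate in the remaining weights
    return w_q

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))                 # toy calibration inputs: 16 features, 64 samples
H = 2 * X @ X.T + 1e-2 * np.eye(16)           # layer Hessian, with damping to keep it invertible
H_inv = np.linalg.inv(H)

w = rng.normal(size=16)
scale = np.abs(w).max() / 7.0                 # one symmetric 4-bit scale for the row
w_q = gptq_like_row(w, H_inv, scale)
naive = quantize_value(w, scale)              # plain round-to-nearest for comparison

# The compensated version usually gives a lower layer-output error than plain rounding
print(np.mean((X.T @ (w - w_q)) ** 2), np.mean((X.T @ (w - naive)) ** 2))
```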
GGUF
GGUF is a file format developed to deploy AI models (especially LLMs) efficiently on various hardware, including CPUs as well as GPUs, and it supports multiple quantization schemes. Thus, when you save a model in GGUF, the model’s weights can be quantized in several ways as needed, and they are typically stored in formats such as Q4_K or Q8_0.

Source: Hugging Face
Using the example Q4_K_M shown in the figure above, let’s look at how GGUF models are named.
Q4: a model quantized to 4 bits
K: a model quantized with the K-Quant method. The weights are divided into blocks of a fixed size, and a different scale is applied to each block during quantization (a simplified block-wise sketch follows after this list). If this character is 0 or 1 instead (e.g., Q4_0), the model uses the older, simpler scheme with just one scale (and optionally a minimum) per block.
M: the "medium" variant of the K-Quant mix, which keeps some tensors at a higher precision for better quality. If this is S (e.g., Q4_K_S), it is the "small" variant, which quantizes more of the model and is therefore smaller but slightly less accurate.
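To give a feel for what "a different scale per block" means, here is a minimal sketch of block-wise 4-bit quantization in the spirit of Q4_0 (block size 32, one scale per block); it mirrors the general idea rather than llama.cpp's exact scaling rule and bit packing.

```python
import numpy as np

BLOCK = 32  # llama.cpp-style block size

def quantize_q4_0_like(w):
    """Split a weight vector into blocks of 32 and quantize each block to 4 bits
    with its own scale (symmetric, codes in [-8, 7])."""
    w = w.reshape(-1, BLOCK)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)   # would be packed into nibbles on disk
    return q, scales.astype(np.float16)                        # 4-bit codes + one FP16 scale per block

def dequantize(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_q4_0_like(w)
w_hat = dequantize(q, scales)
print(np.abs(w - w_hat).mean())   # small per-block reconstruction error
```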
AWQ
AWQ minimizes quantization error by considering the influence of activations when quantizing weights, while keeping certain highly important weights in floating-point (FP) format to preserve the original model’s performance.

Source: AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
The figure above shows the performance loss (perplexity — the lower, the better) that occurs when an FP16 language model is quantized. When all weights are simply rounded (RTN, Round to Nearest), performance drops steeply, but preserving the FP values of a subset of channels maintains the original model’s performance.
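The observation behind the figure can be reproduced in spirit with a few lines: rank weight channels by the average magnitude of the activations feeding them, keep the top ~1% in floating point, and round the rest to INT4. This illustrates the motivating experiment only, not the full AWQ algorithm, which instead rescales the salient channels so that everything can remain quantized.

```python
import numpy as np

def rtn_int4(w, scale):
    """Plain round-to-nearest 4-bit quantization (symmetric)."""
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 256)) * (rng.random(256) * 3 + 0.1)  # activations with per-channel spread
W = rng.normal(size=(256, 128)).astype(np.float32)             # weights: in_features x out_features

# Salient input channels = those that see the largest activations on calibration data
channel_importance = np.abs(X).mean(axis=0)
n_keep = max(1, int(0.01 * W.shape[0]))                        # keep ~1% of the channels
salient = np.argsort(channel_importance)[-n_keep:]

scale = np.abs(W).max() / 7.0
W_rtn = rtn_int4(W, scale)                                     # quantize everything (RTN)
W_mixed = W_rtn.copy()
W_mixed[salient] = W[salient]                                  # keep the salient rows in FP

err_rtn = np.mean((X @ W - X @ W_rtn) ** 2)
err_mixed = np.mean((X @ W - X @ W_mixed) ** 2)
print(err_rtn, err_mixed)  # keeping the salient channels typically cuts the output error sharply
```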
What’s next: Below 4-bit
As introduced above, research on methods such as GPTQ and AWQ, which can perform low-bit quantization efficiently while minimizing accuracy loss, is actively under way. However, most existing PTQ methods perform well down to 4 bits but become difficult to use at lower precision due to severe performance degradation.
Therefore, most existing sub-4-bit extreme low-bit models are designed from the training stage with architectures optimized for low-bit operations. Microsoft's recently released BitNet b1.58 2B4T likewise replaces the traditional high-precision linear layers with custom "BitLinear" layers that operate at just 1.58 bits.

Source: Microsoft
The figure above confirms that the BitNet b1.58 model is overwhelmingly superior to similarly sized language models in inference speed and memory/power efficiency. Even so, such extreme low-bit models have not yet been widely adopted, because training them is extremely difficult and very few inference backends currently support operations below 4 bits.
ENERZAi recently succeeded in running a proprietary 1.58-bit Whisper-Small model on an Arm-based SoC using our next-generation AI inference engine Optimium. Going forward, we plan to continue R&D in AI compression and optimization to implement a broader range of low-bit models (LLMs, VLMs, and more) in an on-device form. We will cover the details of the ENERZAi team’s 1.58-bit Whisper model in a separate post, so please stay tuned!