Changbeom Kang
August 11, 2025
Hi, this is Changbeom from ENERZAi. In our previous posts, we explored the landscape of AI model quantization and shared how we applied 1.58-bit quantization with custom kernels to Whisper. Today, we’re going one step deeper — digging into the technical details of 1.58-bit quantization, BitNet architecture, and why this matters for real-world edge AI deployments.
BitNet: Rethinking Transformers at 1 Bit
In 2023, Microsoft introduced BitNet, a bold attempt to break the efficiency barriers of large language models (LLMs). BitNet keeps the standard Transformer architecture intact but replaces the traditional linear layers with a novel BitLinear layer — quantizing weights down to just 1 bit.
But here’s where it gets more interesting.
The BitNet-b1.58 model uses a ternary quantization scheme with weights restricted to {−1, 0, +1}, which effectively encodes ~1.58 bits per weight (log₂3). This tiny trick results in over 90% memory savings compared to FP16 — all while keeping performance surprisingly high.
BitLinear: How It Works
At the core of BitNet lies the BitLinear operation. Here's how it works:
Weight binarization: weights are mapped to +1 or −1 via the Sign function.
Centralization: zero-centering the weights increases representation capacity.
Scaling: a learnable β parameter minimizes the error between the binary and real-valued representations.
Activation quantization: activations are quantized using 8-bit absmax, forming a W1A8 structure.
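To make these steps concrete, here is a minimal PyTorch-style sketch of a BitLinear-like forward pass following the four steps above. It is an illustration rather than BitNet's reference implementation: the class name is ours, β is modeled as a single learnable per-layer scale, and the straight-through handling needed for training is omitted.

import torch
import torch.nn as nn

class BitLinearSketch(nn.Module):
    """Illustrative BitLinear-style layer (W1A8), simplified from the steps above."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.beta = nn.Parameter(torch.ones(1))  # learnable scale

    def forward(self, x):
        # Centralization: zero-center the real-valued weights
        w = self.weight - self.weight.mean()
        # Weight binarization: map to {-1, +1} via sign (the sign(0) = 0 edge case is ignored here)
        w_bin = torch.sign(w)
        # Scaling: beta narrows the gap between binary and real-valued weights
        w_q = self.beta * w_bin
        # Activation quantization: 8-bit absmax (symmetric, per tensor)
        scale = 127.0 / x.abs().max().clamp(min=1e-5)
        x_q = (x * scale).round().clamp(-128, 127) / scale
        return nn.functional.linear(x_q, w_q)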
What’s fascinating is that BitNet exhibits similar scaling laws to full-precision models. As the model size increases, the performance gap between FP16 and BitNet shrinks, hinting that extreme low-bit quantization might not just be compression — it could be an entirely new computing paradigm.

Another standout feature of BitNet is its energy efficiency. According to measurements on 7nm hardware, BitNet consumes ~40× less energy for multiplication operations and ~3× less for addition operations compared to standard FP16 Transformers.
When applied to a 30B parameter model, BitNet reduces total energy consumption by a staggering 38.8×. That’s a game-changer — enabling practical deployment of LLMs on battery-powered edge devices.
Why QAT Is Crucial for Extreme Low-bit Models
Thanks to recent advances in quantization techniques, Post-Training Quantization (PTQ) can now deliver strong performance even at 4-bit precision. But once you dip below 4 bits, things change dramatically. In this extreme low-bit regime, PTQ alone just isn’t enough — Quantization-Aware Training (QAT) becomes essential.
Here’s why: PTQ applies quantization after pretraining, which works fine down to 4 bits. But at 2 bits or lower, models begin to suffer sharp drops in performance.
In our own experiments at ENERZAi, we applied PTQ to a Whisper Small model, quantizing it to 2 bits. The result?
A Word Error Rate (WER) of 37.06% — a massive degradation that renders the model practically unusable.

These findings are also echoed in the ParetoQ paper. One particularly interesting observation from their research is the pattern of weight changes that occurs during quantization training of a 16-bit model.
When quantization is applied to this 16-bit model, the amount by which the weights change varies significantly depending on the target bit-width, as shown in the figure below. For models quantized to 3 bits, the weights shift by only about 10–20%. However, at 2 bits, 1.58 bits, and 1 bit, the changes become much more dramatic — exceeding 40%.

This behavior is closely related to what we described in our earlier post — the way the optimization landscape shifts during QAT:
At 3–4 bits, the model adjusts its weights slightly around their original values — a form of compensation.
At 1–2 bits, the model needs to learn entirely new representations — a full reconstruction process.
QAT enables this by simulating quantization noise through fake quantization, guiding the model to converge toward wide minima — flatter regions in the loss landscape that are more tolerant to weight perturbations. This makes the model far more resilient to quantization, preserving performance even at extreme low bit widths.
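To give a feel for what fake quantization looks like in code, here is a generic sketch of ternary fake quantization with a straight-through estimator. It is not our production QAT pipeline; the per-tensor absmean scale and the function name are illustrative assumptions.

import torch

def fake_quantize_ternary(w: torch.Tensor) -> torch.Tensor:
    """Simulate 1.58-bit (ternary) quantization during training.

    Forward: weights are snapped to {-1, 0, +1} times a per-tensor scale.
    Backward: the straight-through estimator passes gradients to the
    full-precision weights, so training can route around the quantization noise.
    """
    scale = w.abs().mean().clamp(min=1e-5)  # per-tensor scale (assumed)
    w_q = torch.clamp(torch.round(w / scale), -1, 1) * scale
    # Straight-through estimator: forward value is w_q, gradient flows as identity.
    return w + (w_q - w).detach()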
LUT Kernels: Making 1.58-bit Work in the Real World
One of the biggest challenges in deploying extreme low-bit quantized models in the real world is hardware compatibility. In particular, for 1.58-bit quantization, the main technical hurdle lies in efficiently executing mixed-precision matrix multiplications (mpGEMM). Most existing hardware architectures are optimized for symmetric precision operations — such as W16A16 or W8A8 — where both weights and activations share the same bit-width.
But in our case, we’re also working with asymmetric setups like W1.58A16, which are not natively supported by most hardware platforms.
As a workaround, many systems fall back on dequantization — converting low-bit weights back to high-precision floats before computation. However, this comes at a cost: The lower the bit-width, the greater the dequantization overhead. In fact, we’ve observed that moving from 4-bit to 1-bit quantization actually results in increased latency, purely due to the rising cost of dequantization.
T-MAC: A Paradigm Shift in Low-bit Computation
T-MAC fundamentally addresses this challenge by shifting the computation paradigm to bit-level operations using a Look-Up Table (LUT)-based approach. The core idea is to transform conventional, data type–centric computation into a series of bitwise operations:
A × W = A × Σᵢ(2ⁱ × Wᵢ) = Σᵢ 2ⁱ × (A × Wᵢ)
In other words, an n-bit weight matrix is decomposed into n separate 1-bit matrices, each processed sequentially. The results are then aggregated to reconstruct the final computation.
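Here is a quick NumPy sketch of that decomposition for unsigned n-bit weights (our own illustration; signed or ternary weights need extra bookkeeping, as in the walkthrough below).

import numpy as np

rng = np.random.default_rng(0)
n_bits = 4
A = rng.integers(-128, 128, size=(1, 64)).astype(np.int32)      # 8-bit activations
W = rng.integers(0, 2**n_bits, size=(64, 64)).astype(np.int32)  # unsigned n-bit weights

# Decompose W into n one-bit planes: W = sum_i 2^i * W_i, with W_i in {0, 1}
planes = [(W >> i) & 1 for i in range(n_bits)]

# A x W equals the sum of the per-plane products scaled by powers of two
reference = A @ W
bitwise = sum((1 << i) * (A @ Wi) for i, Wi in enumerate(planes))
assert np.array_equal(reference, bitwise)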
While this may sound abstract in equation form, let’s walk through a concrete example based on our actual implementation of 1.58-bit Whisper.
Suppose we take 4 consecutive input values — a, b, c, d. With 1.58-bit quantization, each weight can take on one of three values: {−1, 0, +1}.
That results in 3⁴ = 81 possible combinations between the weights and the inputs.
We can precompute all these combinations into a lookup table with 81 entries. During inference, instead of storing the full set of 4 weights, we store a single index into this table. The computation becomes a simple table lookup.
The benefit?
Originally, this would have required mixed-precision multiply-and-add operations between 8-bit inputs and 1.58-bit weights — an inefficient setup. But now, we can reduce the entire operation to a lookup of a precomputed sum, calculated using just addition and subtraction of 8-bit inputs.
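Here is a minimal Python sketch of how such a table could be precomputed for one group of 4 inputs. It illustrates the idea only and is not the actual kernel code; the function name and layout are our own.

import itertools
import numpy as np

def build_ternary_lut(a, b, c, d):
    """Precompute all 3^4 = 81 partial sums for 4 inputs and ternary weights.

    Each entry is w0*a + w1*b + w2*c + w3*d with every w_i in {-1, 0, +1},
    i.e. nothing more than additions and subtractions of the inputs.
    Entries are kept in int16 so that sums of 8-bit inputs cannot overflow.
    """
    inputs = np.array([a, b, c, d], dtype=np.int16)
    lut = np.empty(81, dtype=np.int16)
    for idx, weights in enumerate(itertools.product((-1, 0, 1), repeat=4)):
        lut[idx] = sum(int(w) * int(x) for w, x in zip(weights, inputs))
    return lut

# At inference time, each group of 4 ternary weights is stored as one index
# into this table, and its contribution to the dot product is a single lookup.
lut = build_ternary_lut(12, -7, 33, 5)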
That said, this approach introduces new challenges.
To safely handle additions/subtractions of 8-bit inputs without overflow, we need 16-bit intermediate values.
An LUT with 81 entries, each 16 bits, results in a 1296-bit table.
However, most hardware architectures can’t load a 1296-bit table in a single lookup. For example, on ARM architectures, table lookups are typically limited to 128-bit or 256-bit chunks of 8-bit elements.
So we need to split our 1296-bit table into two 648-bit tables, each holding 8-bit entries. If we use 256-bit lookups, we need three lookups per table (since 3×256 = 768 > 648), totaling six lookups for every computation.
Even though lookups are relatively efficient, performing six of them per operation — and maintaining such a large table in registers — becomes increasingly inefficient.


Here’s a neat trick to solve this problem.
With 1.58-bit quantization, weights can only take values from {−1, 0, +1}. Now, notice something important: +1 and −1 differ only by their sign, and 0 is unaffected by sign changes.
This means any {0, −1} combination can be derived directly from the corresponding {0, +1} combination. For example, if a + b = 16, then −a − b = −16! By leveraging this property, we can reduce the problem from looking up all 3⁴ = 81 ternary combinations to looking up just the 2⁴ = 16 binary {0, 1} combinations.
Recalculating the table size:
Before: 81 entries × 16 bits = 1296 bits
After: 16 entries × 16 bits = 256 bits
This is a huge reduction. With the smaller table, on ARM architectures we can store it as a 128-bit table of 8-bit elements, requiring just two lookups to retrieve the positive part, plus two more lookups for the negative part. That’s 4 lookups total, down from 6 — a 33% reduction in lookup operations. And because the table is now only 256 bits, it can be easily kept in registers, making the entire process significantly more efficient.
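Here is a small Python sketch of the sign trick (again illustrative, not the ARM kernel): only the 2⁴ = 16 {0, 1} combinations are tabulated, and the {0, −1} part of each weight group reuses the same table with its sign flipped.

import numpy as np

def build_binary_lut(inputs):
    """16-entry table: entry idx holds the sum of inputs[i] for every set bit i of idx."""
    lut = np.zeros(16, dtype=np.int16)  # int16 to avoid overflowing sums of 8-bit inputs
    for idx in range(16):
        lut[idx] = sum(inputs[i] for i in range(4) if (idx >> i) & 1)
    return lut

def ternary_dot(lut, weights):
    """Dot product of one group of 4 ternary weights with the inputs behind lut.

    The {0, +1} part and the {0, -1} part of the weights each index the same
    16-entry table; the negative contribution comes for free by flipping the sign.
    """
    pos_idx = sum(1 << i for i, w in enumerate(weights) if w == +1)
    neg_idx = sum(1 << i for i, w in enumerate(weights) if w == -1)
    return int(lut[pos_idx]) - int(lut[neg_idx])

lut = build_binary_lut([12, -7, 33, 5])
assert ternary_dot(lut, (+1, 0, -1, +1)) == 12 - 33 + 5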
The Practical Value of the 1.58-bit Kernel
To validate the real-world performance of our 1.58-bit quantization, we ran a precision-by-precision performance analysis on a Raspberry Pi 5.
Hardware Specifications:
Processor: ARM Cortex-A76, 4 cores @ 2.4 GHz
Memory: LPDDR4X-4267, 8 GB
Cache: L1 64 KB + L2 512 KB (per core), L3 2 MB (shared)
Memory Bandwidth: Theoretical 17.1 GB/s (measured 8–12 GB/s)
Test Environment: 4 threads used
We began with a GEMV (General Matrix–Vector multiplication) benchmark. GEMV is a good proxy for measuring performance in the auto-regressive decoding phase of LLMs, and it is well-known as a memory-bound operation in LLM inference workloads.

The graph above shows the results of a GEMV operation where we multiply an (n, n) matrix by an (n, 1) vector, gradually increasing the dimension n and measuring execution time for each precision level. The x-axis represents the dimension n, and the y-axis shows the execution time in microseconds (µs).
The results clearly show that lower precisions deliver better performance (in execution time, w1.58 < w8 < w16 < w32), and the performance gap widens as the dimension increases. This behavior is a textbook example of a memory-bound operation.
Using the data from this experiment, we can calculate the memory bandwidth for each precision as shown below.

The graph shows the memory bandwidth (GB/s) for each precision level in the GEMV operation, multiplying an (n, n) matrix by an (n, 1) vector as the dimension n increases.
As n grows, all precision levels converge toward the Raspberry Pi 5’s measured memory bandwidth limit of 8–12 GB/s.
One interesting detail is that for each precision, there’s a region where the measured bandwidth temporarily exceeds this limit. This happens because cache hits dominate in that range.
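For reference, the bandwidth values in the graph can be approximated with a simple model: a GEMV has to stream the (n, n) weight matrix once, so the achieved bandwidth is roughly the bytes of weight data divided by the measured time. A rough sketch of that estimate (our own approximation; it ignores the comparatively small activation and output traffic, and the stored width for w1.58 depends on how the weights are packed):

def gemv_bandwidth_gbps(n: int, bits_per_weight: float, time_us: float) -> float:
    """Approximate achieved bandwidth for an (n, n) x (n, 1) GEMV.

    Assumes the run is dominated by streaming the weight matrix once from
    memory; the O(n) activation and output traffic is neglected.
    """
    bytes_moved = n * n * bits_per_weight / 8.0
    return bytes_moved / (time_us * 1e-6) / 1e9

# Hypothetical numbers purely for illustration: a 4096 x 4096 GEMV at 2 bits
# per stored weight taking 450 us would land in the Pi 5's measured range.
print(f"{gemv_bandwidth_gbps(4096, 2.0, 450):.1f} GB/s")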
If we calculate the matrix size (dimension n) needed to saturate the Raspberry Pi 5’s combined L2 + L3 cache (4 MB), we get:
w32: 1000
w16: 1414
w8: 2000
w1.58: 4000
These values align closely with the points at which each precision’s bandwidth drops sharply and converges to the Pi 5’s memory bandwidth limit.
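These thresholds come from a simple back-of-the-envelope calculation: the weight matrix stops fitting in cache once n² × (bytes per weight) exceeds the 4 MB of combined L2 + L3. A quick check, treating w1.58 as 2 bits of packed storage (our assumption):

import math

CACHE_BYTES = 4 * 1024 * 1024  # combined L2 (4 x 512 KB) + L3 (2 MB) on the Pi 5

for name, bits in [("w32", 32), ("w16", 16), ("w8", 8), ("w1.58", 2)]:
    n = math.sqrt(CACHE_BYTES / (bits / 8))
    print(f"{name}: n ≈ {n:.0f}")
# Prints roughly 1024, 1448, 2048, 4096, in line with the 1000 / 1414 / 2000 / 4000
# breakpoints above (which round the cache size to 4 x 10^6 bytes).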
In short, the GEMV benchmark results confirm what’s widely known: the auto-regressive decoding phase in LLMs is memory-bound. In such memory-bound operations, the 1.58-bit kernel delivers significantly better performance than higher-precision kernels.

Although the 1.58-bit kernel uses mixed precision, it executes operations efficiently through its LUT-based approach. As a result, it delivers competitive — and in many cases superior — performance compared to other kernels that rely solely on single-precision computation.
1.58-bit Whisper
Using the 1.58-bit kernel we just explored — together with ENERZAi’s inference optimization engine, Optimium — we successfully developed a 1.58-bit version of the Whisper model.
Our optimization target hardware was the Synaptics SL1680, with the following specs:
Processor: ARM Cortex-A73, 4 cores @ 2.1 GHz
Memory: LPDDR4X-3733, 8 GB
Cache: 64 KB I-cache + 32 KB D-cache per core, 1 MB shared L2
Test Environment: 4 threads used
WER Performance: Extreme Low-bit Speech Recognition Without Accuracy Loss
We benchmarked our 1.58-bit Whisper Small model against several precision baselines.
From left to right: FP16 (baseline), Q4 (4-bit) PTQ, 1.58-bit QAT, and 2-bit (8-group) PTQ, measuring Word Error Rate (WER) in each case. The results were striking: ENERZAi’s 1.58-bit model showed a WER gap of just 0.39 percentage points compared to FP16 — essentially on par with the 4-bit model. Given that 1.58-bit offers an 8× compression ratio over FP16, this is an exceptional result. And when compared to the 2-bit (8-group) PTQ model’s 14.07% WER, the superiority of the 1.58-bit QAT approach becomes even clearer. This serves as concrete, empirical evidence for the necessity of QAT in extreme low-bit quantization.

Memory Usage: Dramatic Gains in Efficiency
One of the biggest advantages of the 1.58-bit quantized model is its drastically reduced weight memory footprint — improving not only decoding performance but also enabling deployment on memory-constrained edge devices. To quantify this, we compared memory usage across FP16, Q4 (4-bit), and 1.58-bit models.
The results clearly show the benefit: the 1.58-bit model achieves a 77.3% reduction in total memory usage compared to FP16. Even more noteworthy is the drop in model weight memory — from 487 MB down to just 89.2 MB, an 81.6% decrease.
This means that with 1.58-bit quantization, you can run larger models or process longer sequences within the same memory constraints, unlocking higher performance in edge environments.


In fact, the Whisper Small 1.58-bit model’s total memory usage (143 MB) is almost the same as that of the smaller Whisper Base Q4 model (132 MB).
However, because the Small model is larger than the Base model, its accuracy is higher:
Whisper Small 1.58-bit achieves a 6.38% WER, outperforming both Whisper Base FP16 (7.53%) and Whisper Base Q4 (8.25%).
This means that on existing hardware, where memory limitations previously forced you to choose a smaller, less accurate model, you can now run the higher-performing Small model with nearly the same memory footprint.
Latency: Real-Time Inference Speeds
Our latency measurements further demonstrate the practicality of 1.58-bit quantization.
The 1.58-bit model achieved 2.46× faster inference compared to FP16, and even 26% faster than the Q4 model.
Interestingly, the performance gap between FP16 and the 1.58-bit kernel measured on the Astra board was larger than what we observed in the earlier GEMM and GEMV benchmarks on the Raspberry Pi 5.
The reason is straightforward: the Cortex-A73 does not support FP16-specific vector instructions, making FP16 computations inherently inefficient on this hardware.
In contrast, the LUT-based operations used by the 1.58-bit kernel are supported on most architectures, which allows it to run far more efficiently — leading to the significant performance difference.

In this post, we’ve shared a more in-depth look at our 1.58-bit quantization technology, which we’ve introduced several times before. Beyond Whisper, we implement various models in extreme low-bit precision so that our clients can experience cutting-edge AI technology across a wider range of devices. If you have any questions regarding ENERZAi’s technology or solutions, please feel free to reach out at any time.
In our next post, we’ll share real-world examples of how the 1.58-bit Whisper model — detailed in today’s deep dive — can be applied to practical Edge AI applications, such as voice control.