1.58-bit Quantization — the Wegovy for AI models 💉

ENERZAi is actively developing and deploying AI models quantized to 1.58-bit using Quantization-Aware Training (QAT), combined with our own inference engine and programming language — Optimium and Nadya — to build and optimize custom kernels designed specifically for 1.58-bit inference.

Hanhim Chang

July 25, 2025

Hi everyone! This is Daniel Chang from ENERZAi — a team committed to Edge AI like no other. Our vision is to “deliver the best AI experience on everything for everyone” by developing high-performance AI technologies that enrich our lives — even in the harsh computing environments of edge devices, which are far more constrained than servers or data centers.

With this background in mind, we recently published a blog post introducing Whisper, the most popular speech recognition model and one that genuinely enriches our daily lives:

Whisper: A Revolution in Speech Recognition AI

At the very end of that article, we mentioned that we had showcased our 1.58-bit Whisper model, derived using our extreme low-bit quantization technology, at a recent exhibition. Building on that, we took a deeper look at quantization itself in our most recent post:

Quantization: A Key Technology for Building ‘Lightweight’ AI Models

In that post too, we wrapped up by highlighting our focus on sub-4-bit and extreme low-bit models. Looking back a bit further, we had also published a series on our custom-built inference backend, Optimium!

Exploring Optimium (1): Inference Optimization Techniques

Exploring Optimium (8): The Nadya Optimizing Compiler

Today, I’d like to tie all three threads together and explain the why & how behind the 1.58-bit inference model that our team is currently focusing on most intensively.

TL;DR

  • As the use of large models such as Large Language Models (LLMs), speech recognition, and translation models increases, efforts to deploy them on edge devices are also growing.

  • However, due to constraints in memory, power, and computation on edge devices, it’s not realistic to deploy these large models in their original 32-bit or 16-bit format. This has led to a growing need for extreme low-bit quantization — below 4 bits — to minimize memory usage and power consumption.

  • But quantizing a model below 4 bits without significant accuracy loss is technically very challenging. And even if you succeed in quantizing a model to such low precision, the lack of inference backends that can run it in real-world settings makes deployment difficult.

  • ENERZAi is actively developing and deploying AI models quantized to 1.58-bit using Quantization-Aware Training (QAT), combined with our own inference engine and programming language — Optimium and Nadya — to build and optimize custom kernels designed specifically for 1.58-bit inference.

  • In this post, we’ll introduce a project where we minimized the accuracy loss of the Whisper (Small) model while reducing memory usage to less than a quarter, cutting power consumption to less than half, and improving processing speed by more than 2x.

  • If you’re looking to run large models on edge devices with minimal memory and power consumption through extreme low-bit inference techniques like 1.58-bit quantization — we’d love to hear from you!

Background

Chances are, many of you reading this are already using generative AI services like ChatGPT on a daily basis. I myself rely on conversational AI in areas as trivial as deciding what to eat or figuring out how to apologize to my angry wife. Clearly, AI has already become deeply embedded in our lives.

These generative AI services typically run in data centers where abundant power and compute resources are available. This is because stable and powerful server environments are a must in order to serve large language or multimodal models in real time.

But if we shift our perspective slightly, we’ll notice that many of the AI technologies we interact with more frequently are actually running on devices around us. Facial recognition that unlocks your phone, voice assistants, and driver-assistance systems in vehicles mostly operate at the device level, not on servers. However, reproducing the “best AI experience” of massive data centers within the tiny confines of a handheld device is a major technical challenge.

Recent efforts in on-device AI now extend well beyond traditional CNN-based vision models to encompass Generative-AI-driven LLMs. Running such models locally, however, is challenging: LLMs dwarf CNNs in size, while edge devices face tight constraints on CPU/GPU cycles, DRAM capacity and bandwidth, power budget, and real-time responsiveness. In the cloud, dozens of gigabytes of memory and many parallel accelerators are readily available; on the edge, by contrast, limited memory and I/O bandwidth make data movement a primary performance bottleneck.

Moreover, because edge devices typically juggle multiple tasks rather than dedicating all resources to a single AI workload, 8-bit or 4-bit quantization might not be sufficient. Therefore, aggressive model compression and system-level optimization are no longer optional for Edge AI — they are mandatory.

Within this context, ENERZAi is actively applying 1.58-bit quantization to our Edge AI projects. In this post, we’ll share a case study of how we quantized a popular model to 1.58-bit, then implemented & deployed custom kernels using our in-house inference engine and programming language, Optimium and Nadya. We’ll also explain why we are uniquely positioned to deliver models with such extreme efficiency.

1.58-bit Quantization

Quantization is a technique that reduces the size and computational load of neural network models by converting their weights and activations into lower-precision numerical representations.

Typically, AI models are trained and run using 32-bit floating-point values. By converting these to 8-bit integers (int8) — or even lower — we can significantly reduce memory usage, improve computational speed, and lower power consumption.
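As a quick refresher, a textbook way to express this mapping is the affine int8 scheme below, where a scale s and a zero point z map a floating-point value x to an 8-bit integer q (this is the generic formulation, not a scheme specific to any particular framework):

```latex
q = \operatorname{clip}\left(\operatorname{round}\left(\tfrac{x}{s}\right) + z,\ 0,\ 255\right),
\qquad
\hat{x} = s\,(q - z)
```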

This is especially effective for memory-bound models like large language models (LLMs), where memory access speed is a major performance bottleneck. For more information, we’ve covered quantization in detail in a previous blog post.

That post also compared Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT):

  • PTQ applies quantization after training is complete. It’s simple and quick to implement, but often comes with greater accuracy loss.

  • QAT, on the other hand, simulates quantization during additional training so the model can learn to adapt to quantization noise, leading to a better chance of robust and accurate output.

The 1.58-bit quantization we’ll discuss today is an extreme low-bit quantization technique that approximates weights using only three discrete values: -1, 0, and 1.

Unlike typical int8 (256 values) or int4 (16 values) quantization schemes, 1.58-bit quantization restricts the representation to just three values. This enables maximum model compression and also allows inference operations to be replaced by simpler sign-based computations — a major advantage in hardware implementation.
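To make the sign-based computation concrete, here is a minimal Python sketch (purely illustrative, not our production kernel): with weights restricted to -1, 0, and +1, every multiply in a dot product collapses into an add, a subtract, or a skip.

```python
def ternary_dot(weights, activations):
    """Dot product where every weight is restricted to {-1, 0, +1}.

    No multiplications are needed: each weight either adds the activation,
    subtracts it, or skips it entirely.
    """
    acc = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        # w == 0: contribute nothing
    return acc

# Example: [-1, 0, 1] · [0.5, 2.0, 1.5] = -0.5 + 0 + 1.5 = 1.0
print(ternary_dot([-1, 0, 1], [0.5, 2.0, 1.5]))
```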

The term “1.58-bit” doesn’t refer to literal bit width but rather stems from information theory: log₂(3) ≈ 1.58

This represents the theoretical number of bits needed to express 3 values. In actual implementation, each weight is approximated as one of -1, 0, or 1.
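As a representative formulation (the absmean scheme popularized by BitNet b1.58; our exact recipe may differ in details such as the scaling statistic), the weight matrix W is scaled by its mean absolute value and each element is rounded to the nearest of the three levels:

```latex
\widetilde{W} = \operatorname{RoundClip}\left(\frac{W}{\gamma + \epsilon},\ -1,\ 1\right),
\qquad
\gamma = \frac{1}{nm}\sum_{i,j}\lvert W_{ij}\rvert,
\qquad
\operatorname{RoundClip}(x, a, b) = \max\left(a,\ \min\left(b,\ \operatorname{round}(x)\right)\right)
```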

To make this technique usable in real Edge AI applications, we applied Quantization-Aware Training (QAT) to Whisper, a Transformer-based speech recognition model developed by OpenAI.

Our experiments showed that QAT is significantly more effective than PTQ in this extreme quantization regime. During training, QAT simulates quantization operations, allowing the model to learn how to tolerate and compensate for the effects of quantization — particularly when representation is severely limited, as in the case of 1.58-bit.
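To illustrate what that simulation looks like in code, below is a minimal PyTorch-style sketch of ternary fake quantization with a straight-through estimator (STE). It is a generic pattern for this kind of QAT, not our actual training code, and the class names are ours for illustration.

```python
import torch

class TernaryFakeQuant(torch.autograd.Function):
    """Fake-quantizes weights to {-1, 0, +1} in the forward pass while
    letting gradients flow through unchanged (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, eps=1e-5):
        gamma = w.abs().mean()                                   # absmean scale
        w_q = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)  # ternary levels
        return w_q * gamma                                       # dequantized ternary weights

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                                 # STE: pass gradient through

class TernaryLinear(torch.nn.Linear):
    """Linear layer whose forward pass sees ternarized weights during QAT."""
    def forward(self, x):
        w_q = TernaryFakeQuant.apply(self.weight)
        return torch.nn.functional.linear(x, w_q, self.bias)
```

During QAT, layers like TernaryLinear stand in for the model's regular linear layers, so the forward pass experiences quantization noise while the optimizer keeps updating the full-precision master weights.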

Whisper is available in multiple sizes, as shown below:

Source: https://huggingface.co/openai/whisper-large-v3

When we applied 2-bit PTQ to Whisper models of various sizes, we evaluated their Word Error Rate (WER, %), a metric of how accurately words are recognized: every misrecognized word counts as an error, and lower WER indicates better performance.
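For reference, WER is computed from the word-level edit distance between the recognized transcript and the reference transcript, counting substitutions (S), deletions (D), and insertions (I) against the number of reference words (N):

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
```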

In cases where cells are left blank, the models produced infinitely repeated phrases or failed completely — a clear sign of performance degradation. This degradation occurs because low-precision quantized weights fail to preserve the original weight distribution, which pushes the model far from optimal solutions in weight space.

In contrast, QAT trains the model while simulating the quantization process, allowing it to adjust parameters accordingly. This compensates for quantization errors and significantly preserves accuracy.

In extreme cases like 1.58-bit quantization, where representation is severely limited, we found that only QAT can preserve acceptable performance levels — PTQ alone is insufficient. In line with this, the 1.58-bit Whisper model we introduce in this post also takes this approach, making it feasible for real-world Edge AI applications.

For our 1.58-bit Whisper (Small) model, we performed QAT using approximately 40,000 hours of speech data. The datasets included public sources such as LibriSpeech and Common Voice.

Given the large volume of data, we needed considerable training infrastructure — and fortunately, we received timely support through the Google for Startups: AI First program, which provided both Google Cloud credits and technical assistance.

Introducing the 2025 ‘Google for Startups Accelerator: AI First’ Demo Day!

Thanks to this support, we were able to conduct QAT using multiple NVIDIA H100 instances and tens of terabytes of SSD storage. In total, the cloud credits we used over about three weeks were worth around 150 million KRW (~$115,000 USD). Once again, I’d like to sincerely thank the Google team for making this dreamlike experience possible 🙇‍♂️👏 — You guys are the best!

While we used about 40,000 hours of open data for QAT, the original Whisper model was reportedly trained on 680,000 hours of paired speech-text data. If we had access to that much training data, our quantized model could likely have achieved even higher performance and robustness.

So — if anyone from OpenAI happens to be reading this, please don’t hesitate to reach out. We’d love to collaborate! 🙏

Implementing 1.58-bit Custom Kernels

Successfully completed QAT for your 1.58-bit model? Congratulations! You’re just halfway there 🫠 Now comes the equally critical step: implementing a dedicated inference kernel that can actually run the model on edge devices.

Common frameworks like PyTorch and TensorFlow are optimized for training and require extensive third-party dependencies and complex runtimes. They consume significant memory and are typically unsuitable or highly inefficient on embedded or edge devices where Python environments are not supported.

In real-world deployments, inference-specific engines or backend libraries are typically used. Examples include TensorFlow Lite (see our previous blog post), or popular lightweight C++ implementations like llama.cpp and whisper.cpp, especially for language models.

The problem? None of these widely-used engines support custom low precisions like 1.58-bit. So unless you wait for existing frameworks to someday support sub-4-bit inference, the only option is to build custom kernels yourself. Fortunately, at ENERZAi, we’ve already developed our own inference engine Optimium and domain-specific programming language Nadya — and we used them to tackle this exact challenge.

Optimium takes models trained in PyTorch or TFLite as input, performs operator fusion and graph-level optimizations, and generates a deployment-ready model customized for the target device.

It works in tandem with Nadya, a language we developed specifically for inference optimization. Nadya is based on MLIR (Multi-Level Intermediate Representation) and compiles models into .so shared libraries after a series of optimization passes.

Nadya’s metaprogramming capabilities dynamically generate code based on actual performance profiling results from the target hardware, ensuring optimal runtime efficiency. For deeper details, check out our previous blog series on Optimium and Nadya.

Even among devices using the same CPU architecture (e.g., Arm Cortex-A73), manufacturers may differ in SIMD instruction set support (e.g., NEON, fp16), memory bandwidth, or cache architecture. Manually writing and tuning kernels for all these hardware variations is no longer practical.

In contrast, Optimium uses Nadya’s abstraction and auto-code-generation system to port the same kernel across diverse hardware environments. Internally, it employs explore–exploit trade-off algorithms to automatically select the best-performing kernel among thousands of candidates.
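As a rough sketch of the idea (not Optimium's actual search algorithm), such a search can be framed as an epsilon-greedy loop that benchmarks candidate kernels on the target device and gradually favors the fastest ones; the function and parameter names below are illustrative.

```python
import random

def pick_best_kernel(candidates, benchmark, trials=200, epsilon=0.2):
    """Epsilon-greedy search over candidate kernel implementations.

    candidates: list of callables implementing the same operator
    benchmark:  caller-supplied function that runs one candidate on the
                target device and returns its measured latency in seconds
    """
    runs = [0] * len(candidates)
    total = [0.0] * len(candidates)

    for _ in range(trials):
        measured = [i for i in range(len(candidates)) if runs[i] > 0]
        if not measured or random.random() < epsilon:
            idx = random.randrange(len(candidates))                    # explore a random candidate
        else:
            idx = min(measured, key=lambda i: total[i] / runs[i])      # exploit the fastest so far
        latency = benchmark(candidates[idx])
        runs[idx] += 1
        total[idx] += latency

    best = min((i for i in range(len(candidates)) if runs[i] > 0),
               key=lambda i: total[i] / runs[i])
    return candidates[best]
```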

This is especially important for 1.58-bit kernels, whose computation patterns differ significantly from standard GEMM operations. Therefore, a custom kernel is essential.

Using Optimium, we defined and compiled a 1.58-bit kernel tailored to our QAT-optimized Whisper model — producing highly efficient, deployable results. Kudos to the Optimium team! 🥂

Results

Among the various Whisper variants, the Whisper Small model offers a good balance between performance and efficiency. It’s considered the upper bound of what can realistically run on edge devices for speech recognition tasks. Accordingly, the 1.58-bit quantization introduced in this post was applied to the Small model.

The benchmark results we’ll share below were measured on our valued partner Synaptics’ Astra SL1680 board, which features a quad-core Arm Cortex-A73 processor; all benchmarks were run on this CPU.

For comparison, we evaluated three versions of the Whisper Small model:

  • A float16 baseline model

  • A 4-bit PTQ model

  • Our custom 1.58-bit QAT model

The float16 and 4-bit baselines ran on whisper.cpp, the most widely used backend implementation for on-device Whisper inference.

We evaluated model accuracy based on Word Error Rate (WER) using the LibriSpeech dataset. Our 1.58-bit QAT model showed only about 0.3% WER degradation compared to the float16 baseline, demonstrating that aggressive quantization via QAT can maintain high accuracy while greatly reducing model size.

We also profiled inference performance for each quantized model using a 9-second audio input:

  • Peak memory usage was measured using the time command on Linux.

  • Inference latency was measured using the C++ chrono library.

The results were clear:

  • The 1.58-bit QAT model reduced peak memory usage by up to 4x compared to float16.

  • Latency was cut nearly in half.

  • While the 4-bit PTQ model also achieved some acceleration, it still consumed more than twice the memory of the 1.58-bit model.

In Edge AI environments, it’s not just a single AI workload running — devices typically juggle multiple tasks in parallel. And most edge devices have less than a few GB of total memory. So optimizations like these have direct impact on system stability and user experience.

Memory and speed aside, another crucial consideration for edge deployment is power consumption. We conducted a power comparison under the same conditions. On the SL1680 Cortex-A73 platform, when processing 12 seconds of speech:

  • The 16-bit baseline model running on whisper.cpp consumed 0.0213 Wh

  • The 1.58-bit model running on Optimium consumed only 0.0088 Wh

That’s roughly 40% of the original power consumption — for the same output.

🎥 Seeing is Believing

Here’s a demo of our 1.58-bit Whisper model in action:

This video was recorded at the 2025 Embedded Vision Summit in Santa Clara this May, where we live-demoed the model discussed in this post. Many visitors told us they’d heard about 1.58-bit models, but had never seen one actually running — and they were genuinely impressed.

At that same event, our inference engine Optimium was honored with the 2025 “Product of the Year” Award — a proud moment for our team!

The first day of the 2025 Embedded Vision Summit main program was one for the books

To add one more proud moment: in June, we received the Best Industrial Technology Paper Award (Grand Prize) during the Summer Conference of the Institute of Electronics and Information Engineers, for our work on 1.58-bit QAT and kernel implementation.

Conclusion & What’s Next

Whisper and other speech recognition models are central to Edge AI interfaces. If you connect them with Natural Language Understanding (NLU) models, you can build voice control systems for appliances like washing machines or air conditioners. Add a Text-to-Speech (TTS) model on top, and you get a voice assistant pipeline that can guide users through device usage conversationally. The more lightweight and optimized each model in this pipeline is, the more viable and efficient the system will become — especially at the edge.

We’re already working on multiple models — not just Whisper — using extreme low-bit quantization techniques for various projects. In our next post, we’ll showcase how we deployed 1.58-bit Whisper + NLU models on a Raspberry Pi 5 CPU to build a voice-controlled smart lighting application.

Stay tuned as we’ll continue to support our customers by delivering extremely low-bit models that enable the best AI experiences on devices with minimal memory and energy budgets.

For collaboration or inquiries, feel free to reach out to us anytime!
📧 hanhim.chang@enerzai.com
