Technology
Today, we share results from re-training the original Whisper for optimal Korean ASR (Automatic Speech Recognition) and applying Post-Training Quantization (PTQ), along with a richer Pareto analysis so customers with different constraints and requirements can pick exactly what fits their needs, including results with 1.58-bit QAT.
Hanhim Chang
September 26, 2025
Hello, this is Daniel Chang from ENERZAi! I’m back on the ENERZAi blog with — yes — another story about Whisper. You might ask, “Whisper again?” but there’s a reason: Whisper is such a powerful model for speech recognition that there’s always more to talk about. 🫠
In our previous post, we took a close look at extreme model compression (1.58-bit) and implementing custom kernels with Optimum. The recognition accuracy we shared for Quantization-Aware Training (QAT) was evaluated on English.
Today, by contrast, we share results from re-training the original Whisper for optimal Korean ASR (Automatic Speech Recognition) and applying Post-Training Quantization (PTQ), along with a richer Pareto analysis so customers with different constraints and requirements can pick exactly what fits their needs, including results with 1.58-bit QAT.
TL;DR
The original Whisper was trained on 680K hours of audio–text pairs, of which roughly 83% are English-based. As a result, for non-English languages (including Korean), accuracy often lags behind English.
ENERZAi re-trained various Whisper models — Small, Base, Tiny — using 50K hours (38M pairs) of Korean audio–text data. We obtained a 13 MB model that outperforms the 3 GB Whisper Large (whisper-large-v3) on Korean ASR. In other words, we achieved better accuracy with a model that’s about 0.4% of the size. 😎
We laid out models along the accuracy–size trade-off curve starting from 13 MB. Our largest re-trained model (484 MB) halved the Character Error Rate (CER) on Korean compared to the 3 GB Whisper Large.
Beyond re-training the model with an augmented dataset, we built a custom tokenizer for Korean and applied several techniques that significantly improved accuracy.
ENERZAi analyzes Pareto-optimal trade-offs among power, memory, and performance across large models to help customers adopt the best-fit configuration. If you need a model tailored for your application, don’t hesitate to contact us!
About Whisper
Whisper is a multilingual, multitask ASR model (transcription and translation — translation target is English only) open-sourced by OpenAI in 2022. We covered Whisper in detail in a previous post 👇

Whisper was trained on ~680K hours of audio–text pairs, of which non-English (non-English audio with non-English transcripts) accounts for only 17% (117,113 hours). Given that those non-English hours span 98 languages, the Korean portion is a tiny fraction of the whole.
In practice, accuracy drops noticeably on Korean compared to English. Internally, we tested whisper-large-v3 and measured a CER of 3.91% on English (test dataset: LibriSpeech Other), but on Korean (test dataset: KsponSpeech Eval-Other) the CER jumped to 11.13%.
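For reference, CER (Character Error Rate) is the character-level edit distance between a hypothesis transcript and the reference, divided by the number of reference characters. A quick way to reproduce this kind of measurement is the open-source jiwer package (a minimal sketch; the text normalization applied before scoring will affect the exact numbers):

```python
# pip install jiwer
from jiwer import cer

reference = "일요일 날 알바를 간다고? 그래서 응"   # ground-truth transcript
hypothesis = "이러면 알바를 간다고? 그래서 응"      # model output

# CER = (substitutions + deletions + insertions) / reference character count
print(f"CER: {cer(reference, hypothesis):.2%}")
```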
EZWhisper KR: Accurate yet Lightweight Korean ASR
As noted above, we succeeded in implementing a Korean Whisper that outperforms the original on Korean while being lightweight and fast enough for on-device environments. Below, we describe how we boosted accuracy for Korean and optimized memory & speed.
Dataset
Although Whisper was trained on a massive dataset, the Korean portion of that training data is quite small. Our first step to improve Korean performance was therefore continued training on Korean datasets.
We performed additional training with ~50K hours of Korean open-source audio–text pairs, drawing from a diverse set of domains — casual conversation, broadcast, medical, contact-center, and more — so the model would perform well across various scenarios. We also added dialectal speech to strengthen robustness.
Another key challenge: Korean often diverges between spelling and pronunciation. Some Korean ASR datasets include both orthographic (spelling) transcripts and phonetic transcripts. For example, the orthographic “3시에 TV를 봐야지” would have a phonetic rendering like “세 시에 티비를 봐야지.”
Either style can work, but inconsistency across datasets hurts training. We therefore normalized everything to a single style via a Korean Normalizer. From a model perspective, phonetic transcripts can be easier to align with audio, but for usability we chose orthographic transcripts. The original Whisper was also trained primarily with orthographic text.
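To make the idea concrete, here is a toy sketch of normalization in that direction (a hand-written rule table covering only the example above; the rule entries are purely illustrative, and a real normalizer needs systematic handling of numbers, dates, units, and English loanwords):

```python
import re

# Toy phonetic-to-orthographic rules for the example sentence above.
RULES = {
    "세 시": "3시",   # spelled-out time -> numeral
    "티비": "TV",     # phonetic loanword -> original spelling
}

def normalize(text: str) -> str:
    for phonetic, orthographic in RULES.items():
        text = text.replace(phonetic, orthographic)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("세 시에 티비를 봐야지"))  # -> "3시에 TV를 봐야지"
```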
Tokenizer
A tokenizer converts text into tokens the model can process (and back into text). During Whisper training, it converts the ground-truth text into the token sequences used to train the decoder.
Because the original Whisper is multilingual, its tokenizer is not specialized for Korean. We hypothesized that the tokenizer contributed to the performance drop on Korean, so we built a Korean-centric tokenizer and retrained the model. The results were strong: on the same Korean test dataset, the CER of the Whisper-Small model dropped from 18.05% to 6.45% after additional training and tokenizer customization, a reduction of more than 11 percentage points.
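As a rough sketch of what building a Korean-centric tokenizer can look like (using the Hugging Face tokenizers library; the corpus file, vocabulary size, and token list below are illustrative assumptions, not our production configuration):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, the same family of tokenizer the original Whisper uses,
# but trained on a Korean corpus so common Korean character sequences
# become single tokens instead of long byte sequences.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # illustrative; balances sequence length against embedding size
    special_tokens=["<|endoftext|>", "<|startoftranscript|>", "<|ko|>",
                    "<|transcribe|>", "<|notimestamps|>"],
)
tokenizer.train(files=["korean_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("ezwhisper_ko_tokenizer.json")

print(tokenizer.encode("일요일 날 알바를 간다고?").tokens)
```

Note that swapping the tokenizer changes the vocabulary, so the decoder's token embedding and output projection must be resized and retrained, which is why the tokenizer change goes hand in hand with the additional training described above.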
Quantization
As we have shared several times in previous posts, we quantize the Whisper model to a very low 1.58-bit configuration and provide it in an efficient form that can run in on-device environments. Group size is one of the factors in quantization that significantly affects both model performance and model size. The Korean Whisper model introduced in this post sets the group size equal to the channel size — a choice we arrived at through sustained experimentation to meet Pareto optimality across accuracy, model size, memory, and speed.
Quantization is the process of converting a model’s floating-point weights to integers. To map widely distributed real values to integers, we need a scale and a zero point. The scale indicates how spread out the distribution is, and the zero point indicates how much the distribution is shifted relative to zero. However, a single scale and single zero point cannot adequately represent an entire weight distribution. In fact, if you quantize a 3×3 weight matrix like the one below using a single scale and zero point, the de-quantized weights will differ substantially from the original.
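As a concrete sketch of this per-tensor scheme (a toy NumPy example with made-up numbers; the bit width and the matrix below are purely illustrative):

```python
import numpy as np

def quantize(w, n_bits=2):
    """Map floats to integers in [0, 2**n_bits - 1] using one scale and one zero point."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(qmin - w.min() / scale)
    return np.clip(np.round(w / scale) + zero_point, qmin, qmax), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Toy 3x3 weight matrix: one large weight stretches the single scale,
# so every small weight collapses onto the same quantization level.
w = np.array([[0.01, -0.02,  0.03],
              [0.02,  0.00, -0.01],
              [0.05,  2.50, -0.04]])

q, s, z = quantize(w)
print(dequantize(q, s, z))  # the small weights all come back as 0.0
```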

This quantization error grows as the number of weights covered by a single scale increases. To mitigate it, a common approach is to split the weights into multiple groups and assign a scale and zero point to each group. The smaller each group is (i.e., the more groups you have), the lower the quantization error becomes; in the extreme case of one weight per group, each weight can be reconstructed essentially exactly. As in the example below, if you divide the same 3×3 weight matrix into three groups and compute a separate scale and zero point for each, the de-quantized weights come out much closer to the original.
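Continuing the toy sketch (this snippet reuses quantize() and w from the example above, with one group per row):

```python
def quantize_grouped(w, group_size=3, n_bits=2):
    """One scale and one zero point per group of `group_size` consecutive weights."""
    groups = w.reshape(-1, group_size)
    q, scales, zero_points = zip(*(quantize(g, n_bits) for g in groups))
    return np.stack(q), np.array(scales), np.array(zero_points)

q, s, z = quantize_grouped(w)                        # three groups = the three rows
w_hat = ((q - z[:, None]) * s[:, None]).reshape(w.shape)
print(w_hat)  # rows without the large outlier are no longer flattened to zero
```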

That said, making groups smaller and increasing their count is not always good. Because the parameters of a quantized model also include the floating-point scale and zero-point values, smaller group sizes increase the number of these parameters, thereby inflating the model size. At the same time, the number of multiply-add operations involving scale and zero point grows with the number of groups, which negatively impacts speed.
Therefore, finding an appropriate group size that balances accuracy, speed, and memory is crucial. Through sustained experiments, we confirmed that setting the group size equal to the channel size minimizes accuracy loss while achieving excellent memory efficiency and speed.
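To see why group size moves model size, here is a back-of-the-envelope estimate (the parameter count, 2-bit packing, and 16-bit scale/zero-point storage are simplifying assumptions, not the exact format our runtime uses):

```python
def quantized_size_mb(n_weights, w_bits=2, group_size=128, meta_bits=16):
    """Packed weight bits plus one scale and one zero point per group."""
    weight_bits = n_weights * w_bits
    metadata_bits = (n_weights / group_size) * 2 * meta_bits
    return (weight_bits + metadata_bits) / 8 / 1e6

# ~240M weights is roughly Whisper-Small scale; group size 768 corresponds to
# per-channel grouping for Whisper-Small's hidden size of 768.
for g in (32, 128, 768):
    print(f"group size {g:>4}: {quantized_size_mb(240e6, group_size=g):.1f} MB")
```

Smaller groups also mean more scale and zero-point multiply-adds at inference time, which is the speed cost mentioned above.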
Sample Outputs
Using the approaches above, we built a Korean ASR model that significantly outperforms the original on Korean. Below is a sample comparison between the original whisper-large-v3 (16-bit) and ENERZAi’s EZWhisper-small (1.58-bit) on the same audio. Despite being over 40× smaller (70 MB vs 3 GB), EZWhisper produces a clearly superior transcript.
Sample 1
Ground-truth: 일요일 날 알바를 간다고? 그래서 응
OpenAI Whisper-large-v3: 이러면 알바를 간다고? 그래서 응.
ENERZAi EZWhisper-small: 일요일 날 알바를 간다고? 그래서 응.
Today, we introduced ENERZAi’s EZWhisper KR model, which delivers overwhelmingly superior Korean speech recognition performance compared to the original. ENERZAi is conducting R&D in multiple directions — such as integrating a Natural Language Understanding (NLU) model with EZWhisper KR to build voice control solutions for washing machines and air conditioners, and chaining a language model with a Text-to-Speech (TTS) model to implement a full voice assistant pipeline.
As mentioned above, ENERZAi is a team that analyzes Pareto-optimal trade-offs among performance, memory, power, and speed for a variety of large models — including Whisper — in order to deliver custom, best-fit solutions that precisely meet each customer’s requirements.
Going forward, we will expand our coverage to a wider range of models — including LLMs and VLMs — to provide on-device AI solutions that achieve top-tier performance with minimal memory and power across more domains. If you have any questions or inquiries, please feel free to contact us at any time:
hanhim.chang@enerzai.com