Insight
Sungmin Woo
May 27, 2025
Hello, this is Sungmin Woo from the Business Development Team at ENERZAi.
As we’ve shared in previous posts, ENERZAi has developed Optimium, a next-generation AI inference-optimization engine that maximizes inference speed on target hardware while preserving model accuracy, and we are now commercializing the solution. In addition to deploying Optimium, we actively collaborate with clients to (1) squeeze every bit of performance from their current AI models and (2) build models that are custom-fit to their applications.
In this post, we’ll look at one of the models we’re researching for optimization: OpenAI’s Whisper, which has attracted attention for its strong performance and broad usability in speech recognition.
What is Whisper?

Whisper is an automatic speech-recognition (ASR) model released as open source by OpenAI in 2022. It supports multilingual transcription (converting speech to text) and translation (output language fixed to English). Whisper delivers robust accuracy even in noisy environments, covers many languages, and handles unusual accents or pronunciations with ease. Most notably, without any additional fine-tuning it can perform not only transcription but also translation, voice-activity detection, and language identification — making it a truly general-purpose model.
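All of these tasks are exposed through a single call in the open-source openai-whisper Python package. Below is a minimal sketch, assuming that package is installed; "speech.mp3" is just a placeholder file name, and "base" is one of the available model sizes discussed later in this post.

```python
# pip install -U openai-whisper
import whisper

model = whisper.load_model("base")

# Transcription: speech -> text in the spoken language
result = model.transcribe("speech.mp3")          # placeholder audio file
print(result["language"], result["text"])

# Translation: speech in any supported language -> English text
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])
```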
Pipeline
Whisper is built on an Encoder–Decoder Transformer architecture. It preprocesses the input audio, encodes it into a latent representation, and then generates the corresponding text. The steps are as follows (a short code sketch of the same flow appears after the list):

Source: OpenAI, Robust Speech Recognition via Large-Scale Weak Supervision
Preprocessing: Split the input audio into 30-second segments and convert each segment into a log-Mel spectrogram
💡 Mel Spectrogram: A time–frequency representation that maps the frequency axis to the Mel scale, mirroring human auditory perception.
Encoder: Transform the spectrogram into a sequence of latent vectors that capture temporal (how pronunciations change over time) and spectral (which frequencies are present) characteristics, along with semantic information.
Decoder: Generate text token by token, using both the encoder output and the tokens produced so far to predict the next word or character.
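These three stages map directly onto the lower-level API of the openai-whisper package. The sketch below (adapted from the project's README, with a placeholder audio file name) makes each step explicit:

```python
import whisper

model = whisper.load_model("base")

# Preprocessing: load the audio, pad/trim it to 30 seconds,
# and compute a log-Mel spectrogram
audio = whisper.load_audio("speech.mp3")         # placeholder file name
audio = whisper.pad_or_trim(audio)               # exactly 30 s of samples
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# (Optional) language identification from the spectrogram
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Encoder + Decoder: decode() runs the encoder on the mel input and then
# generates text tokens autoregressively with the decoder
options = whisper.DecodingOptions(fp16=False)    # fp16=False for CPU-only machines
result = whisper.decode(model, mel, options)
print(result.text)
```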
Multilingual & Multitask Model
Whisper is not necessarily the top performer on every language or task, but its special strength lies in handling multiple languages and tasks with a single model. In zero-shot settings — predicting on data never seen during training — it generalizes well across datasets and domains. Two factors make this possible:
Massive, Diverse Training Data
Whisper was trained on 680,000 hours of speech–text pairs, 17 % of which (117,113 hours) are non-English. These non-English samples span 98 languages, so in total Whisper can handle 99 languages.

Source: OpenAI, Robust Speech Recognition via Large-Scale Weak Supervision
Moreover, Whisper is notably resilient to noise and uncommon accents. This stems from its use of weakly-supervised learning. Much of the data came from YouTube and podcasts; many transcripts were machine-generated captions rather than hand-checked subtitles. Such weakly-labeled data inevitably contain errors, yet training on them endowed Whisper with robustness to varied languages, accents, and recording conditions.
To keep noisy labels from degrading accuracy, OpenAI applied several filters (a rough code sketch follows the list):
Language mismatch filter — drop clips where audio and subtitle languages differ.
Partial transcription removal — drop clips with large duration mismatch.
Deduplication — remove repeated audio or captions.
Click-bait text removal — delete captions like “Please like and subscribe”.
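As a rough illustration only, such filters might look like the function below. The field names, thresholds, and click-bait phrases are illustrative assumptions, not OpenAI's actual data pipeline.

```python
# Hypothetical sketch of the filtering heuristics above; field names and
# thresholds are assumptions, not OpenAI's actual pipeline.
CLICKBAIT_PHRASES = ("like and subscribe", "click the link in the description")

def keep_clip(clip: dict, seen_captions: set) -> bool:
    caption = clip["caption_text"].strip().lower()
    # 1. Language mismatch filter: audio and subtitle language must agree
    if clip["audio_language"] != clip["caption_language"]:
        return False
    # 2. Partial transcription removal: caption should roughly cover the audio
    if clip["caption_duration"] < 0.5 * clip["audio_duration"]:
        return False
    # 3. Deduplication: drop repeated captions
    if caption in seen_captions:
        return False
    # 4. Click-bait text removal
    if any(phrase in caption for phrase in CLICKBAIT_PHRASES):
        return False
    seen_captions.add(caption)
    return True

# Usage: keep only clips that pass every filter
clips = [
    {"audio_language": "en", "caption_language": "en", "audio_duration": 30.0,
     "caption_duration": 28.0, "caption_text": "Hello and welcome to the show."},
    {"audio_language": "en", "caption_language": "en", "audio_duration": 30.0,
     "caption_duration": 29.0, "caption_text": "Please like and subscribe!"},
]
seen: set = set()
clean = [c for c in clips if keep_clip(c, seen)]
print(len(clean))  # 1: the click-bait caption is dropped
```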
Context Tokens
During training and inference, Whisper’s decoder relies on special context tokens to control language and task:

Source: OpenAI, Robust Speech Recognition via Large-Scale Weak Supervision
Start token: <|startoftranscript|>
Language token: indicates the language of the audio (e.g., <|en|> for English)
Task token: specifies which task to perform, transcription or translation (<|transcribe|>, <|translate|>)
Timestamp token: determines whether time information should be included in the final output (<|notimestamps|>)
The decoder jointly considers (1) the context tokens, (2) the encoder output, and (3) the tokens already generated, to predict the next token.
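This token prefix can be inspected directly in the openai-whisper code base through its tokenizer. The snippet below uses that package's attribute names; Korean ("ko") is just an example language choice.

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer configured for Korean transcription (example settings)
tokenizer = get_tokenizer(multilingual=True, language="ko", task="transcribe")

# sot_sequence holds the ids of the context tokens the decoder is primed with:
# <|startoftranscript|> <|ko|> <|transcribe|>
print(tokenizer.sot_sequence)
print(tokenizer.decode(list(tokenizer.sot_sequence)))
```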
Competitiveness: Accuracy & Practicality
Exceptional Generalization
ASR quality is usually measured by Word Error Rate (WER). When Meta’s wav2vec 2.0 and Whisper Large v2 were evaluated on multiple datasets, Whisper showed 55.2 % fewer errors on average. Given that wav2vec 2.0 was fine-tuned on LibriSpeech, Whisper’s similar performance on LibriSpeech plus its clear edge elsewhere highlight the model’s superior generalization.

Source: OpenAI, Robust Speech Recognition via Large-Scale Weak Supervision
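For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The jiwer package (our choice here for illustration, not something the Whisper paper prescribes) computes it in one call:

```python
# pip install jiwer
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
# Here: 2 substitutions out of 9 reference words -> ~22.2%
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```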
Wide Language Coverage — But Uneven Accuracy
Supporting 99 languages with one model is a huge advantage, yet accuracy varies by language. The chart below (Common Voice 15 & FLEURS) shows that some languages transcribe almost flawlessly, while others see higher error rates.

Source: OpenAI, Robust Speech Recognition via Large-Scale Weak Supervision
Languages with little training data or complex structures tend to fare worse.
Various Model Sizes
Whisper is available in six sizes; four also have English-only variants.

Source: https://github.com/openai/whisper
English-only models usually outperform their multilingual counterparts — particularly tiny.en and base.en, which are noticeably stronger than tiny and base. The Large model delivers the best raw accuracy, but at the cost of high memory use and slow inference. The 2024–09 turbo release preserved Large-level accuracy with much faster responses, but it is still sizable, making deployment on resource-constrained devices difficult.
With speech recognition moving into edge devices — robots, smart appliances, wearables — interest in the smaller tiny and base models is growing.
Whisper at ENERZAi
Why can’t edge devices simply use Whisper-small instead of base or tiny? The key constraint is memory.
ENERZAi tackles this by quantizing models down to ≤ 4 bits (we’ve even achieved 1.58 bits) and accelerating the quantized layers with custom Optimium kernels. This makes high-performance models run on-device within tight memory budgets.
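Our exact quantization recipe and the Optimium kernels are covered in a later post, but as a rough intuition: 1.58 bits corresponds to ternary weights, since log2(3) ≈ 1.58. The sketch below shows generic absmean-style ternary quantization (in the spirit of BitNet b1.58), purely to illustrate the idea rather than our implementation.

```python
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a float weight tensor to {-1, 0, +1} plus a per-tensor scale.
    Illustrative absmean-style scheme, not ENERZAi's actual method."""
    scale = w.abs().mean().clamp(min=eps)       # absmean scale factor
    w_q = (w / scale).round().clamp(-1, 1)      # ternary weights
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor for comparison."""
    return w_q * scale

w = torch.randn(4, 8)                           # a toy weight matrix
w_q, scale = quantize_ternary(w)
print(w_q)                                      # entries are -1, 0, or +1
print((w - dequantize(w_q, scale)).abs().mean())  # mean quantization error
```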
In fact, at the Embedded Vision Summit on May 21–22, we demonstrated the Whisper model introduced in this article quantized to 1.58 bits, running on-device with minimal memory on both the Synaptics Astra and Renesas RZ/V2H boards.
On the Synaptics Astra board, we showcased a demo comparing the 16-bit Whisper model running with Whisper.cpp on an Arm Cortex-A73 CPU against the 1.58-bit model running with Optimium. The demo confirmed that the 1.58-bit model used less than a quarter of the memory and ran more than twice as fast as the 16-bit model.

We sincerely thank Synaptics for their extensive support from preparation to the demo presentation!
On the Renesas RZ/V2H board, we demonstrated a pipeline using the 1.58-bit Whisper Small model as a trigger for YOLOv3 running on the same board. In this setup, voice commands were converted to text by the 1.58-bit Whisper running on a Cortex-A55 CPU, and the resulting text was passed as a command to the YOLO model deployed on the DRP-AI3 accelerator to detect specified objects within the video.

As introduced above, this model was created by quantizing Whisper Small down to 1.58 bits and implementing optimized kernels with Optimium, which enabled it to run on-device on both the Synaptics and Renesas boards. We will share a separate post explaining how such extreme low-bit quantization was achieved and the specific role Optimium played in running the model successfully.
Going forward, ENERZAi plans to support lightweight and optimized versions not only of Whisper but also various other models such as LLMs and VLMs. If you have any needs related to reducing memory usage or improving inference speed for on-device deployment of high-performance AI models, please feel free to contact us anytime.