

Jaeyoon Yoo

May 8, 2024

Hello, this is Jaeyoon Yoo, CTO of ENERZAi. I’ve introduced various methods for accelerating AI inference so far. Today, I’d like to introduce another technique utilized by ENERZAi’s state-of-the-art AI inference engine, Optimium: **Mixed precision inference**.

Mixed precision refers to a method of adjusting the number of bits used to represent data in order to strike a balance between inference speed and accuracy. Let’s start by understanding what the “precision” of a number means, and then move on to the next step.

### Computers and humans recognize numbers differently

In our daily lives, we adopt the decimal system, which uses ten digits (0 to 9) to represent numbers. However, computers use the binary system, which uses only two digits (0 and 1) to represent numbers. Therefore, to express numbers in a form understandable by computers, we need to convert decimal numbers into binary numbers.

There are two methods for representing real numbers in computers (i.e., converting real numbers to binary): fixed-point and floating-point representation. Most systems use the floating-point method, which represents numbers as follows.

Floating-point representation writes a number in the form X × 2^Y. The following is an example of this representation using 32 bits.

Here, X is referred to as the mantissa (fraction part), and Y is referred to as the exponent. The number of bits allocated to each depends on the format used. In the single precision format (also called the FP32 datatype, because 32 bits are used per float), 23 bits are allocated to the mantissa and 8 bits to the exponent. In the double precision format, the so-called FP64 datatype, 52 bits go to the mantissa and 11 to the exponent. In more recent formats such as FP16, BF16, and FP8, the number of bits allocated to each part is reduced to match the total width (16 and 8 bits, respectively).
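As a rough illustration, this bit layout can be inspected directly in Python. The sketch below (using only the standard `struct` module, not any Optimium code) extracts the sign, exponent, and mantissa bits of an FP32 value and reconstructs the number from them:

```python
import struct

def fp32_bits(x: float) -> tuple:
    """Split a float's FP32 encoding into its sign, exponent, and mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (Y), stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits (X)
    return sign, exponent, mantissa

sign, exp, man = fp32_bits(0.3952)
# For normalized values: value = (-1)^sign * (1 + X / 2^23) * 2^(Y - 127)
reconstructed = (-1) ** sign * (1 + man / 2**23) * 2 ** (exp - 127)
```

Reconstructing from the bits recovers the original value up to FP32 rounding error.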

So why are there so many different representation methods, namely FP64, FP32, FP16, and BF16? The reason is that each method has its own clear advantages and disadvantages.

**Advantages of using more bits:**

It can represent a wider range of numbers. While 32 bits can represent numbers up to about 3.4×10³⁸, 64 bits can represent numbers up to about 1.7×10³⁰⁸.

Using more bits also allows a more precise representation of numbers. The figure below compares the results of representing 0.3952 in FP16 and FP8. With 8 bits there is a maximum error of 0.2, while with 16 bits the error is much smaller, below 0.005. Additionally, we can see that precision varies with how the bits are split between the exponent and mantissa, even at the same total bit width.
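The rounding error for 0.3952 can be reproduced without any special libraries: Python’s `struct` module supports the IEEE 754 half-precision (`"e"`) and single-precision (`"f"`) formats, so round-tripping a value through them shows the error each bit width introduces (a small sketch, not ENERZAi’s measurement code):

```python
import struct

x = 0.3952  # not exactly representable in any binary floating-point format

# Round-trip through IEEE 754 half precision (1 sign / 5 exponent / 10 mantissa bits)
(x16,) = struct.unpack("<e", struct.pack("<e", x))
# Round-trip through single precision (1 / 8 / 23 bits)
(x32,) = struct.unpack("<f", struct.pack("<f", x))

err16 = abs(x16 - x)  # roughly 1e-4: well below the 0.005 bound mentioned above
err32 = abs(x32 - x)  # roughly 1e-8: far smaller still
```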

**Advantages of using fewer bits:**

Intuitively, the data size decreases. Using fewer bits results in a proportional reduction in data storage requirements.

Computational time also decreases. Computers consume data to perform operations such as multiplication and addition, and when the number of bits decreases, the time required for these operations decreases as well.
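The storage saving is easy to verify. The sketch below compares the bytes needed to store the same number of weights in FP32 versus FP16, with element sizes taken from Python’s `struct` format codes (the weight count is made up for illustration):

```python
import struct

n = 1_000_000  # a hypothetical model with one million weight values

bytes_fp32 = n * struct.calcsize("f")  # 4 bytes per FP32 value
bytes_fp16 = n * struct.calcsize("e")  # 2 bytes per FP16 value

ratio = bytes_fp32 / bytes_fp16  # halving the bit width halves the storage
```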

### How are the numbers represented in machine learning?

Most machine learning frameworks, including PyTorch and TensorFlow, utilize 32-bit floating point as the default data type. This contrasts with Python’s use of 64-bit double precision as the default. (This choice seems to stem from the facts that 32-bit precision is sufficient to achieve a reasonable level of accuracy, that GPU VRAM needs to be conserved, and that FP32 computations are much faster than FP64 on GPUs.) Recently, there has been a trend towards using FP16 and FP8 on GPUs as well to support running large AI models like transformers.

However, on edge devices, most hardware adopts FP32 or FP16, since FP8 is still unsupported there. The challenge is that there is always a trade-off between accuracy and computational speed: FP32 offers superior accuracy, but its computational speed is not fast enough in most cases, while FP16 computes faster than FP32 but performs poorly in terms of accuracy.

### Can we combine the advantages of both?

That’s where mixed precision inference comes in. Since direct operations (such as addition or multiplication) between FP16 and FP32 values are not possible, mixed precision uses FP16 for certain layers of a deep learning model and FP32 for the rest. (In recent LLMs, mixed precision involves mixing FP8 and FP32 within a single layer to address I/O-bound bottlenecks; I’ll explain this further in another post.)
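To make the idea concrete, here is a minimal sketch (plain Python, not Optimium’s implementation) of a two-layer model where one layer’s weights are stored in FP16 while the other stays in full precision. The weights and the choice of which layer is “low-sensitivity” are hypothetical; values cross the precision boundary via an explicit round-trip:

```python
import struct

def to_fp16(v: float) -> float:
    """Simulate storing a value in FP16 by round-tripping through half precision."""
    return struct.unpack("<e", struct.pack("<e", v))[0]

def linear(x, w):
    """A toy fully connected layer: y_j = sum_i x_i * w[i][j]."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

# Hypothetical weights: layer 1 is assumed low-sensitivity, layer 2 sensitive
w1 = [[0.12, -0.5], [0.3952, 0.07]]
w2 = [[1.5], [-0.25]]

w1_fp16 = [[to_fp16(v) for v in row] for row in w1]  # convert layer 1 only

x = [1.0, 2.0]
h = linear(x, w1_fp16)  # FP16-stored weights: small rounding error enters here
y = linear(h, w2)       # sensitive layer kept in full precision
```

Because only the less sensitive layer was rounded, the final output stays close to the all-FP32 result.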

So, the remaining issue is: “**How do we choose layers to use FP16**?” Various methodologies have been proposed, but the underlying principle that permeates most of them is as follows:

“**Select layers with the least impact on the final output, i.e., those that are the least sensitive, and then assign them to FP16**.”

As mentioned earlier, the downside of FP16 is its lower accuracy compared to FP32. Most AI models are trained in FP32, and errors arise when converting the trained FP32 weights to FP16 (recall the error of around 0.005 when representing 0.3952). These errors affect the final output of the model. However, changes in the weights of some layers have little to no effect on the final output, while changes in the weights of other layers affect it significantly. It is therefore reasonable to convert the former layers to FP16 while keeping the latter in FP32.

To elaborate further, the influence of weights on the output refers to how much the output L = f(w_1, …, w_n) changes when the weights of a particular layer change.

Let L denote the output and w_i the weight of the i-th layer. When this weight changes by δw_i, the output changes by approximately

ΔL ≈ (1/2) · δw_iᵀ · H_{w_i} · δw_i

under a second-order approximation and the assumption that the weight is at an optimal point (so the first-order term vanishes). Here, H_{w_i} denotes the Hessian, i.e., the second derivative of L with respect to w_i, and the “sensitivity” of the weight to the output is proportional to the magnitude of this Hessian. Strictly speaking, the Hessian is a matrix, since weights are multidimensional, and the determinant of the Hessian matrix is used to measure sensitivity.
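A toy numerical example may help. Suppose the Hessian is diagonal, so each layer i has a single scalar curvature H_{w_i}, and FP16 conversion perturbs every layer’s weight by the same δw. Then the second-order estimate (1/2)·δw·H·δw ranks the layers directly (the curvature values below are made up for illustration):

```python
# Hypothetical per-layer curvatures (diagonal Hessian entries): layer 0 is the
# most curved (most sensitive), layer 2 the flattest (least sensitive).
curvatures = [10.0, 0.5, 0.01]
dw = 0.005  # FP16 rounding perturbation, same order as the 0.3952 example above

# Second-order estimate of the output change for each layer: 0.5 * dw * H * dw
sensitivities = [0.5 * dw * h * dw for h in curvatures]

# Least-sensitive layers first: these are the best candidates for FP16
ranking = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
```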

In summary, in mixed precision inference, we **measure the sensitivity of layers using the Hessian and convert the layers with the least sensitivity to FP16**. Since the accuracy of the AI model decreases as more layers are converted to FP16, the goal of mixed precision inference is to convert as many layers as possible to FP16 within a predefined accuracy tolerance.

While the underlying principle remains the same, there are various practical algorithms. Since computing the Hessian exactly is computationally expensive, there are various methods to approximate sensitivity, such as using Hessian-vector products or bypassing the Hessian entirely with first-order approximations. Additionally, once the number of layers reaches the hundreds, finding the combination of layers that minimizes accuracy changes becomes a hard combinatorial problem, so search algorithms such as binary search and greedy search are used. ENERZAi is continuously researching and experimenting to find the most effective and practical ways to apply mixed precision inference.
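As one concrete (simplified) instance of such a search, a greedy scheme might sort layers by their approximate sensitivity and convert them to FP16 until an accuracy budget is exhausted. This sketch treats sensitivities as additive, which is itself an approximation, and the scores are hypothetical:

```python
def greedy_fp16_selection(sensitivities, tolerance):
    """Pick layers for FP16, least sensitive first, while the accumulated
    (approximate) accuracy impact stays within the given tolerance."""
    order = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    chosen, budget = [], 0.0
    for i in order:
        if budget + sensitivities[i] > tolerance:
            break  # converting this layer would exceed the accuracy budget
        budget += sensitivities[i]
        chosen.append(i)
    return chosen

# Hypothetical per-layer sensitivity scores for a five-layer model
scores = [0.30, 0.01, 0.05, 0.02, 0.40]
fp16_layers = greedy_fp16_selection(scores, tolerance=0.10)  # layers 1, 3, 2
```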

### The Impact of Mixed Precision was Significant

With mixed precision, the ENERZAi team has achieved significant AI model acceleration while minimizing accuracy loss. The model in question is a pose landmark estimation model with around 70 layers, and the following figure illustrates the sensitivity of each layer.

The x-axis represents the layer index: layers close to 0 are located near the input, and those on the right are located near the output. The y-axis represents sensitivity; consider relative sizes rather than specific numerical values, since there is no unit. In this model, layers near the input tend to exhibit higher sensitivity, indicating that it would be effective to convert the layers near the output to FP16. This seems to originate from the fact that, as a landmark detection model, much of its computation is concentrated in the early layers.

Due to the tight tolerance we set for acceptable accuracy changes, only about 40% of the layers were converted to FP16. The result is as follows:

In the figure, the green line represents the landmarks recognized by the original FP32 model, while the blue line represents the landmarks recognized by the model after the precision change. The left side shows the results using Mixed Precision, while the right side shows the results when all layers are converted to FP16. As the figure shows, simply converting all layers to FP16 results in a significant decrease in accuracy, while the accuracy remains almost unchanged when using Mixed Precision.

The table below quantitatively measures the accuracy of the two models using two accuracy metrics: NME (Normalized Mean Error, lower is better) and PCK (Percentage of Correct Keypoints, higher is better). (The results were measured with the FP32 model’s output set as the ground truth.)

Mixed Precision showed significant improvement in terms of latency as well.

With Mixed Precision, we observe an approximately **12% acceleration** in computational speed compared to the original FP32 model. While the computational speed of Mixed Precision is expectedly slower than that of a fully FP16 model, in this case the accuracy loss of the fully FP16 model exceeds an acceptable level, making that comparison less meaningful.

Optimium, the cornerstone of ENERZAi’s inference optimization technology, not only incorporates Mixed Precision but also supports various optimization techniques such as fusion and auto-tuning, ultimately enabling an acceleration of computational speed by more than 1.5×!

### Run Optimium, run!🏃

Optimium is currently undergoing beta testing and has attracted interest from various companies thanks to its superior performance over existing inference optimization engines across various hardware environments. Intense research and development are ongoing, but Optimium has already demonstrated outstanding performance, showing notably faster inference than widely used solutions such as TensorFlow Lite with XNNPACK.

We’ll continue to upload posts about the exciting technologies applied in Optimium, so please stay tuned. If you’re interested in participating in the ongoing Optimium Beta Test, please apply through the link below! 👉