

Sungmin Woo

March 24, 2024

Hi, everyone! We’ve been uploading posts on various topics, ranging from conceptual introductions to AI optimization and on-device AI to more in-depth looks at well-known inference optimization frameworks such as TFLite. All of these posts covered the fundamentals needed to understand the trends of the Edge AI market, along with the insights our R&D team has gained through years of research on model inference optimization.

In this post, we will share how our AI model inference optimization engine, ‘Optimium’, maximizes inference speed on the target hardware while preserving the accuracy of AI models.

We’ll be sharing some example code below, written as simply structured Python for easier understanding. We hope this article guides you through what exactly AI inference optimization is and why Optimium could be the best tool to make it happen 💫

### Why is inference optimization necessary?

Inference optimization refers to the broad set of processes aimed at improving the performance of AI model inference in the target hardware environment. At its essence, however, it comes down to implementing code that performs the necessary operations as quickly and efficiently as possible.

The importance of inference optimization stems from the fact that **even the same operations can exhibit significant differences in inference speed depending on how they are implemented**.

We’ll illustrate this with a simple example. Below are two pieces of code that calculate Z=|X+Y| for given matrices X and Y:

<code 1>

<code 2>
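Since the original snippets aren’t reproduced here, below is a minimal sketch of what the two versions likely look like. The function names and array contents are our own illustration; the point is the loop structure:

```python
import numpy as np

# [code 1]-style: two separate loops — the first computes X+Y,
# the second takes the absolute value of the intermediate result
def abs_add_two_loops(X, Y):
    Z = np.empty_like(X)
    for i in range(len(X)):   # loop 1: store X[i]+Y[i] into Z
        Z[i] = X[i] + Y[i]
    for i in range(len(X)):   # loop 2: load Z[i] back, store |Z[i]|
        Z[i] = abs(Z[i])
    return Z

# [code 2]-style: a single fused loop — |X+Y| computed in one pass,
# with no intermediate load from Z
def abs_add_one_loop(X, Y):
    Z = np.empty_like(X)
    for i in range(len(X)):   # one store per element, no extra load
        Z[i] = abs(X[i] + Y[i])
    return Z

X = np.array([1.0, -2.0, 3.0])
Y = np.array([-4.0, 5.0, -6.0])
print(abs_add_one_loop(X, Y))  # [3. 3. 3.]
```

The benchmark in the post was presumably run on much larger matrices than this toy input; the fused version wins because it touches memory fewer times per element.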

The difference between the two codes can be summarized as follows.

[code 1] : Consists of two loops; one calculating X+Y and one taking the absolute value.

[code 2] : Performs X+Y and absolute value in a single loop.

The results of both codes are exactly the same. However, while [code 1] takes **35 seconds** to execute, [code 2] can calculate Z=|X+Y| within **21 seconds** (on AMD Ryzen9 7950x).

**Simply fusing the two loops into one accelerates the computation by more than 1.6 times!**

*💡 <Advanced>*

*In general, the inference speed of AI models is determined by computing speed (how fast operations are performed) and memory access speed (how fast data is stored in and retrieved from memory). This example is a case where we achieved faster inference by reducing the time wasted on accessing memory to store and retrieve intermediate results.*

*Storing information in memory is called a Store, and retrieving information from memory is called a Load. Counting accesses to Z, [code 1] requires 2 stores and 1 load per element, i.e. 3 memory accesses in total, while [code 2] performs the same operation with only 1 memory access (a single store).*

Above, we’ve explored how optimized code can improve the inference speed of AI models. Now, let’s dive into the details of some major optimization techniques applied in Optimium and how Optimium performs inference optimization.

### Optimization in Optimium

##### SIMD (Single Instruction Multiple Data)

SIMD stands for Single Instruction Multiple Data: an optimization technique that processes multiple data elements at once with a single instruction, significantly reducing the number of required operations compared to SISD (Single Instruction Single Data) and thereby improving inference speed. The figure below illustrates the number of operations required by SISD and SIMD to perform the same task: adding vectors of 8 numbers.

In SISD (left figure), only one data element can be processed at a time, so a total of **8 additions** must be performed. In SIMD (right figure), which processes 4 elements at once, the computation completes with **only 2 additions**. Of course, the resulting values are the same! (Although there may be slight floating-point differences.)

The Python NumPy code below may help you understand better. Both snippets perform a 2D convolution. [code 3] performs multiplication and addition for **each component**, while [code 4] **reads consecutive values and performs the operation on all of them at once**, resulting in **much faster 2D convolution** with fewer operations. Those familiar with NumPy might already use code similar to [code 4] frequently. (You’ve been using SIMD without knowing it!)

<code 3>

<code 4>
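As the original snippets aren’t shown here, the following is a sketch of what the two styles likely look like (function names are our own). The scalar version multiplies and accumulates one component at a time; the vectorized version hands NumPy a whole contiguous patch so the backend can use SIMD instructions:

```python
import numpy as np

# [code 3]-style: scalar multiply-accumulate, one component at a time
def conv2d_scalar(inp, kernel):
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):   # one multiply + one add per element
                    acc += inp[i + di, j + dj] * kernel[di, dj]
            out[i, j] = acc
    return out

# [code 4]-style: read a whole contiguous patch and let NumPy
# process it in one vectorized (SIMD-backed) operation
def conv2d_vectorized(inp, kernel):
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out
```

Both functions return identical results; only the access pattern differs.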

However, SIMD cannot be used in every situation: it is applicable only when the data is **contiguous** in memory. For example, the interval [0,16] can be divided into 4 contiguous chunks [0,4], [4,8], [8,12], [12,16] to reduce the number of operations via SIMD, but SIMD cannot be applied to scattered data like (0,1,5,6) and (2,3,7,8).
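You can see this distinction directly in NumPy. A slice of a contiguous array is a contiguous view, while fancy indexing with scattered positions forces NumPy to gather the elements into a new copy first (a small illustrative example, not from the original post):

```python
import numpy as np

x = np.arange(16)

# Contiguous slices: views over adjacent memory — SIMD-friendly
chunks = [x[0:4], x[4:8], x[8:12], x[12:16]]
print(all(c.flags["C_CONTIGUOUS"] for c in chunks))  # True

# Scattered positions: NumPy must gather the elements into a copy
scattered = x[[0, 1, 5, 6]]
scattered[0] = 99
print(x[0])  # 0 — the original is untouched, so scattered is a copy, not a view
```

The gather step is exactly the extra work SIMD hardware cannot avoid for non-contiguous data.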

Furthermore, since the amount of data that can be processed at once and the available SIMD features (instruction sets) vary by hardware, it is crucial to apply SIMD in a form suited to the target hardware environment. The best-known SIMD instruction sets for major hardware architectures are as follows.

x86_64: SSE (128-bit), AVX/AVX2 (256-bit), AVX-512 (512-bit)

ARM: Neon (128-bit)

##### Unroll (Loop Unrolling)

Unrolling is an optimization technique that **expands loops within the code** to reduce overhead and the number of memory accesses, thereby enhancing inference speed. For a detailed explanation of how unrolling improves inference speed, see below.

**Loop Overhead Reduction**

Although loops are a very convenient feature, they incur overhead such as condition checks, synchronization, and index increments on each iteration. Unrolling reduces the number of iterations and thus the time wasted on this overhead.

**Memory Access Reduction**

As mentioned above, overall inference time is affected not only by computation speed but also, significantly, by the time it takes to read from and write to memory. Therefore, **reducing the number of memory accesses can accelerate inference.** The example below compares the number of memory accesses for a 2x2-kernel 2D convolution (with in channel = out channel = 1, though this is not significant here). [Figure 2] shows the original code without unrolling, while [Figure 3] shows the code unrolled by 2.

The areas of the input that must be accessed in each loop iteration are marked in dark blue: when i=0, j=0, the top-left 2x2 area is read, and when i=0, j=1, the window moves one cell to the right.

As the loop iterates, most input elements end up being read 4 times each.

In [Figure 3], the loop over j is unrolled by 2, adding 4 lines to the loop body and increasing the number of inputs read per iteration. Since `input[i, j+1, 0]` and `input[i+1, j+1, 0]` each appear twice in the loop body, the range of inputs accessed per iteration is 2x3 rather than 2x4.

Therefore, the area accessed in each iteration is the top-left 2x3 area when i=0, j=0, and the 2x3 area shifted two cells to the right when i=0, j=2.

To summarize, the original code in [Figure 2] requires **64 memory accesses**, whereas the unrolled code in [Figure 3] needs **only 48 memory accesses**.
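Since [Figure 2] and [Figure 3] are images, here is a sketch in Python of what the two loop structures likely look like for a 2x2 kernel (names and layout are our own; the channel dimension is dropped since it equals 1). In the unrolled version, each iteration computes two outputs from a shared 2x3 patch, so the overlapping middle column is read once instead of twice:

```python
import numpy as np

# Original loop, as in [Figure 2]: each iteration reads a full 2x2 patch
def conv2x2(inp, kernel):
    H, W = inp.shape
    out = np.zeros((H - 1, W - 1))
    for i in range(H - 1):
        for j in range(W - 1):
            out[i, j] = (inp[i, j]     * kernel[0, 0] + inp[i, j + 1]     * kernel[0, 1] +
                         inp[i + 1, j] * kernel[1, 0] + inp[i + 1, j + 1] * kernel[1, 1])
    return out

# Unrolled by 2 over j, as in [Figure 3]: one iteration loads a 2x3 patch
# and produces two output columns (assumes the output width is even)
def conv2x2_unroll2(inp, kernel):
    H, W = inp.shape
    out = np.zeros((H - 1, W - 1))
    for i in range(H - 1):
        for j in range(0, W - 1, 2):  # step of 2: two outputs per iteration
            a, b, c = inp[i, j],     inp[i, j + 1],     inp[i, j + 2]
            d, e, f = inp[i + 1, j], inp[i + 1, j + 1], inp[i + 1, j + 2]
            out[i, j]     = a * kernel[0, 0] + b * kernel[0, 1] + d * kernel[1, 0] + e * kernel[1, 1]
            out[i, j + 1] = b * kernel[0, 0] + c * kernel[0, 1] + e * kernel[1, 0] + f * kernel[1, 1]
    return out
```

The shared reads `b` and `e` are exactly the two loads that the unrolled version saves per iteration.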

*💡 <Advanced>*

*Of course, unrolling is no panacea! Excessive unrolling can cause register shortage, which can eventually lead to register spilling. Given that register spilling can drastically decrease inference speed, it is very important to find the optimal unroll number. Although finding this ‘optimal unroll number’ can be very challenging, Optimium finds it with its ‘Auto-tuning’ capability. We will discuss this ‘Auto-tuning’ function in following posts.*

So far, we’ve explored two prominent optimization techniques applied in Optimium: SIMD and unrolling. We’ll also be sharing various other optimization techniques in the near future, so please stay tuned!

### Why Optimium?

SIMD and unrolling may sound like clichés to those already interested in AI inference or code optimization. However, successfully implementing these techniques inside an inference optimization engine remains a highly challenging task, because the optimal code **varies depending on a number of factors, including but not limited to algorithm type, CPU specifications, and loop parameters.** For a detailed explanation of the factors influencing the inference optimization process, see below.

**Algorithm Types:** Even for the same operation, different algorithms can be used, leading to different implementations. For the convolution operation alone, there are at least four:

Winograd algorithm

FFT-based algorithm

Matmul algorithm

Indirect buffer based GEMM algorithm

**CPU Specifications:** The same algorithm can perform differently depending on hardware specifications such as:

Cache size

Instruction set

Memory bandwidth

**Loop Parameters:** Loop parameters such as the following can also affect inference performance:

Unroll number

Vector size

Tile size

**Considering all these variables when optimizing code manually is nearly impossible**, so most inference engines ship manually optimized code tailored only to specific scenarios (specific algorithm types, CPU specifications, heuristically selected loop parameters, etc.).

Below are images of the GitHub source listings of two open-source inference engines: XNNPACK (the most widely known inference engine, used as a backend by TensorFlow, TensorFlow Lite, PyTorch, etc.) and Tencent NCNN. They’ve each implemented over 20 source files to support the Transpose and Convolution operations, because optimized code must be written manually, one variant at a time, based on factors such as 1) whether the algorithm involves floating-point or integer arithmetic, 2) whether Neon instructions are used, and 3) how the unroll number is set.

The problem is that even with over 20 source files, the result **still remains suboptimal**, because the optimal code changes as factors such as input shape, kernel size, stride, and CPU specifications change. Would it be possible to manually optimize code for each and every possible scenario? **Absolutely not!**

**However, Optimium is different.** We’ve developed our own programming language, **‘Nadya’**, to build Optimium, and **because Nadya is a metaprogramming language**, Optimium can provide automatic inference optimization **covering a much broader area than conventional inference engines.** For those unfamiliar with the concept of metaprogramming, think of it as code that generates other code automatically. (We’ll soon discuss Nadya and metaprogramming more deeply in following posts!)

The image below illustrates how Optimium generates optimal code for the operation c=a+b in various scenarios. When a user writes code (see the top of [Figure 5]), Optimium compiles it and generates optimized code for every scenario, including but not limited to unroll number=2 (bottom left) and unroll number=3 (bottom right).
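To give a flavor of the metaprogramming idea, here is a toy illustration in Python (this is not Nadya, and the function names are our own): a generator function that *emits* differently-unrolled source code for c=a+b, which is then compiled and executed.

```python
# Toy metaprogramming sketch: a function that generates unrolled source code
def gen_add(unroll: int, n: int) -> str:
    lines = [f"def add_u{unroll}(a, b, c):"]
    lines.append(f"    for i in range(0, {n}, {unroll}):")
    for k in range(unroll):  # emit one statement per unrolled step
        lines.append(f"        c[i + {k}] = a[i + {k}] + b[i + {k}]")
    return "\n".join(lines)

# Generate, compile, and run the unroll-by-2 variant
namespace = {}
exec(gen_add(unroll=2, n=8), namespace)

a, b, c = [1] * 8, [2] * 8, [0] * 8
namespace["add_u2"](a, b, c)
print(c)  # [3, 3, 3, 3, 3, 3, 3, 3]
```

Changing the `unroll` argument yields a different specialized function without any hand-written variants, which is the essence of what a metaprogramming-based engine automates across algorithms, hardware, and loop parameters.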

### Optimium!

Optimium, developed by combining the various inference optimization techniques above, is currently in beta testing. It has garnered significant attention from various companies thanks to its superior performance compared to alternative inference engines across diverse hardware environments. While intense research and development is still ongoing, Optimium already boasts remarkable performance, exhibiting over 1.5 times faster inference speed than XNNPACK.

For those who are interested in exploring more diverse performance results of Optimium in various environments, please refer to

Optimium covers far more optimization techniques than those illustrated above. It leverages other technologies such as operation fusion, mixed-precision quantization, and more to make the best out of every AI model! We’ll cover these in following posts, so please stay tuned, and meanwhile don’t forget to sign up via the link below if you’re interested in trying out the Optimium beta.