Today, I will look at Operator Fusion, a technique for optimizing and accelerating the overall computation of AI models.
Sewoong Moh
April 22, 2024
Hello everyone. This is Sewoong Moh from the Optimium team, where we are developing an AI inference optimization engine. As mentioned in the previous post, Optimium is a powerful inference engine that supports easy and fast deployment of optimal AI models for AI developers. As you may know, Optimium uses various methods to provide users with faster inference speeds. Today, I will look at Operator Fusion, which optimizes and accelerates the overall computation of AI models.
What is Operator Fusion?
Operator Fusion refers to the merging of different layers (or operators) into a single layer. We can understand the benefits of merging operations from different layers into one through examples. The following is an example where an Add operation is performed after Mul.
In the scenario where fusion is not applied as described above, assuming that the shapes of each Tensor (a, b, c) are n*n*n*n, operations would iterate a total of 2*n⁴ times. In addition, since each of the two operations reads two tensors and writes one, accesses to inputs and outputs would occur 2*(2*n⁴ + n⁴) = 6*n⁴ times. What would happen if Mul and Add were fused? The two loops collapse into a single loop computing a*b+c, so operations iterate only n⁴ times, and input/output accesses drop to 3*n⁴ + n⁴ = 4*n⁴, because the intermediate result of a*b never leaves the registers.
Additionally, data access becomes faster because values are already loaded into registers or caches. Moreover, the a*b+c computational pattern gives the compiler the opportunity to optimize with FMA instructions, further enhancing speed. These examples illustrate the advantages of Operator Fusion (see the code sketch after the list below):
Reduced loop iterations
Decreased memory access frequency
Additional optimization opportunities
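To make the arithmetic above concrete, here is a minimal C++ sketch of the unfused and fused versions of the Mul-Add example. The tensors are flattened to plain arrays, and the names are illustrative, not Optimium's actual code:

#include <cstddef>
#include <vector>

// Unfused: two loops and a full-size temporary.
// Per element: Mul reads a, b and writes tmp; Add reads tmp, c and writes out.
// Total: 2*n^4 iterations and 6*n^4 memory accesses.
void mul_then_add(const std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& c, std::vector<float>& out) {
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] * b[i];    // Mul
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = tmp[i] + c[i];  // Add
}

// Fused: one loop, no temporary; a[i]*b[i] stays in a register.
// Total: n^4 iterations and 4*n^4 memory accesses.
void fused_mul_add(const std::vector<float>& a, const std::vector<float>& b,
                   const std::vector<float>& c, std::vector<float>& out) {
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i] + c[i];  // a*b+c: a candidate for an FMA instruction
}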
💡 <FMA Instruction>
This is an instruction provided by processor architectures to perform the calculation of a * b + c in a single operation instead of two separate operations, one for multiplication and one for addition.
By reducing the number of operations, it enhances the speed and improves calculation accuracy by minimizing rounding errors.
Additionally, most architectures also provide SIMD instructions for FMA, allowing for maximum performance gains when used.
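As an illustration, both the scalar and SIMD forms are accessible from C++. This is a minimal sketch for x86 (compile with -mfma); Arm NEON provides an equivalent such as vfmaq_f32:

#include <cmath>        // std::fma
#include <immintrin.h>  // x86 AVX/FMA intrinsics

// Scalar FMA: computes a*b + c in one instruction with a single rounding step.
float scalar_fma(float a, float b, float c) {
    return std::fma(a, b, c);
}

// SIMD FMA: eight single-precision fused multiply-adds per instruction.
__m256 simd_fma(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}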
Challenges of Fusion
The advantages of Operator Fusion are undeniable. As a result, other inference engines such as TFLite, XNNPACK, and OpenVINO already use Operator Fusion. However, unlike Optimium, which supports metaprogramming, these engines are limited to performing only basic Operator Fusion because they must implement all operator source code directly. For example, due to this inherent limitation, XNNPACK only supports fusing a limited set of activations, such as ReLU and ReLU6 after Convolution. Below, I will explain the specific reasons for this unavoidable limitation.
💡 <Summary of "Challenges in Fusion" for those short on time>
Implementing fused layers one at a time cannot cover different combinations of fusible operations and makes it difficult for developers to manage the system, leading to poor maintainability.
Implementing fusion at runtime is inefficient.
Fusion involves various combinations of operations, depending on the layer patterns that appear in the model architecture. The most common form is when activations are fused after Convolution or Fully-Connected Layers. Additionally, there are often cases where Element-wise Binary Operations (such as Add, Mul, Sub, Div, etc.) appear in sequence.
Not only the types of operations used in fusion, but also the number and order of operations, can vary. Fusion is not limited to just two operations. The number of operations fused can also be three or four, as illustrated below.

Writing code in advance for every possible combination of operation types, counts, and orderings is simply impossible. For illustration, with just 10 fusible operation types and chains of up to four operations, there are already 10 + 10² + 10³ + 10⁴ = 11,110 distinct fused kernels to implement. Therefore, most inference engines restrict the types of fusion allowed, permitting only a few specific combinations.

Second, implementing fusion for different combinations of operations is challenging, but managing the code once it's implemented is equally daunting. Because fusion involves many combinations of operations, the implementation of a single operation ends up repeated across multiple fused layers (as shown in the figure below with Add). If a problem is discovered in the implementation of an operation such as Add and requires modification, each occurrence of Add across the various combinations must be located and modified individually. Such challenges can compromise the stability of future operations and make package management cumbersome.

To avoid the complexity of composing fusion layers with different combinations, there's a method of performing fusion dynamically at runtime, based on the situation in which layers perform operations. However, this approach introduces a performance overhead. Let's consider an example of performing fusion after a Conv2D layer. During the execution of a layer, one can dynamically check for fusion opportunities after Conv2D using conditional statements (if), and perform the appropriate operation based on the type of layer. If there are multiple layers eligible for fusion, a loop (for) would be required to iterate through them. However, combining conditional statements and loops in the hot path at runtime is highly inefficient. Performing fusion this way can forfeit the performance gains of fusion due to the overhead these conditionals and loops introduce.
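The following hypothetical C++ sketch (names and structure are illustrative, not taken from any engine's actual code) shows where the overhead comes from: every output element pays for an inner loop over the fused ops and a branch per op:

#include <cstddef>
#include <vector>

enum class OpKind { Add, Mul, Relu };

// Runtime fusion applied to the output of a Conv2D layer: the fused ops are
// only known at runtime, so the hot loop carries a nested loop and a branch.
void apply_runtime_fusion(float* out, std::size_t n,
                          const std::vector<OpKind>& fused_ops,
                          const std::vector<const float*>& operands) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = out[i];  // the Conv2D result for element i
        for (std::size_t k = 0; k < fused_ops.size(); ++k) {  // per-element loop
            switch (fused_ops[k]) {                           // per-element branch
                case OpKind::Add:  v += operands[k][i];     break;
                case OpKind::Mul:  v *= operands[k][i];     break;
                case OpKind::Relu: v = v > 0.0f ? v : 0.0f; break;
            }
        }
        out[i] = v;
    }
}

The per-element branch and inner loop also typically defeat vectorization, compounding the cost described above.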

Fusion and Nadya
To overcome the above limitations and achieve maximum performance, Optimium uses a proprietary language called Nadya, which supports Metaprogramming, to implement layers. The process can be summarised as follows.

At the Code Generation stage, information about the model, in particular the details of each layer, is passed to Nadya modules, which are responsible for generating code for each layer. Each Nadya module produces optimized code that can be used at runtime based on this information. This process of generating runtime code through code generation is called Metaprogramming.
In Optimium, we developed a Fusion Module exploiting Nadya, which allows for a more flexible use of fusion functionalities than the previous limited capabilities. Broadly, Fusion in Optimium aims to achieve the following objectives:
Implementing fused layers one by one cannot cover the various combinations of fusible operations and makes the system difficult for developers to maintain.
→ Near-unlimited fused-layer support through Metaprogramming.
Implementing fusion at runtime is inefficient.
→ Fusion is performed during code generation, ensuring efficient operations at runtime.
We divide fusion into two kinds based on the relative positions of layers in the connected layer graph. For example, in the sequence Add-Conv-Mul-ReLU, taking the convolution (Conv) as the most computationally expensive operation, fusion can be divided into Pre-Fusion for the Add before Conv, and Post-Fusion for the Mul-ReLU after Conv.
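A minimal sketch of this split, in hypothetical C++ (Optimium's actual implementation lives in Nadya modules):

#include <algorithm>
#include <string>
#include <vector>

struct FusionPlan {
    std::vector<std::string> pre;   // Pre-Fusion: ops before the anchor
    std::vector<std::string> post;  // Post-Fusion: ops after the anchor
};

// Split a layer sequence around its most computationally expensive ("anchor") op.
FusionPlan split_fusion(const std::vector<std::string>& seq, const std::string& anchor) {
    FusionPlan plan;
    auto it = std::find(seq.begin(), seq.end(), anchor);
    plan.pre.assign(seq.begin(), it);
    if (it != seq.end()) plan.post.assign(it + 1, seq.end());
    return plan;
}

// split_fusion({"Add", "Conv", "Mul", "ReLU"}, "Conv")
//   -> pre = {"Add"} (Pre-Fusion), post = {"Mul", "ReLU"} (Post-Fusion)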

In Optimium, during the code generation phase, the fusion information divided as described above is passed from the Graph Module to the Layer and finally to the Fusion Module. Based on this information, the Fusion Module generates runtime code for each fused operation, which is ultimately integrated into the fused layer's runtime code.
Let us consider the example Mul-Add-ReLU. In Optimium's Graph Module, when generating code for the Mul layer, information about the Mul layer is provided, along with information about fusion with Add-ReLU. During the Mul layer's code generation, this fusion information is passed to the Fusion Module. Using it, the Fusion Module generates code for the fused operations, which is then passed back to the Mul layer. The Mul layer integrates the fused code into its own generated code. This process allows Optimium to efficiently generate runtime code for the fused Mul-Add-ReLU.
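The following hypothetical C++ sketch illustrates the idea. The real Fusion Module emits Nadya code, but the principle of folding fused ops into the base layer's loop body at code-generation time is the same:

#include <string>
#include <vector>

// Fold a chain of fused ops into a single expression string at code-generation
// time. The base layer embeds the result in its loop, so the runtime code
// contains no dispatch logic at all.
std::string fuse_expression(std::string expr, const std::vector<std::string>& fused) {
    for (const std::string& op : fused) {
        if (op == "Add")       expr = "(" + expr + " + add_in[i])";
        else if (op == "Mul")  expr = "(" + expr + " * mul_in[i])";
        else if (op == "ReLU") expr = "fmaxf(" + expr + ", 0.0f)";
    }
    return expr;
}

// Generate the runtime loop for a Mul layer with fused successors.
std::string generate_mul_layer(const std::vector<std::string>& fused) {
    std::string body = fuse_expression("(a[i] * b[i])", fused);  // the Mul itself
    return "for (int i = 0; i < n; ++i) out[i] = " + body + ";";
}

// generate_mul_layer({"Add", "ReLU"}) yields:
//   for (int i = 0; i < n; ++i) out[i] = fmaxf(((a[i] * b[i]) + add_in[i]), 0.0f);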

More generally, the Graph Module can generate fusion information for arbitrary combinations of layers. Based on this information, when generating code for each layer, the Fusion Module integrates the code generated for fused layers, effectively adapting to different layers and fusion combinations. Implementation for fused layers only needs to be done in the Fusion Module, eliminating the need to implement it for each individual layer, and any future changes only need to be made in the Fusion Module. This solves the first problem with fusion, where it was impractical to accommodate numerous fusion combinations and maintainability was a challenge. In addition, the Fusion Module generates fused code during code generation rather than at runtime, so the actual runtime code contains no conditional statements or dispatch loops and executes efficiently. This solves the second problem of inefficient fusion at runtime.

Why Optimium?
Exactly! In Optimium, leveraging our in-house language Nadya enables us to harness the benefits of metaprogramming to address the shortcomings of traditional fusion methods. As a result, we can support a broader range of fusion capabilities compared to existing inference engines while still providing efficient runtime performance.

The graph above illustrates the performance improvement from applying Fusion in Optimium. As you can see, Fusion accelerates performance by approximately 1.3 times. This improvement is achievable because Optimium supports not only basic fusions like Convolution+ReLU, which other inference engines also support, but also a variety of other operations such as binary operators and padding, with fusion extending to chains of 3–4 operations. While this measurement was taken on a Raspberry Pi 5, similar improvements can be observed on other devices as well.
Run Optimium, run!
Currently in beta testing, Optimium has attracted the attention of several companies due to its superior performance compared to existing inference optimisation engines across different hardware environments. Designed as a comprehensive collection of different optimisation techniques, not limited to Operator Fusion as described today, Optimium has already demonstrated superior inference speeds compared to widely used engines such as TensorFlow Lite and XNNPACK. If you are interested in experiencing Optimium first hand, please feel free to contact us at any time using the beta sign-up link below.
https://wft8y29gq1z.typeform.com/to/Sv9In4SI

Life is too short, you need Optimium