Today, I will look at Operator Fusion, a technique for optimizing and accelerating the overall computation of AI models.
Sewoong Moh
April 22, 2024
Hello everyone. This is Sewoong Moh from the Optimium team, where we are developing an AI inference optimization engine. As mentioned in the previous post, Optimium is a powerful inference engine that supports easy and fast deployment of optimal AI models for AI developers. As you may know, Optimium uses various methods to provide users with faster inference speeds. Today, I will look at Operator Fusion, which optimizes and accelerates the overall computation of AI models.
What is Operator Fusion?
Operator Fusion refers to the merging of different layers (or operators) into a single layer. We can understand the benefits of merging operations from different layers into one through examples. The following is an example where an Add operation is performed after Mul.
In the scenario where fusion is not applied as described above, assuming that the shapes of each Tensor (a, b, c) are n*n*n*n, operations would iterate a total of 2*n⁴ times. In addition, since each of the two operations reads two tensors and writes one, accesses to inputs and outputs would occur 2*(2*n⁴ + n⁴) = 6*n⁴ times. What would happen if Mul and Add were fused? The two loops collapse into a single loop computing a*b+c, so operations iterate only n⁴ times, and input/output accesses drop to 3*n⁴ + n⁴ = 4*n⁴, because the intermediate result of a*b never leaves the registers.
Additionally, data access becomes faster because values are already loaded into registers or caches. Moreover, the a*b+c computational pattern gives the compiler the opportunity to optimize with FMA instructions, further enhancing speed. These examples illustrate the advantages of Operator Fusion (see the code sketch after the list below):
Reduced loop iterations
Decreased memory access frequency
Additional optimization opportunities
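To make the arithmetic above concrete, here is a minimal C++ sketch of the unfused and fused versions of the Mul-Add example. The tensors are flattened to plain arrays, and the names are illustrative, not Optimium's actual code:

#include <cstddef>
#include <vector>

// Unfused: two loops and a full-size temporary.
// Per element: Mul reads a, b and writes tmp; Add reads tmp, c and writes out.
// Total: 2*n^4 iterations and 6*n^4 memory accesses.
void mul_then_add(const std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& c, std::vector<float>& out) {
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] * b[i];    // Mul
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = tmp[i] + c[i];  // Add
}

// Fused: one loop, no temporary; a[i]*b[i] stays in a register.
// Total: n^4 iterations and 4*n^4 memory accesses.
void fused_mul_add(const std::vector<float>& a, const std::vector<float>& b,
                   const std::vector<float>& c, std::vector<float>& out) {
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i] + c[i];  // a*b+c: a candidate for an FMA instruction
}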
💡 <FMA Instruction>
This is an instruction provided by processor architectures to perform the calculation of a * b + c in a single operation instead of two separate operations, one for multiplication and one for addition.
By reducing the number of operations, it enhances the speed and improves calculation accuracy by minimizing rounding errors.
Additionally, most architectures also provide SIMD instructions for FMA, allowing for maximum performance gains when used.
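As an illustration, both the scalar and SIMD forms are accessible from C++. This is a minimal sketch for x86 (compile with -mfma); Arm NEON provides an equivalent such as vfmaq_f32:

#include <cmath>        // std::fma
#include <immintrin.h>  // x86 AVX/FMA intrinsics

// Scalar FMA: computes a*b + c in one instruction with a single rounding step.
float scalar_fma(float a, float b, float c) {
    return std::fma(a, b, c);
}

// SIMD FMA: eight single-precision fused multiply-adds per instruction.
__m256 simd_fma(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}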
Challenges of Fusion
The advantages of Operator Fusion are undeniable. As a result, other inference engines such as TFLite, XNNPACK, and OpenVINO already use Operator Fusion. However, unlike Optimium, which supports metaprogramming, these engines are limited to performing only basic Operator Fusion because they must implement all operator source code directly. For example, due to this inherent limitation, XNNPACK only supports fusing a limited set of activations, such as ReLU and ReLU6 after Convolution. Below, I will explain the specific reasons for this unavoidable limitation.
💡 <Summary of "Challenges in Fusion" for those short on time>
Implementing fused layers one at a time cannot cover different combinations of fusible operations and makes it difficult for developers to manage the system, leading to poor maintainability.
Implementing fusion at runtime is inefficient.
Fusion involves various combinations of operations, depending on the layer patterns that appear in the model architecture. The most common form is when activations are fused after Convolution or Fully-Connected Layers. Additionally, there are often cases where Element-wise Binary Operations (such as Add, Mul, Sub, Div, etc.) appear in sequence.
Not only the types of operations used in fusion, but also the number and order of operations, can vary. Fusion is not limited to just two operations. The number of operations fused can also be three or four, as illustrated below.

Writing code in advance for every possible combination of operation types, counts, and orderings is simply impossible. For illustration, with just 10 fusible operation types and chains of up to four operations, there are already 10 + 10² + 10³ + 10⁴ = 11,110 distinct fused kernels to implement. Therefore, most inference engines restrict the types of fusion allowed, permitting only a few specific combinations.

Second, implementing fusion for different combinations of operations is challenging, but managing the code once it's implemented is equally daunting. Because fusion involves many combinations of operations, the implementation of a single operation ends up repeated across multiple fused layers (as shown in the figure below with Add). If a problem is discovered in the implementation of an operation such as Add and requires modification, each occurrence of Add across the various combinations must be located and modified individually. Such challenges can compromise the stability of future operations and make package management cumbersome.

To avoid the complexity of composing fusion layers with different combinations, there's a method of performing fusion dynamically at runtime, based on the situation in which layers perform operations. However, this approach introduces a performance overhead. Let's consider an example of performing fusion after a Conv2D layer. During the execution of a layer, one can dynamically check for fusion opportunities after Conv2D using conditional statements (if), and perform the appropriate operation based on the type of layer. If there are multiple layers eligible for fusion, a loop (for) would be required to iterate through them. However, combining conditional statements and loops in the hot path at runtime is highly inefficient. Performing fusion this way can forfeit the performance gains of fusion due to the overhead these conditionals and loops introduce.
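The following hypothetical C++ sketch (names and structure are illustrative, not taken from any engine's actual code) shows where the overhead comes from: every output element pays for an inner loop over the fused ops and a branch per op:

#include <cstddef>
#include <vector>

enum class OpKind { Add, Mul, Relu };

// Runtime fusion applied to the output of a Conv2D layer: the fused ops are
// only known at runtime, so the hot loop carries a nested loop and a branch.
void apply_runtime_fusion(float* out, std::size_t n,
                          const std::vector<OpKind>& fused_ops,
                          const std::vector<const float*>& operands) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = out[i];  // the Conv2D result for element i
        for (std::size_t k = 0; k < fused_ops.size(); ++k) {  // per-element loop
            switch (fused_ops[k]) {                           // per-element branch
                case OpKind::Add:  v += operands[k][i];     break;
                case OpKind::Mul:  v *= operands[k][i];     break;
                case OpKind::Relu: v = v > 0.0f ? v : 0.0f; break;
            }
        }
        out[i] = v;
    }
}

The per-element branch and inner loop also typically defeat vectorization, compounding the cost described above.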

Fusion and Nadya
To overcome the above limitations and achieve maximum performance, Optimium uses a proprietary language called Nadya, which supports Metaprogramming, to implement layers. The process can be summarised as follows.

At the Code Generation stage, information about the model, in particular the details of each layer, is passed to Nadya modules, which are responsible for generating code for each layer. Each Nadya module produces optimized code that can be used at runtime based on this information. This process of generating runtime code through code generation is called Metaprogramming.
In Optimium, we developed a Fusion Module exploiting Nadya, which allows for a more flexible use of fusion functionalities than the previous limited capabilities. Broadly, Fusion in Optimium aims to achieve the following objectives:
Implementing fused layers one by one cannot cover the various combinations of fusible operations and makes the system difficult for developers to maintain.
→ Near-unlimited fused-layer support through Metaprogramming.
Implementing fusion at runtime is inefficient.
→ Fusion is performed during code generation, ensuring efficient operations at runtime.
We divide fusion into two kinds based on the relative positions of layers in the connected layer graph. For example, in the sequence Add-Conv-Mul-ReLU, taking the convolution (Conv) as the most computationally expensive operation, fusion can be divided into Pre-Fusion for the Add before Conv, and Post-Fusion for the Mul-ReLU after Conv.
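A minimal sketch of this split, in hypothetical C++ (Optimium's actual implementation lives in Nadya modules):

#include <algorithm>
#include <string>
#include <vector>

struct FusionPlan {
    std::vector<std::string> pre;   // Pre-Fusion: ops before the anchor
    std::vector<std::string> post;  // Post-Fusion: ops after the anchor
};

// Split a layer sequence around its most computationally expensive ("anchor") op.
FusionPlan split_fusion(const std::vector<std::string>& seq, const std::string& anchor) {
    FusionPlan plan;
    auto it = std::find(seq.begin(), seq.end(), anchor);
    plan.pre.assign(seq.begin(), it);
    if (it != seq.end()) plan.post.assign(it + 1, seq.end());
    return plan;
}

// split_fusion({"Add", "Conv", "Mul", "ReLU"}, "Conv")
//   -> pre = {"Add"} (Pre-Fusion), post = {"Mul", "ReLU"} (Post-Fusion)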

In Optimium, during the code generation phase, the fusion information divided as described above is passed from the Graph Module to the Layer and finally to the Fusion Module. Based on this information, the Fusion Module generates runtime code for each fused operation, which is ultimately integrated into the fused layer's runtime code.
Let us consider the example Mul-Add-ReLU. In Optimium's Graph Module, when generating code for the Mul layer, information about the Mul layer is provided, along with information about fusion with Add-ReLU. During the Mul layer's code generation, this fusion information is passed to the Fusion Module. Using it, the Fusion Module generates code for the fused operations, which is then passed back to the Mul layer. The Mul layer integrates the fused code into its own generated code. This process allows Optimium to efficiently generate runtime code for the fused Mul-Add-ReLU.
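The following hypothetical C++ sketch illustrates the idea. The real Fusion Module emits Nadya code, but the principle of folding fused ops into the base layer's loop body at code-generation time is the same:

#include <string>
#include <vector>

// Fold a chain of fused ops into a single expression string at code-generation
// time. The base layer embeds the result in its loop, so the runtime code
// contains no dispatch logic at all.
std::string fuse_expression(std::string expr, const std::vector<std::string>& fused) {
    for (const std::string& op : fused) {
        if (op == "Add")       expr = "(" + expr + " + add_in[i])";
        else if (op == "Mul")  expr = "(" + expr + " * mul_in[i])";
        else if (op == "ReLU") expr = "fmaxf(" + expr + ", 0.0f)";
    }
    return expr;
}

// Generate the runtime loop for a Mul layer with fused successors.
std::string generate_mul_layer(const std::vector<std::string>& fused) {
    std::string body = fuse_expression("(a[i] * b[i])", fused);  // the Mul itself
    return "for (int i = 0; i < n; ++i) out[i] = " + body + ";";
}

// generate_mul_layer({"Add", "ReLU"}) yields:
//   for (int i = 0; i < n; ++i) out[i] = fmaxf(((a[i] * b[i]) + add_in[i]), 0.0f);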

More generally, the Graph Module can generate fusion information for arbitrary combinations of layers. Based on this information, when generating code for each layer, the Fusion Module integrates the code generated for fused layers, effectively adapting to different layers and fusion combinations. Implementation for fused layers only needs to be done in the Fusion Module, eliminating the need to implement it for each individual layer, and any future changes only need to be made in the Fusion Module. This solves the first problem with fusion, where it was impractical to accommodate numerous fusion combinations and maintainability was a challenge. In addition, the Fusion Module generates fused code during code generation rather than at runtime, so the actual runtime code contains no conditional statements or dispatch loops and executes efficiently. This solves the second problem of inefficient fusion at runtime.

Why Optimium?
Exactly! In Optimium, leveraging our in-house language Nadya enables us to harness the benefits of metaprogramming to address the shortcomings of traditional fusion methods. As a result, we can support a broader range of fusion capabilities compared to existing inference engines while still providing efficient runtime performance.

The graph above illustrates the performance improvement from applying Fusion in Optimium. As you can see, Fusion accelerates performance by approximately 1.3 times. This improvement is achievable because Optimium supports not only basic fusions like Convolution+ReLU, which other inference engines also support, but also a variety of other operations such as binary operators and padding, with fusion extending to chains of 3–4 operations. While this measurement was taken on a Raspberry Pi 5, similar improvements can be observed on other devices as well.
Run Optimium, run!
Currently in beta testing, Optimium has attracted the attention of several companies due to its superior performance compared to existing inference optimisation engines across different hardware environments. Designed as a comprehensive collection of different optimisation techniques, not limited to Operator Fusion as described today, Optimium has already demonstrated superior inference speeds compared to widely used engines such as TensorFlow Lite and XNNPACK. If you are interested in experiencing Optimium first hand, please feel free to contact us at any time using the beta sign-up link below.
https://wft8y29gq1z.typeform.com/to/Sv9In4SI

Life is too short, you need Optimium