Optimium 101 (2): My name is Nadya, a programming language for high-performance computing.

To accelerate the computation of AI models, we developed a new language for high-performance computing called Nadya, and it is actively used in Optimium.

Sewoong Moh

April 2, 2024

Hello. This is Sewoong Moh from the Optimium team, which is developing an AI inference optimization engine. As mentioned in the previous post, Optimium is a powerful inference engine that helps AI developers deploy optimal AI models easily and quickly. To accelerate the computation of AI models, we developed a new language for high-performance computing called Nadya and actively use it in Optimium. Today, I will give a detailed introduction to Nadya, a language developed by ENERZAi. Yes! It's MADE BY ENERZAi.

Let's quickly review the previous post before we move on. The key advantages of using Optimium for AI model deployment are as follows!

Saves unnecessary time spent on inference optimization.

  • Instead of manually optimizing models for deployment, Optimium automatically optimizes AI models for inference and deployment. You can now meet your tight release targets!

Delivers high compatibility with existing frameworks.

  • Supports basic operators usable in various frameworks and can be easily extended if additional operators are needed.

Provides flexible support regardless of manufacturer within the supported architecture range.

  • Within the currently supported architecture range (x86/x64/Arm), it runs anywhere regardless of supplier or brand.

Optimium can deliver overwhelming performance compared to other state-of-the-art inference engines because it automatically generates and optimizes code to fit the given target hardware. This process involves aspects that cannot be implemented with existing general-purpose programming languages (e.g., Python or C++), so we developed a new programming language called Nadya specifically for Optimium. Nadya keeps code simple to write while supporting automatic code generation at the language level and embedding optimization pipelines tailored to various target hardware.

What makes Nadya powerful? 1: Metaprogramming

Metaprogramming means enabling a program to write programs itself. Simply put, rather than directly writing the actual code that will run, developers create programs that generate the code to be executed later. Python has features resembling Metaprogramming, such as decorators and metaclasses, and C++ supports Metaprogramming to some extent through templates and macros. However, in these languages, handling large-scale data or generating complex code through Metaprogramming becomes extremely cumbersome due to their limited facilities, making it practically challenging to use.
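
For reference, here is a small C++ sketch (not Nadya) of what compile-time computation looks like with templates. Even this simple factorial has to be expressed as recursive template specializations rather than ordinary control flow, which hints at why larger code-generation tasks quickly become unwieldy in this style.

// A small C++ template-metaprogramming sketch: the factorial is computed
// entirely at compile time, but the recursion must be encoded as template
// specializations instead of a normal loop or recursive function call.
#include <cstdio>

template <int N>
struct Factorial {
    static constexpr int value = N * Factorial<N - 1>::value;
};

template <>
struct Factorial<0> { // base case as a full specialization
    static constexpr int value = 1;
};

int main() {
    std::printf("%d\n", Factorial<5>::value); // 120, computed by the compiler
    return 0;
}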

Nadya, on the other hand, was designed from the start to make Metaprogramming easy. With Nadya, developers can generate code as if they were handling ordinary data, without having to learn especially complex features. As a result, Nadya supports Metaprogramming that is more powerful and flexible than what existing programming languages have been able to offer.

💡 <Metaprogramming>

  • Values can be assigned to types without instantiating objects, and operations can be performed on those types.

  • The advantage is that parts which can be computed statically are pre-calculated at compile time, improving program execution speed.

  • However, because these operations run at compile time, they cannot be debugged at run time, which makes it very challenging to find bugs in code written with Metaprogramming.

So, Why is Metaprogramming Important for Optimium’s Performance Optimization?

You might wonder why we can't just write code directly without going through complex Metaprogramming. Of course we could, but then developers would have to write much longer code, many times over. Most computers have very picky tastes, so the form of a program that runs fast varies from one machine to another. Even for the same matrix-multiplication algorithm, the code that runs quickly on Arm-based chips, which are common in smartphones, is different from the code that runs well on Intel chips. Moreover, even among Intel or Arm-based chips, it can vary by model or generation. To cover all these cases, you really need to write a lot of code. In fact, existing inference engines like TensorFlow Lite implement different algorithms for different hardware just to optimize a single layer.

With Metaprogramming capabilities like Nadya's, however, the situation changes. Since we already know at compile time which hardware the code will run on, all we need to do is generate code tailored to that hardware, which eliminates much of the extra effort. A 100-line Nadya program using Metaprogramming can therefore achieve better performance than 1,000 to 10,000 lines of code written in other programming languages. It is precisely because of these advantages that our team can achieve a high level of optimization with far fewer people than existing inference engines that involve many more developers.
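
As a rough illustration of the duplication this avoids, consider the following C++ sketch (written for this post, not taken from the Optimium code base): without compile-time code generation, even a trivial elementwise addition tends to be hand-written separately for each target, and real inference kernels repeat this pattern at a much larger scale.

// Illustrative C++ only: the same elementwise addition, hand-written per target.
#if defined(__ARM_NEON)
#include <arm_neon.h>
void add_f32(float* c, const float* a, const float* b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) // 4 floats per 128-bit NEON register
        vst1q_f32(c + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    for (; i < n; ++i) c[i] = a[i] + b[i]; // scalar tail
}
#elif defined(__AVX__)
#include <immintrin.h>
void add_f32(float* c, const float* a, const float* b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) // 8 floats per 256-bit AVX register
        _mm256_storeu_ps(c + i, _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    for (; i < n; ++i) c[i] = a[i] + b[i]; // scalar tail
}
#else
void add_f32(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i]; // portable scalar fallback
}
#endif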

Optimium uses Metaprogramming in Nadya to generate code in various ways and finds the implementation that runs fastest on your hardware. As a result, the program it produces is tailored to your machine, and your computer executes it faster and more efficiently, like a marathon runner wearing custom-made running shoes instead of off-the-shelf sneakers. This is why Optimium can provide excellent performance without being constrained by the specifications or type of hardware.

Efforts in Nadya to Make Metaprogramming Easy to Use

  1. The semantics and syntax of code executed at compile time for Metaprogramming and code executed at run time are compatible. This means you don't have to learn new semantics to metaprogram your code (apart from a small set of differences). You can even run code generated by Metaprogramming at compile time!

  2. You can treat expressions as values. Just like any other value, you can compose them together or pass them as an argument to a lambda (closure) parameter.

    let expression = !{a + b} // Initialize value 'expression' to runtime expression 'a + b' 
    print(expression) // Print 'a + b'
    let twoExprs = [expression;expression] // Creates list of expressions
  3. You can nest expressions. This means you may have a compile-time-evaluated piece of code inside the expression you are building.

    let res = fib 10 // 'fib' is a lambda function returning value of fibonacci sequence at given index. In this case, res would be 55
    let expression = !{a + ${res}} // Build expression of "a + (compile time output of 'res')".  If 'res' was evaluated to '55', generate "a + 55" as runtime expression
    print(expression) // Print 'a + 55'

What makes Nadya powerful? 2: Smart Compiler Optimization Pipeline

For code generated through Metaprogramming in Nadya to be executable, it needs to be translated into machine code that the computer can read. This is the compiler's job. However, Nadya does not simply map code to machine code one-to-one. Compilation in Nadya progresses through several pipeline stages, as follows:

  1. Automatic parallelization

    The Nadya compiler can analyze loops on its own to determine whether they can be parallelized safely.

    When the developer's computing environment supports parallelization (for example, multiple cores are available and there are no power constraints), automatic parallelization can be enabled by adding "attr[Parallel : true]" to the code, as shown below. (For a loose analogy in a more familiar language, see the C++/OpenMP sketch after this list.)

    // Matmul implementation in Nadya language
    // f32 : 32bit floating point, i32 : 32bit signed integer
    fun matmul(mut &c : tensor<f32, 2>, a : tensor<f32, 2>, b : tensor<f32, 2>) -> i32 {
     attr[Parallel : true] // Use Automatic parallelization
     for(mIdx from 0 to 8 step 1){
      // ... Implementation
     }
     0 // Return 0 on success 
    }
  2. Automatic memory optimization

    Computers typically run faster when they access memory addresses that are the same or close to each other. Nadya analyzes the memory access pattern of the developer's code and optimizes it to access nearby memory addresses as much as possible. (A hand-written illustration of this locality principle follows after this list.)

  3. Automatic vectorization

    Most computers today can perform multiple operations simultaneously. Even when adding numbers, this can be done not one at a time but 4 or 8 at a time (this is called vectorization). Using this feature used to require specialized hardware knowledge, but don't worry: Nadya analyzes the developer's code on its own so that the computer performs multiple operations at once, resulting in faster execution.
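
To make the first two stages above more concrete, here are two small C++ sketches of the underlying ideas. They illustrate the general techniques only; they are not Nadya or Optimium code. The first is a loose analogy for automatic parallelization: in C++ with OpenMP, the developer marks a loop and the toolchain distributes its iterations across cores, whereas Nadya's compiler additionally analyzes the loop itself to confirm that parallelization is safe.

// Loose C++/OpenMP analogy for automatic parallelization (compile with -fopenmp):
// iterations of the marked loop are distributed across the available cores.
#include <vector>

void scale(std::vector<float>& v, float factor) {
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] *= factor;
}

The second sketch illustrates the locality principle behind automatic memory optimization: both functions compute the same column sums of a row-major matrix, but the second walks memory contiguously and therefore makes far better use of each cache line. Here the transformation is written by hand; Nadya applies this kind of rewrite automatically.

// Strided access: consecutive iterations touch addresses 'cols' floats apart.
void column_sums_strided(const float* m, float* out, int rows, int cols) {
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            out[c] += m[r * cols + c];
}

// Interchanged loops: consecutive iterations touch adjacent addresses,
// so each cache line fetched from memory is fully used.
void column_sums_contiguous(const float* m, float* out, int rows, int cols) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            out[c] += m[r * cols + c];
}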

Before Meeting Nadya (feat. Automatic Vectorization Not Applied)

There are many different computers and chips in the world, and the conditions under which they perform best vary. Optimization methods such as vectorization or loop unrolling affect performance differently depending on how and to what extent they are applied. Existing inference engines fix these settings, making them difficult to change, and applying the various optimization methods a developer wants requires laborious, environment-specific tuning. Optimium, which is built on Nadya, is different. To illustrate the power of Nadya, let's look at matrix multiplication, one of the most commonly used operations in AI workloads such as Transformers and Convolutions.

💡 <Vectorization>

  • Operations are performed using optimized array expressions that replace loops, which correspond to SIMD (Single Instruction, Multiple Data) operations on the hardware.

  • The advantage is that a single instruction processes multiple data elements, reducing the number of instructions executed and improving program execution speed.

// Matmul implementation in Nadya language
// f32 : 32bit floating point, i32 : 32bit signed integer
fun matmul(mut &c : tensor<f32, 2>, a : tensor<f32, 2>, b : tensor<f32, 2>) -> i32 {
 for(mIdx from 0 to 8 step 1){
  for(nIdx from 0 to 8 step 1){
   let mut acc = 0.0f
   for(kIdx from 0 to 8 step 1){
    acc <- acc + a[(mIdx, kIdx)]*b[(kIdx, nIdx)]
   } 
   // Assign the result (acc) to c
   c[(mIdx, nIdx)] <- acc
  }
 }
 0 // Return 0 on success 
}

fun main() -> i32 {
 // Define input and output tensors
 let a = tensor((8, 8), 1.0f) // Initialize Tensor a with a shape of 8x8 to 1.0f.
 let b = tensor((8, 8), 2.0f) // Initialize Tensor b with a shape of 8x8 to 2.0f.
 let mut c = tensor((8, 8), 0.0f) // Initialize Tensor c with a shape of 8x8 to 0.0f.
 matmul(&c, a, b)
}

!{
 main() 
}

The code above is not yet optimized with Nadya. The code inside each fun function is compiled directly into binary code for the target hardware, so it can be executed immediately. However, this code struggles to achieve high performance. There are several reasons for this, but the main one is that it operates on only one value at a time.

Compiling the unoptimized code for the Arm Cortex-X1 target produces the following assembly instructions, which show the part of the innermost loop that performs the multiplications and additions. On closer inspection, you can see that each number is operated on separately.

# Inside the Main Loop
...
ldr s3, [x11, x22, lsl #2]
fadd s1, s1, s0 # Addition, one number at a time
fmul s3, s4, s3 # Multiplication, one number at a time
fadd s1, s1, s2 # Addition, one number at a time

To visualize this: matrix C is the result of multiplying matrices A and B, and because the values of matrix B are processed one at a time, deriving the result takes relatively longer.

After Meeting Nadya (feat. Automatic Vectorization Applied)

As mentioned earlier, vectorization means performing operations on multiple values at once instead of one at a time. Applying vectorization reduces the number of instructions to execute, resulting in improved speed. With slight modifications to the code written above, you can vectorize the code to the desired size. The following code is an example of how Nadya optimizes by vectorizing the code as much as the developer desires.

// Vectorized Matmul implementation in Nadya language
// vectorBitWidth : The maximum available vector bit width
template</vectorBitWidth/>
attr[ Optimization : { VectorSize : vectorBitWidth }]
fun matmul(mut &c : tensor<f32, 2>, a : tensor<f32, 2>, b : tensor<f32, 2>) -> i32 {
 // The number of elements that fit into the vector bit width
 // Computed at compile time
 let elems = ${vectorBitWidth} / 32
 for(mIdx from 0 to 8 step 1){
  for(nIdx from 0 to 8 step elems){
   let mut acc = tensor((elems,), 0.0f)
   for(kIdx from 0 to 8 step 1){
    // Compute numbers in batches of elems at once
    acc <- acc + a[(mIdx, kIdx:kIdx+1:1)]*b[(kIdx, nIdx:nIdx+elems:1)]
   }
   // Assign the result (acc) to c
   c[(mIdx, nIdx:nIdx+elems:1)] <- acc
  }
 }
 0 // Return 0 on success
}

template</vectorBitWidth/>
fun main() -> i32 {
 let a = tensor((8, 8), 1.0f) // Initialize Tensor a with a shape of 8x8 to 1.0f.
 let b = tensor((8, 8), 2.0f) // Initialize Tensor b with a shape of 8x8 to 2.0f.
 let mut c = tensor((8, 8), 0.0f) // Initialize Tensor c with a shape of 8x8 to 0.0f.
 matmul</vectorBitWidth/>(&c, a, b)
}

// The size of the vector to be used
let vectorBitWidth = 128

!{ 
 // Pass the vectorBitWidth parameter to the main function as a template argument
 main</vectorBitWidth/>()
}

In the code above, the value of vectorBitWidth (the size of the vector in bits) determines the size of the vectors the program will use. The Nadya code generates code whose vector size is adjusted according to vectorBitWidth, so without changing any other part, simply adjusting this one value determines the vector width used by the program. Compared to before, the vector chunks of matrix B that are processed at once have become wider, allowing for faster operations.

In this way, with Nadya you simply specify the desired vector size, regardless of processor type, and the compiler optimizes for it automatically. Arm-based processors such as those in the Raspberry Pi and Amazon Graviton commonly use 128-bit vectors (with a theoretical maximum of 2,048 bits), while Intel- and AMD-based processors typically use 256-bit or 512-bit vectors. Optimium, however, can experiment with multiple candidate values of vectorBitWidth to find the one that performs best.

Compiling the code above for the Arm Cortex-X1 target produces the following assembly instructions.

... 
ldr q2, [x16]
fmla.4s v1, v0, v3[0]
ldp s0, s3, [x0, #-8]
fmla.4s v1, v2, v0[0]

Unlike before, the ‘fmla’ instruction now appears, which optimizes the computation by processing four numbers at once.

  • The ‘fmla’ operation performs multiplication and addition simultaneously.

  • ‘.4s’ means it operates on four numbers at once.
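
For readers who prefer C to assembly, a single fmla.4s corresponds roughly to the ARM NEON fused multiply-add intrinsic below (a hand-written illustration, not code generated by Nadya):

// Roughly what one fmla.4s instruction does: acc[i] += a[i] * b[i]
// for four 32-bit floats at once (ARM NEON intrinsics).
#include <arm_neon.h>

float32x4_t fused_multiply_add(float32x4_t acc, float32x4_t a, float32x4_t b) {
    return vfmaq_f32(acc, a, b); // typically compiles to a single fmla.4s
}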

This makes the computation much faster. Through Nadya's code generation, Optimium can compile code to fit various hardware by simply adjusting a single value, without rewriting code for each CPU or architecture. The layers in Optimium are implemented with Nadya in this way. While the example above demonstrates a simple vectorization, the real operations implemented in Optimium use a much wider range of optimization techniques and advanced code generation features.

Optimium adjusts the values that affect code generation and searches for the code that runs fastest on the target where it will actually execute. Unlike inference engines written in C/C++ or assembly, there is no need to tailor the code differently for each type of hardware. The Optimium team has carried out sophisticated optimization for each hardware target using Nadya, and operations written in Nadya need only one implementation to work across all supported hardware targets. This allows Optimium to deliver overwhelming performance compared to existing state-of-the-art inference engines.

Optimium powered by Nadya!

While Nadya currently supports only CPUs, we plan to add support for new hardware, including GPUs (via CUDA and Vulkan), in the future. Although Nadya cannot be accessed directly during this beta test, we are considering releasing it publicly after further validation and stabilization. Your continued interest in Nadya is greatly appreciated. 🙂

Optimium, which our team built on Nadya, is currently in beta testing. It has been attracting attention from various companies for its performance across diverse hardware environments, including faster inference than the widely used TensorFlow Lite XNNPACK. If you're interested in experiencing it firsthand, please feel free to contact us through the beta application link below at any time.

👉 https://wft8y29gq1z.typeform.com/to/fp059MY5

Life is too short, you need Optimium
