Optimium 101 (5) — Introduction to Nadya

Optimium

Solutions

Company

Resources

Contact

Select Language

Optimium

Solutions

Company

Resources

Contact

Select Language

Product

Optimium 101 (5) — Introduction to Nadya

Currently used for internal computations within Optimium, Nadya aims to become a high-performance computing tailored programming language that can be used anywhere. The Nadya development team is working tirelessly to realize the vast potential of high-performance computing.

Jaewoo Kim

May 16, 2024

Hello, my name is Jaewoo Kim, and I am developing Nadya at ENERZAi. Previously, as part of the Optimium team, I briefly introduced Nadya’s optimization features. Today, I want to focus more on Nadya itself. Currently used for internal computations within Optimium, Nadya aims to become a high-performance computing tailored programming language that can be used anywhere. The Nadya development team is working tirelessly to realize the vast potential of high-performance computing. I am delighted to introduce Nadya to you.

What is Nadya?

Nadya is a programming language developed to enable anyone to develop high-performance computing (HPC) software easily, quickly, and safely. Currently, it is used for implementing internal computations in Optimium and is continuously evolving through ongoing research and development.

Background of Nadya’s Creation

When initially developing Optimium, we encountered the following issues using compiled languages like C or Rust for internal computation kernels:

Achieving faster performance took a long time and required significant effort.
Even with effort, performance did not meet expectations on different platforms.

To make Optimium’s performance universally efficient, we would have needed to manually implement numerous computations for every hardware, which was impractical for our team with limited resources. Therefore, we decided to generate target-specific code for computations.

So, how can we achieve code generation? The simplest approach that comes to mind is directly generating C code. This involves creating quantifiable patterns and performing code generation based on those patterns. Although this method is commonly used by existing solutions, it wasn’t suitable for achieving the performance level we desired.

Therefore, we decided to create a new programming language. Our goal was to support code generation at the language level, provide strong compile-time optimizations, and facilitate the structuring of computational tasks through functional paradigm support. After much effort, this led to the creation of Nadya.

Design Direction

Although the impetus for creating Nadya was Optimium, our team aimed to develop a universal language suitable for all applications requiring high-performance programming. The new language needed to be vertsatile, and easy to use while being suitable for Optimium. In summary:

Enables quick and easy HPC code writing (improving).
Operates fully and independently anywhere (supported).
Supports strong compiler optimizations without programmer intervention (supported, improving).
Built-in metaprogramming and code generation (supported, improving).
Ensures memory safety (in progress).
Ensures type safety (in progress).
Simplifies computations with functional paradigm support (partially supported, improving).

The goals are ambitious, but the engineers developing Nadya have chosen to embrace the challenge. As they say, you make bold moves and then manage the aftermath. Some goals have already been achieved, and rigorous research and development are ongoing to accomplish all the objectives.

Shall we take a look at Nadya’s key features one by one?

(Some features are still under development, so the exact syntax may change before the official release.)

Powerful Compiler Optimization

The Nadya compiler supports the analysis and optimization of user code patterns, going beyond typical compiler options like -O3, with built-in language-level optimization features.

Here are some of the compiler optimization features in Nadya:

Code Pattern Analysis and Intrinsic Lowering

When users write code, the Nadya compiler converts it into an intermediate representation (IR) specifically designed for Nadya. It analyzes patterns in this IR and replaces them with hardware instructions that can accelerate the pattern.
For example, ARM has an instruction called VQDMULHQ which multiplies two vectors, multiplies by 2, and returns the high half. In Nadya, the following code would automatically use this instruction through pattern analysis:

let tensorA = tensor((8,), dataA) 
let tensorB = tensor((8,), dataB) 
let result = (tensorA * tensorB * 2) >> 16

Below is the assembly code when the above code is compiled for ARM64:

...  
stp x29, x30, [sp, #-16]!           ; 16-byte Folded Spill  
bl _vqdmulhqs32x4 // VQDMULHQ instruction working on 32bit x 4 vector  
ldp x29, x30, [sp], #16             ; 16-byte Folded Reload

This can significantly enhance performance by simplifying logic that would otherwise need to be manually written or looped through in other languages.

Cache Optimization

The Nadya compiler analyzes memory allocation logic to maximize cache locality. Typically, cache performance improves when the same location is accessed repeatedly or when sequential addresses are accessed. To achieve this, Nadya adjusts the program to repeatedly access similar memory areas, as long as it doesn’t compromise code correctness. Let’s revisit the example from point 1:

let tensorA = tensor((8,), dataA) 
let tensorB = tensor((8,), dataB) 
let result = (tensorA * tensorB * 2) >> 16  

// Access tensorA 
print(tensorA[(1, )])  

// tensorB is not accessed again. 
// Thus, the result can use tensorB's memory as they have the same size.

In this example, tensorA is reused, but tensorB is not used after its initial computation. Since result and tensorB require the same amount of memory, the Nadya compiler can optimize by allowing result to use the same memory as tensorB. This reduces the overall memory usage and makes the code more cache-friendly.

3. Automatic Parallelization

The Nadya compiler has a built-in feature that automatically parallelizes loops if there are no destructive updates (sequential updates). For example, if the code is written as follows, and the compiler determines the loop can be parallelized, it will do so:

attr[Parallel] 
for(idx from 0 to 10 step 1){
 // Implementation 
}

4. Stack Forwarding & Register Forwarding

The Nadya compiler has the capability to eliminate unnecessary heap allocations and directly manage memory. If a structure is suitable for stack allocation, it automatically assigns the address to the stack. Furthermore, if stack-allocated memory can be moved to registers, the compiler supports this transfer.
The rationale behind these features is that stack memory allocation is significantly faster and eliminates the risk of memory leaks. Stack memory is automatically released when a function terminates, and allocation is completed by updating the processor’s stack pointer, making it very fast.
If we go a step further and store data in registers, the processor can operate much faster since it doesn’t need to access memory. However, excessive use of registers can lead to a shortage of necessary registers, which may degrade performance, so appropriate adjustments are necessary.
The Nadya compiler handles these processes automatically, taking care of the parts that programmers would normally have to manage manually according to the architecture.

There are many other optimization techniques not covered here, but since it exceeds the scope of this document, they will be introduced separately in future articles.

Simplified Data Operations

Nadya features a unique type called ‘tensor’. The tensor type is designed to easily handle data similar to matrices and is defined by its shape and data type. Tensors support compiler-level optimizations, allowing programmers to manage data efficiently.

You can define and use a tensor as follows:

let a = tensor((2,3), {1,0f,2,0f,3,0f,4.0f,5.0f,6.0f}) // 32bit floating point tensor shaped (2,3) 
let b = tensor((3,), {1.0f, 2.0f, 3.0f}) // 32bit floating point tensor shaped (3,) 
// Tensor arithmetics (automatically broadcasted) 
print(a + b) // tensor((2,3), {2.0f, 4.0f, 6.0f, 5.0f, 7.0f, 9.0f})

You can also reference and modify a tensor as shown below, with compiler optimizations still functioning correctly:

let mut a = tensor((2,3), {1,0f,2,0f,3,0f,4.0f,5.0f,6.0f}) // 32bit floating point tensor shaped (2,3) 
let mut refA = &a 
refA[0, 0] <- 3.0f // Modifies both a & refA 
print(a) // prints tensor((2,3), {3,0f,2,0f,3,0f,4.0f,5.0f,6.0f})

Functional Paradigm

The functional paradigm is one of the areas we focus heavily on. Enhancing performance while maintaining a functional paradigm is challenging, but we believe it provides significant value to programmers. It ensures type safety and allows for simpler program definitions. Functional programming in Nadya can be defined as follows:

Pure Functions: Functions without side effects, making it easier for programmers to model how their program will behave.

// Pure function 
let pureFunction a b = a * b + 10 

// Recursive pure function 
let rec recursivePureFunc input =
 // pattern matching
 match input with
 | 0 -> 0
 | _ -> input + recursivePureFunc (input - 1)

If you can define pure functions, they can be easily utilized for high-performance computations through parallel processing. In parallel programming, side effects are the biggest obstacle. Ensuring that no side effects occur at the language level makes it easier to manage parallel tasks. Here’s an example:

import core.Parallel 

// Perform parallel multiplication of list elements by 10 
let multipledList = core.Parallel.map <| [0;1;2;3;4] <| (lambda (elem) { elem * 10 }

Treating Functions as Values. You can treat functions as values in Nadya. Unlike C++ and Rust, where capturing external variables changes the lambda expression’s type, Nadya treats functions with the same arguments and return types as the same type. This eliminates the need for concepts like C++’s type erasure to unify lambda expression types. For example:

let outSide = 2 
let funcA a b = a + b 
let funcB c d = c + d + outSide 

// Since funcA and funcB have the same type, they can be stored in the same data structure. 
let closures = [funcA; funcB]

Memory Safety

Nadya is designed to minimize writing unsafe memory code at the language level. It introduces the concept of ownership to manage this. Nadya tracks how data object ownership moves to prevent incorrect memory usage as much as possible. This feature is currently implemented and will continue to evolve.

Ownership

Data objects can be borrowed (reference), copied, or moved, with a maximum of one variable holding ownership at any given time.
Owning a data object means the owning variable has priority over any variables borrowing the data and is responsible for its creation and deletion.

let owned = DataType(10) // owned holds ownership of DataType(10) 
// When the scope ends, owned and DataType(10) will be deallocated.

2. If a variable without ownership wants to read or modify the data, it must borrow (reference) the data from the owning binding.

let owned = DataType(10) 
let borrow = &owned // borrow references data from owned 
let newValue = (@borrow).addTo(11) // Access through reference

This ensures that variables referencing the data can access it but cannot deallocate or arbitrarily transfer ownership. The purpose of this design is to allow the owning variable to transparently manage data creation and deletion, which is crucial for safe memory management.

3. The owning variable must outlive the referencing bindings.

Nadya analyzes this at compile time and alerts the programmer if this guarantee isn’t met, preventing memory reference errors. This feature ensures that if a referencing variable outlives the owning variable, it won’t access deallocated memory, which could cause errors.

let owned = DataType(10) 
let borrow = &owned // borrow references data from owned 
let moved_borrow = move(borrow) // Transfer ownership to moved_borrow 

print("{}", @borrow) // Error! borrow cannot be used after being moved 
print("{}", @moved_borrow) // Prints DataType(10)

Built-in Code Generator

One of Nadya’s distinguishing features is its language-level code generation capability. This allows programmers to define and generate new Nadya code within Nadya itself. While this feature may not be critical in most contexts, it is essential for a high-performance programming language like Nadya. This functionality enables Optimium to generate different executable code tailored to the user’s environment, achieving high optimization.

For example, depending on the input size, it may perform standard matrix multiplication for smaller sizes or tiled matrix multiplication for larger sizes. You can compare the input size to a threshold as follows.

You might wonder why not handle this branching in real-time based on the input. Pre-generating the code can reduce the size of the actual target code and save the time required for algorithm or code selection during execution.

// Generate code creating default matmul, or tiled matmul depending on input size
// expression stores generated code by each function

let mut expression = !{0}

if(inputSize < threshold) {
    expression <- generateDefaultMatmul();
} else {
    expression <- generateTiledConv();
}

// Use generated code for building gemm algorithm
// !{ code } syntax indicates a code expression to be generated.
//! ${ code } syntax is used for immediate evaluation within the generating expression.
let gemm = !{ alpha * ${expression} } + beta * bias}

Future Plans

Nadya aims to enable even those without specialized knowledge to quickly and safely write high-performance programs. The development team is continually working towards this goal. Upcoming features for Nadya include:

GPU (CUDA, Vulkan) Support

Enabling easy and straightforward GPU programming, even for those unfamiliar with CUDA or Vulkan.

2. Matrix Extension Intrinsic Support for CPUs

Supporting matrix multiplication units in modern CPUs to accelerate AI computations.

3. Support for Various Parallel Programming Techniques

Researching built-in support for diverse parallelization methods beyond simple loop parallelization.

4. Enhanced Compiler Optimization

Continuing to research and implement various techniques to further strengthen the Nadya compiler.

In this way, Nadya aims to bring developers closer to writing high-performance programs easily and quickly through various methods. Nadya is currently under active development and will continue to undergo many improvements. We appreciate your interest and support as you follow our journey.

Optimium

Solutions

Company

Resources

ENERZAi

Business number: 246-86-01405