Optimium 101 (8) - Nadya Optimizing Compiler

Jaewoo Kim

July 8, 2024

Hello, my name is Jaewoo Kim, and I am developing Nadya at ENERZAi. Today, we will talk about the compiler technology inside Nadya, which analyzes and optimizes the user's code under the hood and generates the best possible output that runs fast and efficiently on the target hardware.

Nadya Optimizing Compiler (ndc)

ndc is the name of our compiler, which converts Nadya code into a binary (and not just any binary: a highly optimized binary that only ndc can provide). The Nadya optimizing compiler is the backbone technology behind the fast computational kernels used in Optimium. Moreover, it allows any programmer to write fast and efficient programs without knowing too many details about the hardware. Let me briefly introduce the process under the hood.

Parsing & lexing user code into an Abstract Syntax Tree (AST)

As we all know, computers do not understand human languages such as English or Korean. They only know machine language, composed of 0s and 1s. However, no one wants to code with brain-damaging numbers (if you see someone who does, you should probably consider taking them to a hospital).

A compiler is software that solves this problem. It translates code written in a human-readable language into machine code. The first step in doing so is parsing & lexing.

Lexer

The lexer decomposes code into 'tokens', the smallest pieces of the code that carry meaning.

For example, let’s say we write the following code.

let myVar = a + 1

When the lexer reads this string, it decomposes the code into the following tokens.

let, myVar, =, a, +, 1
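
To make the idea concrete, here is a minimal tokenizer sketch in OCaml. It is an illustration only, not Nadya's actual lexer, which scans character by character and tracks source locations.

type token = Let | Ident of string | Equals | Plus | Int of int

(* Classify one whitespace-separated word into a token. *)
let classify word =
  match word with
  | "let" -> Let
  | "=" -> Equals
  | "+" -> Plus
  | w ->
    (match int_of_string_opt w with
     | Some n -> Int n
     | None -> Ident w)

(* Split the source on spaces, drop empty pieces, classify the rest. *)
let lex source =
  String.split_on_char ' ' source
  |> List.filter (fun w -> w <> "")
  |> List.map classify

(* lex "let myVar = a + 1"
   = [Let; Ident "myVar"; Equals; Ident "a"; Plus; Int 1] *)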

This case is simple; in the real world, however, the same character may have different meanings depending on context. (For example, < normally means 'less than', but in the syntax cast<i32>(...), < marks the beginning of a type expression.) Therefore, the parser & lexer need to be context-sensitive, which Nadya resolves with two-pass parsing.
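
One simple way to picture the two-pass idea (a toy sketch, not ndc's actual implementation): a first pass produces tokens, and a second pass reclassifies them based on their neighbors.

type token = Ident of string | Less | TypeOpen

(* Second pass: a '<' right after `cast` opens a type expression,
   so it is reclassified from a comparison to a type delimiter. *)
let rec reclassify tokens =
  match tokens with
  | Ident "cast" :: Less :: rest -> Ident "cast" :: TypeOpen :: reclassify rest
  | tok :: rest -> tok :: reclassify rest
  | [] -> []

(* reclassify [Ident "cast"; Less; Ident "i32"]
   = [Ident "cast"; TypeOpen; Ident "i32"] *)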

Parser

The parser takes the tokens from the lexer and returns an abstract syntax tree (AST for short). The AST organizes the code into a tree-like structure that defines the program. Each node of the AST represents the semantics of a language construct, which defines how the program should behave under certain conditions.

Let's transform the code above into the AST defined in Nadya.
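
Roughly, the tree for let myVar = a + 1 looks like this:

let
├── id: myVar
└── Add
    ├── id: a
    └── number: 1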

As we can see here, this tree contains several pieces of important information required to execute the program:

  1. Components of the program (literal, id, number, …)

  2. How they are structured

To execute the semantics of the "let" node, we need the results of the "myVar" and "Add" nodes. In other words, you need to traverse this tree in the proper order to execute the program.
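As a minimal OCaml sketch (with illustrative node names, not Nadya's actual ones), such an AST and the post-order traversal that executes it could look like this:

type expr =
  | Number of int
  | Id of string
  | Add of expr * expr

(* "let" binds a name to the value of an expression. *)
type stmt = Let of string * expr

(* To execute Add we first need the results of its children,
   so evaluation is a post-order traversal of the tree. *)
let rec eval env expr =
  match expr with
  | Number n -> n
  | Id name -> List.assoc name env
  | Add (lhs, rhs) -> eval env lhs + eval env rhs

let exec env (Let (name, e)) = (name, eval env e) :: env

(* With a = 41 in scope, `let myVar = a + 1` binds myVar to 42. *)
let _ = exec [ ("a", 41) ] (Let ("myVar", Add (Id "a", Number 1)))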

Operations on AST

Using the AST, Nadya performs certain conversions and analyses. One of these analyses is type inference, which is done directly on the AST to infer the resulting type of each operation. For example, if we take a look at the AST above, we can find out that a and 1 are 32-bit integers, so the result of the addition should be a 32-bit integer as well. Specifically, Nadya uses an algorithm inspired by HM (Hindley-Milner) type inference to achieve this task. In this process, Nadya resolves the types of all expressions, including template types, and tells the user if there are any type mismatches.
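
Full HM inference works with type variables and unification; the following OCaml fragment is a drastic simplification that only shows the core idea of propagating and checking types bottom-up:

type ty = TInt32 | TFloat32

type expr =
  | IntLit of int
  | Var of string
  | Add of expr * expr

(* Infer the type of an expression from an environment mapping
   names to types. Both operands of + must agree, which is what
   unification enforces in the general HM setting. *)
let rec infer env expr =
  match expr with
  | IntLit _ -> TInt32
  | Var name -> List.assoc name env
  | Add (lhs, rhs) ->
    let tl = infer env lhs and tr = infer env rhs in
    if tl = tr then tl else failwith "type mismatch in +"

(* infer [("a", TInt32)] (Add (Var "a", IntLit 1)) = TInt32 *)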

Another thing Nadya performs on the AST is constant folding. Constant folding is a mechanism by which the compiler pre-computes deterministic values at compile time rather than at runtime, reducing execution cost. For example, in the AST above, if the compiler can determine the value of a statically, it can add it to 1 and replace the add node with the computed value. Nadya performs this both at the AST level and afterwards on the IR (described later): simple cases are handled on the AST, and complicated ones (involving branches and dead code elimination) are performed on the IR. Folding constants early on the AST is beneficial because it reduces the size of the AST, which makes other analyses and transformations performed on it faster and more efficient.
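
A minimal AST-level constant folder over the same illustrative expression type could look like this:

type expr =
  | IntLit of int
  | Var of string
  | Add of expr * expr

(* Recursively fold Add nodes whose operands are both known
   constants, replacing the add node with the computed value. *)
let rec fold expr =
  match expr with
  | Add (lhs, rhs) ->
    (match (fold lhs, fold rhs) with
     | IntLit a, IntLit b -> IntLit (a + b)
     | l, r -> Add (l, r))
  | e -> e

(* If a is statically known to be 41, Add (IntLit 41, IntLit 1)
   folds to IntLit 42 before any IR is ever generated. *)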

Operations on Intermediate Representation

Once the AST has been constructed, it can be turned into an intermediate representation (IR). An intermediate representation is a layer between the AST and machine code that expresses the program. The IR generator traverses the AST, creating IR that preserves all of the AST's operations.

In Nadya, different kinds of IR are organized together to represent the whole structure of the program.

Each IR operation represents the semantics of an expression. For example, adding tensors can be represented as IR such as tensor.add %a, %b : (tensor<i32, 4>, tensor<i32, 4>) -> tensor<i32, 4>. This tells the compiler that it should figure out a way to add the two tensors and convert the operation into assembly.

In Nadya's case, we need to perform complicated optimization processes using this IR, and these processes operate at various layers of abstraction. For example, determining whether a loop can be parallelized can be done with high-level context, but choosing the right hardware intrinsic for a certain piece of code has to be determined with low-level context, where we know exactly which operations are going to happen.

Therefore, Nadya uses multiple layers of intermediate representation, based on MLIR, for processing various optimizations. In Nadya and MLIR, this kind of IR is called a 'dialect'. Here is a list of the dialects and some of the optimizations performed on each.

High-level dialects

  1. NadyaCore dialect

  • Models ownership semantics of Nadya's programming model

  • Models pattern matching and union types

  • Models user defined type operations

  • Forwarding of l-values to r-values is performed here

  • Mem2Reg (promoting memory loads/stores to SSA values) is performed here

  2. NadyaFunc dialect

  • Models function calls, symbols, and closures

  • Inlining optimization is performed here

  3. NadyaSCF dialect

  • Models control flow of the program

  • Analysis of parallelism and loop unrolling is performed here

  • Analysis and simplification of if/else operations is performed here

  • Loop invariant code motion (LICM) is performed here

  4. NadyaAST dialect

  • Models the AST construction feature of Nadya

Mid-level dialects

  1. NadyaArith dialect

  • Models arithmetic operations performed in Nadya

  • Optimizations on arithmetic are performed here

  • Code vectorization is performed here

  2. NadyaTensor dialect

  • Models operations on tensors

  • Optimizations on tensor memory operations are performed here

Low-level dialects

  1. NadyaBase dialect

  • Models pointers, indirect calls, and struct accesses

  2. NadyaNeon dialect

  • Models forwarding to ARM Neon intrinsics

Each dialect is positioned at a certain level of the hierarchy and serves its own purpose. Dialects can interact with each other to perform certain analyses and transformations, but each has its own role.

ndc progressively converts operations written in high-level IR into low-level IR. A dialect can also be transformed within itself, without being converted to another dialect. However, conversion from a lower-level dialect to a higher-level one is never performed (that's why we call it a 'hierarchy').
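
As a mental model (a toy OCaml sketch, not ndc's real dialects), lowering is a one-way rewrite from a higher-level op to lower-level ops:

(* Toy high-level op: add two whole tensors of a known length. *)
type high_op = TensorAdd of { dst : string; lhs : string; rhs : string; len : int }

(* Toy low-level ops: an explicit loop of element-wise adds. *)
type low_op =
  | For of int * low_op list             (* trip count, loop body *)
  | ElemAdd of string * string * string  (* dst[i] <- lhs[i] + rhs[i] *)

(* Lowering rewrites high-level ops into low-level ones. There is
   deliberately no function in the other direction: high-level facts
   such as "this whole loop is a tensor add" are lost on the way down. *)
let lower (TensorAdd { dst; lhs; rhs; len }) =
  For (len, [ ElemAdd (dst, lhs, rhs) ])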

Here is an example of the IR. This IR has been converted directly from the AST, resolving the semantics of the program.

%24 = nadyaTensor.constructRange %arg5, %23, %c1_i32_26 : (i32, i32, i32) -> !nadyaTensor.range<i32> loc(#loc37)
%25 = nadyaTensor.load %0[%arg2, %24] : (!nadyaTensor.tensor<f32,(-1, -1)>, <i32, !nadyaTensor.range<i32>>) -> !nadyaTensor.tensor<f32,(-1)> loc(#loc38)
%26 = nadyaArith.addTensor %19, %25 : (!nadyaTensor.tensor<f32,(-1)>, !nadyaTensor.tensor<f32,(-1)>) -> !nadyaTensor.tensor<f32,(-1)> loc(#loc33)
nadyaTensor.store %26, %arg6[%arg2, %12] : (!nadyaTensor.tensor<f32,(-1)>, !nadyaTensor.tensor<f32,(-1, -1)>, <i32, !nadyaTensor.range<i32>>) loc(#loc29)

After running the vectorization and shape inference passes, the IR above is translated as follows.

%19 = nadyaTensor.constructRange %arg5, %18, %c1_i32 : (i32, i32, i32) -> !nadyaTensor.range<i32>
%20 = nadyaTensor.load %0[%arg2, %19] : (!nadyaTensor.tensor<f32,(4096, 64)>, <i32, !nadyaTensor.range<i32>>) -> !nadyaTensor.tensor<f32,(1),vector>
%21 = nadyaArith.addTensor %15, %20 : (!nadyaTensor.tensor<f32,(-1)>, !nadyaTensor.tensor<f32,(1),vector>) -> !nadyaTensor.tensor<f32,(-1)>
nadyaTensor.store %21, %arg6[%arg2, %10] : (!nadyaTensor.tensor<f32,(-1)>, !nadyaTensor.tensor<f32,(4096, 64)>, <i32, !nadyaTensor.range<i32>>)

You can see that some of the -1s were converted to actual numbers. This happens because the compiler performed a constant propagation pass together with automatic shape inference. That is, the compiler pre-computes deterministic values in the program and infers the shapes of the tensors used in it.

The compiler also finds vectorization opportunities in the program. We can see some tensors with vector attached to the end of their type. This means operations on such tensors can be performed with SIMD intrinsics without any side effects.

If we go further, the IR is converted as follows.

nadyaSCF.for %arg4 from %c0_5 to %40 step %c1_6 {
  %52 = arith.andi %arg4, %42 : index
  %53 = arith.andi %arg4, %44 : index
  %54 = arith.addi %arg4, %c1_6 : index
  %55 = arith.addi %52, %c1_6 : index
  %56 = arith.addi %53, %c1_6 : index
  %57 = nadyaTensor.constructRange %52, %55, %c1_6 : (index, index, index) -> !nadyaTensor.range<index>
  %58 = nadyaTensor.constructRange %53, %56, %c1_6 : (index, index, index) -> !nadyaTensor.range<index>
  %59 = nadyaTensor.constructRange %arg4, %54, %c1_6 : (index, index, index) -> !nadyaTensor.range<index>
  %60 = nadyaTensor.load %24[%57] : (!nadyaTensor.tensor<f32,(-1)>, <!nadyaTensor.range<index>>) -> !nadyaTensor.tensor<f32,(1),vector>
  %61 = nadyaArith.addTensor %60, %31 : (!nadyaTensor.tensor<f32,(1),vector>, !nadyaTensor.tensor<f32,(1),vector>) -> !nadyaTensor.tensor<f32,(1),vector>
  nadyaTensor.store %61, %36[%59] : (!nadyaTensor.tensor<f32,(1),vector>, !nadyaTensor.tensor<f32,(-1)>, <!nadyaTensor.range<index>>)
} {parallel = false}

Now we can see that the compiler generated a loop. In this loop, elements are accessed from memory by index, and the compiler performs operations on them. For tensors with vector attached to their type, the operation can be forwarded to SIMD. In this specific case, however, the size of the tensor is 1, so it is forwarded as a normal add operation (if it were any bigger than 1, it would be translated to a SIMD operation such as AVX or Neon).

Now, with some further optimizations and progressive lowering through MLIR, this IR is finally converted into LLVM IR, another intermediate representation defined by the LLVM project.

Here is the LLVM IR for the body of the loop above.

^bb9:  // pred: ^bb8
  %84 = llvm.and %82, %81  : i64
  %85 = llvm.getelementptr %45[%84] : (!llvm.ptr, i64) -> !llvm.ptr, f32
  %86 = llvm.load %85 : !llvm.ptr -> f32
  %87 = llvm.mlir.undef : vector<1xf32>
  %88 = llvm.mlir.constant(0 : i32) : i32
  %89 = llvm.insertelement %86, %87[%88 : i32] : vector<1xf32>
  %90 = llvm.shufflevector %89, %87 [0] : vector<1xf32> 
  %91 = llvm.fadd %90, %69  {fastmathFlags = #llvm.fastmath<contract>} : vector<1xf32>
  %92 = llvm.mlir.constant(0 : i64) : i64
  %93 = llvm.extractelement %91[%92 : i64] : vector<1xf32>
  %94 = llvm.getelementptr %73[%82] : (!llvm.ptr, i64) -> !llvm.ptr, f32
  llvm.store %93, %94 : f32, !llvm.ptr
  %95 = llvm.add %82, %4  : i64
  llvm.br ^bb8(%95 : i64)

We can see that the LLVM IR still performs what we were doing previously, generating the correct function output.

This LLVM IR is converted into a binary after LLVM's internal optimization passes (the ones included in clang -O3), and that becomes the final output of ndc!

The Nadya language has its own semantics that make compiler analysis easier, and our customized intermediate representation maximizes this strength. Nadya's semantics, its IR, and the optimization passes on it work together to generate fast and efficient code.
