Jin-Hwan Shin
June 17, 2024
Hello, my name is Jin-Hwan Shin, and I am developing the runtime at ENERZAi. In previous posts, we introduced our high-performance inference engine Optimium and its underlying programming language, Nadya. This week's post is about the runtime Optimium uses to execute models. The runtime performs a variety of tasks, and today I would like to introduce memory planning, which allocates memory efficiently.
Memory is finite
Compared to the past, we don’t worry as much about memory usage these days. Early smartphones shipped with 128MB or 512MB of RAM, but now 8GB or more is not uncommon. Laptops that typically came with 2GB of RAM a few years ago now often come with 16GB or even 32GB. In this environment, most software rarely exhausts memory or crashes due to memory shortage.
However, the situation is different for AI. Recent popular LLMs (Large Language Models) use anywhere from several gigabytes to hundreds of gigabytes of memory. It is difficult to run most of these models unless you have a server-grade computer with a large memory capacity, and the problem becomes even harder when you need to serve multiple models simultaneously. The issue is magnified at the edge. Edge devices often have even less memory than smartphones, and besides running the model they may need to perform other tasks at the same time, such as collecting input data or interacting with users, leaving even less memory available for inference.
Despite the significant increase in memory size compared to the past, memory remains a finite resource for AI inference engines. Memory planning is therefore essential: it allocates memory efficiently, reducing the peak memory required while still allowing the model to run.
What is Memory planning?
Imagine you have a small drawer. You start piling all your documents in it. After a month, two months, or a year, the drawer will gradually fill up until you can’t put any more documents in it. At this point, you have two options: buy another drawer or empty the existing one. Although some documents in the existing drawer might be useful in the future, you are not likely to take a look at most of them. By discarding unnecessary documents, you can continue to store your documents in a single drawer.
Memory ultimately stores information about the variables needed for current and future computations. Since memory can be written and erased repeatedly, we can remove variables that are no longer needed and store other necessary variables in their place. Memory planning is the process of deciding which information to keep and which to erase so that memory is used as efficiently as possible. From now on, let me introduce how memory planning is used in Optimium.
Memory planning — Cutting and Pasting Tensors
First, we need to determine the execution order of the layers within the model. Knowing the execution order of the layers helps us decide which tensor to allocate first. Let’s take a look at the simple model below.
Assume that the layers are executed in the order of 1 → 2 → 4 → 3 → 5 in the graph above.
Based on this information, we track which tensors each layer accesses. For example, Layer 1 needs to access Tensor A and Tensor B, while Layer 2 needs to access Tensor B and Tensor C.
Once we know which tensors each layer accesses and how many times each tensor is referenced, memory planning can begin based on this information.
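To make this concrete, here is a minimal sketch of the analysis step in Python. The dict-based representation is an illustrative assumption rather than Optimium's actual data structures, and only the access sets for Layers 1, 2, and 4 come from the example above; those for Layers 3 and 5 are hypothetical.

```python
from collections import Counter

# Execution order determined in the previous step: Layer 1 -> 2 -> 4 -> 3 -> 5.
execution_order = [1, 2, 4, 3, 5]

# Tensors accessed by each layer. Layers 1, 2, and 4 follow the example above;
# the sets for Layers 3 and 5 are hypothetical placeholders.
accesses = {
    1: ["A", "B"],
    2: ["B", "C"],
    4: ["C", "E"],
    3: ["C", "D"],
    5: ["D", "E", "F"],
}

# Count how many times each tensor is referenced across the whole model.
ref_counts = Counter(t for layer in execution_order for t in accesses[layer])
print(ref_counts)  # C is referenced 3 times; B, D, and E twice; A and F once
```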
When performing memory planning, we use an Allocation Table to track the currently active tensors and their memory ranges, and a Memory Plan Table to record which memory range each tensor will be allocated to.
For the sake of clarity, let’s assume the size of all tensors is fixed at 100.
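Sketched in the same style, the two tables could look like the snippet below. The field names and the fixed 100-unit tensor size are simplifying assumptions for this post, not Optimium's actual types.

```python
from dataclasses import dataclass, field

TENSOR_SIZE = 100  # every tensor is assumed to occupy 100 units, as above

@dataclass
class AllocationTable:
    """Tensors that are currently live, mapped to the (offset, size) range they occupy."""
    live: dict[str, tuple[int, int]] = field(default_factory=dict)

@dataclass
class MemoryPlanTable:
    """Final (offset, size) range assigned to every tensor once planning is done."""
    plan: dict[str, tuple[int, int]] = field(default_factory=dict)
```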
For example, Layer 4 needs to access Tensor C and Tensor E. Since Tensor C was already allocated during the memory planning for Layer 2, it can be used as is, but Tensor E has not been allocated yet.
At this point, Tensor B in the Allocation Table is no longer referenced by any layer, so its memory can be safely reused. Therefore, Tensor E can replace Tensor B, reusing its memory space.
As you can see, memory planning needs to take both the layer execution order and each layer's tensor accesses into account.
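Putting the pieces together, a simple first-fit planner along these lines might look like the sketch below. It is a simplified stand-in for Optimium's actual algorithm, and it reuses the hypothetical access sets from the earlier snippet (only Layers 1, 2, and 4 are taken from the example above).

```python
from collections import Counter

TENSOR_SIZE = 100  # fixed tensor size assumed throughout this post

def plan_memory(execution_order, accesses):
    """First-fit planning sketch: reuse freed 100-unit slots whenever possible."""
    ref_counts = Counter(t for layer in execution_order for t in accesses[layer])
    allocation = {}  # Allocation Table: live tensor -> offset of its slot
    plan = {}        # Memory Plan Table: tensor -> offset it will be allocated to
    peak = 0

    for layer in execution_order:
        for tensor in accesses[layer]:
            if tensor not in allocation:
                # Place the tensor at the lowest free offset, extending the arena if needed.
                used = set(allocation.values())
                offset = 0
                while offset in used:
                    offset += TENSOR_SIZE
                allocation[tensor] = offset
                plan[tensor] = offset
        peak = max(peak, max(off + TENSOR_SIZE for off in allocation.values()))

        # A tensor with no remaining references frees its slot for later tensors.
        for tensor in accesses[layer]:
            ref_counts[tensor] -= 1
            if ref_counts[tensor] == 0:
                del allocation[tensor]

    return plan, peak

order = [1, 2, 4, 3, 5]
accesses = {1: ["A", "B"], 2: ["B", "C"], 4: ["C", "E"], 3: ["C", "D"], 5: ["D", "E", "F"]}

plan, peak = plan_memory(order, accesses)
print(plan)  # Tensor E lands at Tensor B's old offset, as described above
print(peak)  # 300 units at peak, versus 600 if all six tensors stayed allocated
```

In this toy run, slot reuse alone halves the peak memory, which is the same kind of reduction illustrated by the hypothetical model described next.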
The hypothetical model below originally used 600 units of memory; after memory planning, its memory usage is reduced by 50%, to 300 units.
Memory on a Diet!
Optimium uses memory planning to provide more efficient memory usage. Comparing memory usage with memory planning enabled and disabled, as shown in the graph below, we see reductions ranging from 76% to 92%. (Results may vary depending on the model.)
Furthermore, through various memory optimizations, we aim not only to reduce memory usage but also to enhance performance. We achieved up to 1.55 times acceleration on the AMD64 platform, and up to 1.34 times acceleration on the ARM64 platform.
Conclusion
In this post, we briefly explored how Optimium's runtime reduces memory usage and accelerates execution. In reality, the runtime performs many other tasks besides memory planning. The Optimium team is currently developing various features such as a remote API for user convenience and parallel execution for further acceleration. We plan to upload more posts about the optimization techniques applied to Optimium, so please stay tuned, and thank you for your interest!