Technology
This time, we will examine how to utilize the GPU using OpenCL. OpenCL is an open parallel computing framework created and managed by the Khronos Group.
Jinhwan Shin
August 2, 2024
Hello! This is Jinhwan Shin again. As I mentioned in my previous post, I am currently developing Runtime at ENERZAi. In the previous article, we explored how to use the GPU for computation with the Vulkan Compute Shader. This time, we will examine how to utilize the GPU using OpenCL. OpenCL is an open parallel computing framework created and maintained by the Khronos Group. Unlike the Vulkan Compute Shader, which is implemented on top of a graphics library, OpenCL is a GPGPU library designed specifically for general-purpose GPU computing. It plays the same role as NVIDIA's CUDA, but differs in that, while CUDA supports only NVIDIA GPUs, OpenCL is designed to allow programming for accelerators such as CPUs and NPUs as well. Being an open technical standard, any vendor who wants to support it can do so, which is a significant advantage.
As with the previous article on Vulkan Shaders, I will first introduce the terms and concepts needed to understand OpenCL, and then show how to use it in practice through brief code examples.
OpenCL Runtime
The OpenCL runtime executes compiled code on the GPU through the flow shown in the diagram above. Below, the key terms in that flow are explained in detail for your reference.
Platform
A Platform in OpenCL is a unit that groups a Device, which performs calculations, and a Host that manages the Device. Simply put, one OpenCL implementation can be considered a Platform.
Therefore, if a computer has an Intel CPU, an integrated graphics card, and two NVIDIA external graphics cards installed, it can be seen as having three Platforms as shown below. (Generally, Intel CPU platforms and GPU platforms are considered separate Platforms.)
Intel CPU Platform: Intel CPU Device
Intel GPU Platform: Intel GPU Device
NVIDIA GPU Platform: NVIDIA GPU Device & NVIDIA GPU Device
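To make this concrete, here is a minimal sketch of enumerating platforms with clGetPlatformIDs and clGetPlatformInfo. Error handling is omitted for brevity, and the variable names are my own:

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    // Fetch up to 16 platforms; num_platforms receives the actual count.
    cl_platform_id platforms[16];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(16, platforms, &num_platforms);

    // Print each platform's name (e.g. "NVIDIA CUDA", "Intel(R) OpenCL").
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
```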
Device
A Device in OpenCL is a unit that performs calculations. This unit can be implemented in various forms, such as CPU, GPU, or DSP, depending on the implementation method. In OpenCL, Device types are defined using the following Device type constants:
CL_DEVICE_TYPE_CPU: CPU
CL_DEVICE_TYPE_GPU: GPU
CL_DEVICE_TYPE_ACCELERATOR: AI accelerators such as the Hexagon DSP, GNA, or TPU
CL_DEVICE_TYPE_CUSTOM: Other devices
A Device is composed of multiple Compute units that perform calculations, similar to CPU cores, and each Compute unit is also made up of several Processing elements. For example,
Intel i5–12400 consists of 12 Compute units, with each Compute unit consisting of 1 Processing element.
NVIDIA RTX 3060 consists of 28 Compute units, with each Compute unit consisting of 128 Processing elements.
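You can verify these numbers on your own hardware. The following fragment is a sketch that picks the first GPU device of a platform and queries its compute-unit count; it assumes a cl_platform_id named platform obtained as in the previous sketch:

```c
// Pick the first GPU device of the platform.
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Query the number of compute units (e.g. 28 on an RTX 3060).
cl_uint compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
printf("Compute units: %u\n", compute_units);
```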
Work Group & Work Item
A Work item is a unit of work performed by a single Processing element. Simply put, it’s like a single core executing a function. Multiple Work Items are grouped to form a Work Group, and the collection of Work Groups completes the total computation.
There are limits to the size of a Work Group that can be executed simultaneously. This information can be queried with the clGetDeviceInfo function using the following tags (see the sketch after this list):

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: The maximum number of work-item dimensions supported by the Device.
CL_DEVICE_MAX_WORK_ITEM_SIZES: The maximum number of work items per dimension supported by the Device.
CL_DEVICE_MAX_WORK_GROUP_SIZE: The maximum size of a work group supported by the Device. The product of all dimension sizes in a work group cannot exceed this value.
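A sketch of querying these limits, continuing from the device obtained earlier:

```c
// Maximum total work-group size (product of all dimensions).
size_t max_group_size = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_group_size), &max_group_size, NULL);

// Number of supported dimensions (typically 3) and per-dimension limits.
cl_uint dims = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                sizeof(dims), &dims, NULL);

size_t sizes[3] = {0};
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(sizes), sizes, NULL);

printf("Max work-group size: %zu\n", max_group_size);
for (cl_uint i = 0; i < dims && i < 3; ++i)
    printf("Max work items in dim %u: %zu\n", i, sizes[i]);
```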
Context
A Context in OpenCL is an object that manages resources such as Buffers, Programs, and Command Queues. You can create a Context with clCreateContext, or create one while selecting a Platform and Device with clCreateContextFromType.
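A minimal sketch of both creation paths, continuing from the device above:

```c
cl_int err = CL_SUCCESS;

// Create a context for an explicitly chosen device.
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

// Or let the implementation pick a device of the requested type.
cl_context context2 = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                              NULL, NULL, &err);
```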
Command Queue
A Command Queue is the gateway for all commands that interact with a device, such as reading/writing a buffer or executing a kernel. Functions in the clEnqueue~ family insert commands into the command queue. You can create a command queue with clCreateCommandQueueWithProperties.

By default, OpenCL executes commands one at a time, in order (sequentially). For better performance, you can enable out-of-order (parallel) execution by passing the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when creating the queue, as in the sketch below. For details on synchronization, please refer to the section on Events described later.
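Here is a sketch of creating both an in-order (default) queue and an out-of-order queue:

```c
// Default in-order queue: commands run one at a time, in submission order.
cl_command_queue queue =
    clCreateCommandQueueWithProperties(context, device, NULL, &err);

// Out-of-order queue: independent commands may overlap; use Events to
// express the dependencies you still need.
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
};
cl_command_queue ooo_queue =
    clCreateCommandQueueWithProperties(context, device, props, &err);
```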
Buffer
A Buffer in OpenCL is an object that represents memory. Depending on the type of data it holds, it is classified as follows:
Buffer: Can hold any type of data in a 1D array, from scalar types like int and float to vector types or structures like float4. Created with clCreateBuffer. You can use clCreateSubBuffer to work with a part of an existing buffer.
Image: Represents a 1D/2D/3D image, typically used for texture operations. Created with clCreateImage.
Pipe: Used to store small data in FIFO order; not directly accessible by the host, but accessible on the device with read_pipe and write_pipe. Created with clCreatePipe.
Depending on where the memory resides, it is classified as follows:
1. Host Memory
Memory on the host side.
2. Device Memory
Global Memory: The device's main memory; simply think of it as GPU memory. It can be mapped to the host, allowing the host to read and write it, and it is shared among all workgroups.
Local Memory: Device memory that the host cannot access. It is shared among work items in the same workgroup, but not between different workgroups.
Private Memory: Device memory that the host cannot access. It is the dedicated memory of a Processing Element, accessible only by that Processing Element itself.
When creating a buffer, you can set various options based on its intended use:
CL_MEM_READ_WRITE: Creates a buffer readable and writable by the device.
CL_MEM_READ_ONLY: Creates a buffer readable only by the device.
CL_MEM_WRITE_ONLY: Creates a buffer writable only by the device.
CL_MEM_USE_HOST_PTR: Maps the given host memory for device access.
CL_MEM_ALLOC_HOST_PTR: Allocates host memory and maps it for device access.
CL_MEM_COPY_HOST_PTR: Creates a buffer and copies the host memory contents into it.
CL_MEM_HOST_WRITE_ONLY: Creates a buffer writable only by the host.
CL_MEM_HOST_READ_ONLY: Creates a buffer readable only by the host.
CL_MEM_HOST_NO_ACCESS: Creates a buffer not accessible by the host.
Note
If the CL_MEM_HOST_NO_ACCESS option is given, the buffer is never touched by the host, which can provide a performance advantage. Therefore, when allocating an intermediate buffer, it is recommended to use the CL_MEM_HOST_NO_ACCESS flag unless host access is actually needed.

The CL_MEM_USE_HOST_PTR option might not always be zero-copy (device-mapped memory). If the alignment of the host memory does not meet the device's requirements, a copy may occur. Additionally, this option is recommended by some vendors (e.g., Intel) and discouraged by others (e.g., Arm), so it should be used appropriately depending on the target device.

There is a maximum size limit for buffers. You can query this limit using the clGetDeviceInfo function with the CL_DEVICE_MAX_MEM_ALLOC_SIZE tag. A buffer-creation sketch follows below.
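As a sketch of how these flags combine in practice (the buffer sizes and names are illustrative):

```c
// Intermediate buffer: device read/write, no host access at all.
cl_mem scratch = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                1 << 20 /* 1 MiB */, NULL, &err);

// Input buffer: read-only on the device, initialized by copying host data.
float host_input[1024] = {0};
cl_mem input = clCreateBuffer(context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(host_input), host_input, &err);
```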
Program
A Program in OpenCL is a collection of Kernels and functions; it is essentially the result of compiling a source file. Multiple compiled source files can be linked into a larger Program using clLinkProgram.
Programs are categorized into three types:
Source: Written in OpenCL C or OpenCL C++, created with clCreateProgramWithSource.
IL: An intermediate-representation binary, created with clCreateProgramWithIL. Usable when the cl_khr_il_program extension is supported. SPIR-V can also be used when the cl_khr_spir extension is supported, in which case the -x spir option must be specified.
Binary: A target-device-dependent binary, created with clCreateProgramWithBinary.
Programs created from Source or IL are compiled using clCompileProgram and linked into a larger Program using clLinkProgram, as shown in the sketch below.
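A sketch of the two-step path (for a single source file, clBuildProgram performs the compile and link in one call):

```c
// Create a program object from OpenCL C source code.
const char *source = "...";  /* your kernel source string */
cl_program program =
    clCreateProgramWithSource(context, 1, &source, NULL, &err);

// Step 1: compile the source into an object form.
clCompileProgram(program, 1, &device, "", 0, NULL, NULL, NULL, NULL);

// Step 2: link one or more compiled programs into an executable program.
cl_program linked =
    clLinkProgram(context, 1, &device, "", 1, &program, NULL, NULL, &err);
```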
Kernel
A Kernel is a set of commands executed by a Processing Element. The differences from functions are as follows:
It serves as an entry point.
The keyword kernel or __kernel must be present in the function declaration.
The return type must be void.
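For example, a minimal OpenCL C kernel that adds two float arrays looks like this:

```c
// OpenCL C: each work item processes exactly one array element.
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *out) {
    size_t i = get_global_id(0);  // global index of this work item
    out[i] = a[i] + b[i];
}
```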
Kernel arguments have restrictions, which can be queried with clGetDeviceInfo using the following tags:

CL_DEVICE_MAX_PARAMETER_SIZE: Maximum size of a single parameter, excluding buffers.
CL_DEVICE_GLOBAL_MEM_SIZE: Size of global memory.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: Maximum size of constant buffers (pointers qualified with constant or __constant).
CL_DEVICE_MAX_CONSTANT_ARGS: Maximum number of constant buffers.
CL_DEVICE_LOCAL_MEM_SIZE: Size of local memory.
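Creating a kernel object and binding its arguments is straightforward. The following sketch assumes the linked program from the earlier sketch and three cl_mem buffers with the illustrative names buf_a, buf_b, and buf_out:

```c
// Look up the kernel by its function name.
cl_kernel kernel = clCreateKernel(linked, "vector_add", &err);

// Bind one buffer to each kernel parameter, by position.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_out);
```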
Events
An Event is a synchronization object that describes dependencies between enqueued commands. The clEnqueue~ family of functions accepts a wait list and outputs an event: the wait list contains events to wait for, and the output event is a signal object indicating that the command has finished. This mechanism is how commands are synchronized.

To create a user-defined event, use clCreateUserEvent and signal it with clSetUserEventStatus.
There are also two additional synchronization mechanisms: markers and barriers.
Similarities
The output event signals when all events in the wait list are signaled.
Differences
A marker allows commands enqueued after it to execute even if the events in its wait list are not yet signaled; it simply signals its output event once they all are.
A barrier blocks execution of subsequent commands until all events in its wait list are signaled.
Markers are enqueued with clEnqueueMarkerWithWaitList and barriers with clEnqueueBarrierWithWaitList, as in the sketch below.
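Here is a sketch of chaining commands with events, continuing the running example: the kernel waits for a non-blocking write, and a barrier holds back everything enqueued afterwards:

```c
// Non-blocking write; write_done signals when the transfer finishes.
cl_event write_done;
clEnqueueWriteBuffer(queue, buf_a, CL_FALSE, 0,
                     sizeof(host_input), host_input, 0, NULL, &write_done);

// The kernel starts only after write_done is signaled.
size_t global_size = 1024;
cl_event kernel_done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       1, &write_done, &kernel_done);

// Barrier: commands enqueued after this point wait for kernel_done.
clEnqueueBarrierWithWaitList(queue, 1, &kernel_done, NULL);
```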
Extension
Extensions are functionalities added by the Khronos Group or device manufacturers beyond the basic OpenCL features.
Whether particular functions and types are available depends on the extensions a device supports. Therefore, you must query the device for extension support before using such functionality (a query sketch follows the list). Here are a few features that may not be usable everywhere because they are extensions:
cl_khr_fp16: Supports float16 operations.
cl_khr_fp64: Supports float64 operations.
cl_khr_il_program, cl_khr_spir: Supports SPIR-V IR.
cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics: Supports int32 atomic operations on global memory.
cl_khr_local_int32_base_atomics, cl_khr_local_int32_extended_atomics: Supports int32 atomic operations on local memory.
cl_khr_int64_base_atomics, cl_khr_int64_extended_atomics: Supports int64 atomic operations (these cover both global and local memory).
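Since CL_DEVICE_EXTENSIONS returns a space-separated string of extension names, a simple substring check is enough for a quick test (a sketch; requires <string.h>):

```c
// Query the space-separated extension list of the device.
char extensions[4096];
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                sizeof(extensions), extensions, NULL);

// Check for fp16 support before using half-precision kernels.
if (strstr(extensions, "cl_khr_fp16") != NULL)
    printf("float16 operations are supported\n");
```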
Execute Kernel with OpenCL
To execute a kernel in OpenCL, you generally follow these steps (a complete sketch follows the list):

1. Select the device that will execute the kernel, by enumerating platforms and devices.
2. Create a context and a command queue.
3. Load the program.
4. Compile and link the loaded program.
5. Create a kernel from the linked program.
6. Create input and output buffers for the kernel.
7. Set the kernel arguments.
8. Write data to the buffers if necessary.
9. Execute the kernel.
10. Read data from the buffers if necessary.
11. Repeat steps 8 to 10 as needed.
12. Release all used resources.
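Putting it all together, below is a minimal, self-contained sketch of the whole flow for a vector-add kernel. It uses clBuildProgram (compile and link in one call), skips step 11, and omits error checking for brevity; it is a bare-bones illustration, not production code:

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

static const char *kSource =
    "__kernel void vector_add(__global const float *a,\n"
    "                         __global const float *b,\n"
    "                         __global float *out) {\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], out[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_int err;

    // 1. Select a platform and device.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // 2. Create a context and a command queue.
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, &err);

    // 3-4. Load and build (compile + link) the program.
    cl_program program =
        clCreateProgramWithSource(ctx, 1, &kSource, NULL, &err);
    clBuildProgram(program, 1, &device, "", NULL, NULL);

    // 5. Create the kernel.
    cl_kernel kernel = clCreateKernel(program, "vector_add", &err);

    // 6. Create input/output buffers.
    cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(a), NULL, &err);
    cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(b), NULL, &err);
    cl_mem buf_o = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, &err);

    // 7. Set the kernel arguments.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_o);

    // 8. Write input data (blocking writes for simplicity).
    clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf_b, CL_TRUE, 0, sizeof(b), b, 0, NULL, NULL);

    // 9. Execute the kernel: one work item per element.
    size_t global_size = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    // 10. Read the result back (blocking read waits for the kernel).
    clEnqueueReadBuffer(queue, buf_o, CL_TRUE, 0, sizeof(out), out,
                        0, NULL, NULL);
    printf("out[42] = %f\n", out[42]);  // expected: 126.0

    // 12. Release all resources.
    clReleaseMemObject(buf_a); clReleaseMemObject(buf_b);
    clReleaseMemObject(buf_o);
    clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```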
So far, we have explored the concepts and usage of OpenCL. Does understanding the concepts feel like one thing, while actually doing the parallel programming and optimization still seems difficult and daunting? Don't worry! We are diligently developing Optimium for people like you.
Optimium, ENERZAi's AI inference optimization engine, performs automatic optimization and currently supports both single-thread and multi-thread execution on AMD64 and Arm CPUs. It also accelerates not only AI models but also preprocessing tasks, with GPU support coming soon! If you're interested in Optimium or have inquiries about our AI inference optimization technology, feel free to contact us at contact@enerzai.com or visit our LinkedIn (www.linkedin.com/company/enerzai) page! Hope to see you again 🙂