Technology
This time, we will examine how to utilize the GPU using OpenCL. OpenCL is an open parallel computing framework created and managed by the Khronos Group.
Jinhwan Shin
August 2, 2024
Hello! This is Jinhwan Shin again. As I mentioned in my previous post, I am currently developing Runtime at ENERZAi. In the previous article, we explored how to use the GPU for computation with the Vulkan Compute Shader. This time, we will examine how to utilize the GPU using OpenCL. OpenCL is an open parallel computing framework created and maintained by the Khronos Group. Unlike the Vulkan Compute Shader, which is implemented on top of a graphics library, OpenCL is a GPGPU library designed specifically for general-purpose GPU computing. It plays the same role as NVIDIA's CUDA, but differs in that, while CUDA supports only NVIDIA GPUs, OpenCL is designed to allow programming for accelerators such as CPUs and NPUs as well. Being an open technical standard, any vendor who wants to support it can do so, which is a significant advantage.
As with the previous article on Vulkan Shaders, I will first introduce the terms and concepts needed to understand OpenCL, and then show how to use it in practice through brief code examples.
OpenCL Runtime
The OpenCL runtime executes compiled code on the GPU through the flow shown in the diagram above. Below, the key terms in that flow are explained in detail for your reference.
Platform
A Platform in OpenCL is a unit that groups a Device, which performs calculations, and a Host that manages the Device. Simply put, one OpenCL implementation can be considered a Platform.
Therefore, if a computer has an Intel CPU, an integrated graphics card, and two NVIDIA external graphics cards installed, it can be seen as having three Platforms as shown below. (Generally, Intel CPU platforms and GPU platforms are considered separate Platforms.)
Intel CPU Platform: Intel CPU Device
Intel GPU Platform: Intel GPU Device
NVIDIA GPU Platform: NVIDIA GPU Device & NVIDIA GPU Device
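To make this concrete, here is a minimal sketch of enumerating platforms with clGetPlatformIDs and clGetPlatformInfo. Error handling is omitted for brevity, and the variable names are my own:

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    // Fetch up to 16 platforms; num_platforms receives the actual count.
    cl_platform_id platforms[16];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(16, platforms, &num_platforms);

    // Print each platform's name (e.g. "NVIDIA CUDA", "Intel(R) OpenCL").
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
```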
Device
A Device in OpenCL is a unit that performs calculations. This unit can be implemented in various forms, such as CPU, GPU, or DSP, depending on the implementation method. In OpenCL, Device types are defined using the following Device type constants:
CL_DEVICE_TYPE_CPU: CPU
CL_DEVICE_TYPE_GPU: GPU
CL_DEVICE_TYPE_ACCELERATOR: AI accelerators such as the Hexagon DSP, GNA, or TPU
CL_DEVICE_TYPE_CUSTOM: Other devices
A Device is composed of multiple Compute units that perform calculations, similar to CPU cores, and each Compute unit is also made up of several Processing elements. For example,
Intel i5–12400 consists of 12 Compute units, with each Compute unit consisting of 1 Processing element.
NVIDIA RTX 3060 consists of 28 Compute units, with each Compute unit consisting of 128 Processing elements.
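You can verify these numbers on your own hardware. The following fragment is a sketch that picks the first GPU device of a platform and queries its compute-unit count; it assumes a cl_platform_id named platform obtained as in the previous sketch:

```c
// Pick the first GPU device of the platform.
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Query the number of compute units (e.g. 28 on an RTX 3060).
cl_uint compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
printf("Compute units: %u\n", compute_units);
```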
Work Group & Work Item
A Work item is a unit of work performed by a single Processing element. Simply put, it’s like a single core executing a function. Multiple Work Items are grouped to form a Work Group, and the collection of Work Groups completes the total computation.
There are limits to the size of a Work Group that can be executed simultaneously. This information can be queried with the clGetDeviceInfo function using the following tags (see the sketch after this list):

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: The maximum number of work-item dimensions supported by the Device.
CL_DEVICE_MAX_WORK_ITEM_SIZES: The maximum number of work items per dimension supported by the Device.
CL_DEVICE_MAX_WORK_GROUP_SIZE: The maximum size of a work group supported by the Device. The product of all dimension sizes in a work group cannot exceed this value.
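A sketch of querying these limits, continuing from the device obtained earlier:

```c
// Maximum total work-group size (product of all dimensions).
size_t max_group_size = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_group_size), &max_group_size, NULL);

// Number of supported dimensions (typically 3) and per-dimension limits.
cl_uint dims = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                sizeof(dims), &dims, NULL);

size_t sizes[3] = {0};
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(sizes), sizes, NULL);

printf("Max work-group size: %zu\n", max_group_size);
for (cl_uint i = 0; i < dims && i < 3; ++i)
    printf("Max work items in dim %u: %zu\n", i, sizes[i]);
```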
Context
A Context in OpenCL is an object that manages resources such as Buffers, Programs, and Command Queues. You can create a Context with clCreateContext, or create one while selecting a Platform and Device with clCreateContextFromType.
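A minimal sketch of both creation paths, continuing from the device above:

```c
cl_int err = CL_SUCCESS;

// Create a context for an explicitly chosen device.
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

// Or let the implementation pick a device of the requested type.
cl_context context2 = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                              NULL, NULL, &err);
```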
Command Queue
A Command Queue is the gateway for all commands that interact with a device, such as reading/writing a buffer or executing a kernel. Functions in the clEnqueue~ family insert commands into the command queue. You can create a command queue with clCreateCommandQueueWithProperties.

By default, OpenCL executes commands one at a time, in order (sequentially). For better performance, you can enable out-of-order (parallel) execution by passing the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property when creating the queue, as in the sketch below. For details on synchronization, please refer to the section on Events described later.
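Here is a sketch of creating both an in-order (default) queue and an out-of-order queue:

```c
// Default in-order queue: commands run one at a time, in submission order.
cl_command_queue queue =
    clCreateCommandQueueWithProperties(context, device, NULL, &err);

// Out-of-order queue: independent commands may overlap; use Events to
// express the dependencies you still need.
cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
};
cl_command_queue ooo_queue =
    clCreateCommandQueueWithProperties(context, device, props, &err);
```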
Buffer
A Buffer in OpenCL is an object that represents memory. Depending on the type of data it holds, it is classified as follows:
Buffer: Can hold any type of data in a 1D array, from scalar types like int and float to vector types or structures like float4. Created with clCreateBuffer. You can use clCreateSubBuffer to work with a part of an existing buffer.
Image: Represents a 1D/2D/3D image, typically used for texture operations. Created with clCreateImage.
Pipe: Used to store small data in FIFO order; not directly accessible by the host, but accessible on the device with read_pipe and write_pipe. Created with clCreatePipe.
Depending on where the memory resides, it is classified as follows:
1. Host Memory
Memory on the host side.
2. Device Memory
Global Memory: The device's main memory; simply think of it as GPU memory. It can be mapped to the host, allowing the host to read and write it, and it is shared among all workgroups.
Local Memory: Device memory that the host cannot access. It is shared among work items in the same workgroup, but not between different workgroups.
Private Memory: Device memory that the host cannot access. It is the dedicated memory of a Processing Element, accessible only by that Processing Element itself.
When creating a buffer, you can set various options based on its intended use:
CL_MEM_READ_WRITE: Creates a buffer readable and writable by the device.
CL_MEM_READ_ONLY: Creates a buffer readable only by the device.
CL_MEM_WRITE_ONLY: Creates a buffer writable only by the device.
CL_MEM_USE_HOST_PTR: Maps the given host memory for device access.
CL_MEM_ALLOC_HOST_PTR: Allocates host memory and maps it for device access.
CL_MEM_COPY_HOST_PTR: Creates a buffer and copies the host memory contents into it.
CL_MEM_HOST_WRITE_ONLY: Creates a buffer writable only by the host.
CL_MEM_HOST_READ_ONLY: Creates a buffer readable only by the host.
CL_MEM_HOST_NO_ACCESS: Creates a buffer not accessible by the host.
Note
If the CL_MEM_HOST_NO_ACCESS option is given, the buffer is never touched by the host, which can provide a performance advantage. Therefore, when allocating an intermediate buffer, it is recommended to use the CL_MEM_HOST_NO_ACCESS flag unless host access is actually needed.

The CL_MEM_USE_HOST_PTR option might not always be zero-copy (device-mapped memory). If the alignment of the host memory does not meet the device's requirements, a copy may occur. Additionally, this option is recommended by some vendors (e.g., Intel) and discouraged by others (e.g., Arm), so it should be used appropriately depending on the target device.

There is a maximum size limit for buffers. You can query this limit using the clGetDeviceInfo function with the CL_DEVICE_MAX_MEM_ALLOC_SIZE tag. A buffer-creation sketch follows below.
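As a sketch of how these flags combine in practice (the buffer sizes and names are illustrative):

```c
// Intermediate buffer: device read/write, no host access at all.
cl_mem scratch = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
                                1 << 20 /* 1 MiB */, NULL, &err);

// Input buffer: read-only on the device, initialized by copying host data.
float host_input[1024] = {0};
cl_mem input = clCreateBuffer(context,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(host_input), host_input, &err);
```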
Program
A Program in OpenCL is a collection of Kernels and functions; it is essentially the result of compiling a source file. Multiple compiled source files can be linked into a larger Program using clLinkProgram.
Programs are categorized into three types:
Source: Written in OpenCL C or OpenCL C++, created with clCreateProgramWithSource.
IL: An intermediate-representation binary, created with clCreateProgramWithIL. Usable when the cl_khr_il_program extension is supported. SPIR-V can also be used when the cl_khr_spir extension is supported, in which case the -x spir option must be specified.
Binary: A target-device-dependent binary, created with clCreateProgramWithBinary.
Programs created from Source or IL are compiled using clCompileProgram and linked into a larger Program using clLinkProgram, as shown in the sketch below.
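A sketch of the two-step path (for a single source file, clBuildProgram performs the compile and link in one call):

```c
// Create a program object from OpenCL C source code.
const char *source = "...";  /* your kernel source string */
cl_program program =
    clCreateProgramWithSource(context, 1, &source, NULL, &err);

// Step 1: compile the source into an object form.
clCompileProgram(program, 1, &device, "", 0, NULL, NULL, NULL, NULL);

// Step 2: link one or more compiled programs into an executable program.
cl_program linked =
    clLinkProgram(context, 1, &device, "", 1, &program, NULL, NULL, &err);
```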
Kernel
A Kernel is a set of commands executed by a Processing Element. The differences from functions are as follows:
It serves as an entry point.
The keyword kernel or __kernel must be present in the function declaration.
The return type must be void.
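For example, a minimal OpenCL C kernel that adds two float arrays looks like this:

```c
// OpenCL C: each work item processes exactly one array element.
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *out) {
    size_t i = get_global_id(0);  // global index of this work item
    out[i] = a[i] + b[i];
}
```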
Kernel arguments have restrictions, which can be queried with clGetDeviceInfo using the following tags:

CL_DEVICE_MAX_PARAMETER_SIZE: Maximum size of a single parameter, excluding buffers.
CL_DEVICE_GLOBAL_MEM_SIZE: Size of global memory.
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: Maximum size of constant buffers (pointers qualified with constant or __constant).
CL_DEVICE_MAX_CONSTANT_ARGS: Maximum number of constant buffers.
CL_DEVICE_LOCAL_MEM_SIZE: Size of local memory.
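Creating a kernel object and binding its arguments is straightforward. The following sketch assumes the linked program from the earlier sketch and three cl_mem buffers with the illustrative names buf_a, buf_b, and buf_out:

```c
// Look up the kernel by its function name.
cl_kernel kernel = clCreateKernel(linked, "vector_add", &err);

// Bind one buffer to each kernel parameter, by position.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_out);
```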
Events
An Event is a synchronization object that describes dependencies between enqueued commands. The clEnqueue~ family of functions accepts a wait list and outputs an event: the wait list contains events to wait for, and the output event is a signal object indicating that the command has finished. This mechanism is how commands are synchronized.

To create a user-defined event, use clCreateUserEvent and signal it with clSetUserEventStatus.
There are also two additional synchronization mechanisms: markers and barriers.
Similarities
The output event signals when all events in the wait list are signaled.
Differences
A marker allows commands enqueued after it to execute even if the events in its wait list are not yet signaled; it simply signals its output event once they all are.
A barrier blocks execution of subsequent commands until all events in its wait list are signaled.
Markers are enqueued with clEnqueueMarkerWithWaitList and barriers with clEnqueueBarrierWithWaitList, as in the sketch below.
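Here is a sketch of chaining commands with events, continuing the running example: the kernel waits for a non-blocking write, and a barrier holds back everything enqueued afterwards:

```c
// Non-blocking write; write_done signals when the transfer finishes.
cl_event write_done;
clEnqueueWriteBuffer(queue, buf_a, CL_FALSE, 0,
                     sizeof(host_input), host_input, 0, NULL, &write_done);

// The kernel starts only after write_done is signaled.
size_t global_size = 1024;
cl_event kernel_done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       1, &write_done, &kernel_done);

// Barrier: commands enqueued after this point wait for kernel_done.
clEnqueueBarrierWithWaitList(queue, 1, &kernel_done, NULL);
```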
Extension
Extensions are functionalities added by the Khronos Group or device manufacturers beyond the basic OpenCL features.
Whether particular functions and types are available depends on the extensions a device supports. Therefore, you must query the device for extension support before using such functionality (a query sketch follows the list). Here are a few features that may not be usable everywhere because they are extensions:
cl_khr_fp16: Supports float16 operations.
cl_khr_fp64: Supports float64 operations.
cl_khr_il_program, cl_khr_spir: Supports SPIR-V IR.
cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics: Supports int32 atomic operations on global memory.
cl_khr_local_int32_base_atomics, cl_khr_local_int32_extended_atomics: Supports int32 atomic operations on local memory.
cl_khr_int64_base_atomics, cl_khr_int64_extended_atomics: Supports int64 atomic operations (these cover both global and local memory).
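Since CL_DEVICE_EXTENSIONS returns a space-separated string of extension names, a simple substring check is enough for a quick test (a sketch; requires <string.h>):

```c
// Query the space-separated extension list of the device.
char extensions[4096];
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                sizeof(extensions), extensions, NULL);

// Check for fp16 support before using half-precision kernels.
if (strstr(extensions, "cl_khr_fp16") != NULL)
    printf("float16 operations are supported\n");
```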
Execute Kernel with OpenCL
To execute a kernel in OpenCL, you generally follow these steps (a complete sketch follows the list):

1. Select the device that will execute the kernel, by enumerating platforms and devices.
2. Create a context and a command queue.
3. Load the program.
4. Compile and link the loaded program.
5. Create a kernel from the linked program.
6. Create input and output buffers for the kernel.
7. Set the kernel arguments.
8. Write data to the buffers if necessary.
9. Execute the kernel.
10. Read data from the buffers if necessary.
11. Repeat steps 8 to 10 as needed.
12. Release all used resources.
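Putting it all together, below is a minimal, self-contained sketch of the whole flow for a vector-add kernel. It uses clBuildProgram (compile and link in one call), skips step 11, and omits error checking for brevity; it is a bare-bones illustration, not production code:

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <stdio.h>
#include <CL/cl.h>

static const char *kSource =
    "__kernel void vector_add(__global const float *a,\n"
    "                         __global const float *b,\n"
    "                         __global float *out) {\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], out[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_int err;

    // 1. Select a platform and device.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // 2. Create a context and a command queue.
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, &err);

    // 3-4. Load and build (compile + link) the program.
    cl_program program =
        clCreateProgramWithSource(ctx, 1, &kSource, NULL, &err);
    clBuildProgram(program, 1, &device, "", NULL, NULL);

    // 5. Create the kernel.
    cl_kernel kernel = clCreateKernel(program, "vector_add", &err);

    // 6. Create input/output buffers.
    cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(a), NULL, &err);
    cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(b), NULL, &err);
    cl_mem buf_o = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, &err);

    // 7. Set the kernel arguments.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_o);

    // 8. Write input data (blocking writes for simplicity).
    clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, buf_b, CL_TRUE, 0, sizeof(b), b, 0, NULL, NULL);

    // 9. Execute the kernel: one work item per element.
    size_t global_size = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    // 10. Read the result back (blocking read waits for the kernel).
    clEnqueueReadBuffer(queue, buf_o, CL_TRUE, 0, sizeof(out), out,
                        0, NULL, NULL);
    printf("out[42] = %f\n", out[42]);  // expected: 126.0

    // 12. Release all resources.
    clReleaseMemObject(buf_a); clReleaseMemObject(buf_b);
    clReleaseMemObject(buf_o);
    clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```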
So far, we have explored the concepts and usage of OpenCL. Does understanding the concepts feel like one thing, while actually doing the parallel programming and optimization still seems difficult and daunting? Don't worry! We are diligently developing Optimium for people like you.
Optimium, ENERZAi's AI inference optimization engine, performs automatic optimization and currently supports both single-thread and multi-thread execution on AMD64 and Arm CPUs. It also accelerates not only AI models but also preprocessing tasks, with GPU support coming soon! If you're interested in Optimium or have inquiries about our AI inference optimization technology, feel free to contact us at contact@enerzai.com or visit our LinkedIn (www.linkedin.com/company/enerzai) page! Hope to see you again 🙂