Vulkan Compute Shader — the core of GPU code execution
Jin-Hwan Shin

July 22, 2024

Hello, this is Jin-Hwan Shin, and I am developing the runtime at ENERZAi. As introduced in previous posts, our proprietary programming language, Nadya, currently supports only CPUs. However, we are conducting research and development to extend its coverage to GPUs. Unlike the CPU, where code can be compiled and executed immediately, the GPU requires a separate library to support code execution. Well-known libraries such as DirectX, Metal, and Vulkan fall into this category.

In addition to graphics libraries such as DirectX and Vulkan, there are also GPGPU libraries dedicated to computing, including CUDA and OpenCL. As GPUs are increasingly used for general computation, graphics libraries have also started supporting GPGPU by adding features such as the DirectX Compute Shader (formerly DirectCompute), Metal Performance Shaders, and the Vulkan Compute Shader.

In this post, we will explore the Vulkan Compute Shader, which performs computations using Vulkan, the open standard for graphics. First, we will explain the concepts necessary for using Vulkan, and then demonstrate a brief usage example with code. We hope this will be helpful to many of you.

Vulkan Compute Shader

The Vulkan runtime enables compiled code to be executed on the GPU. The key concepts are organized below by keyword for your reference.

Instance

Unlike OpenGL, which uses global state, Vulkan stores state per application. The object that holds this application-specific state is the VkInstance object.

Physical Device & Logical Device

In Vulkan, devices are divided into physical devices (VkPhysicalDevice) and logical devices (VkDevice).

A physical device represents one Vulkan implementation (it can be thought of as one GPU, similar to the concept of a platform in OpenCL), and a logical device is an object instantiated from a physical device, each having its own resources and state. You can enumerate physical devices using the vkEnumeratePhysicalDevices function and retrieve information about a physical device using the vkGetPhysicalDeviceProperties function.

A queue is a conduit for processing commands. Multiple queues can exist, and queues that perform similar roles are grouped into queue families. One queue family can handle multiple types of commands, and multiple queue families can handle one type of command. The types of target commands are as follows:

  • Video Decode

  • Video Encode

  • Graphics

  • Compute

  • Transfer

  • Memory management

Information about queue families of a physical device can be obtained using the vkGetPhysicalDeviceQueueFamilyProperties function.

Once you find an appropriate physical device using the above functions, you can create a logical device from that physical device, and this task is performed using the vkCreateDevice function.

Buffer & Memory

Buffer

In Vulkan, buffer and memory are managed separately. A buffer is just a view of memory, and the actual memory is managed separately. Therefore, in Vulkan, you need to bind memory to buffers.

  • VkBuffer objects can be created using the vkCreateBuffer function.

  • Memory can be bound to buffer objects using the vkBindBufferMemory function.

There are various types of buffers that can be used in shaders, but the following three types are most commonly used in compute shaders:

  1. Storage Buffer

  • A buffer capable of reading and writing large amounts of data. Suitable for storing information such as tensors, weights, and biases.

  2. Uniform Buffer

  • A buffer suitable for reading small amounts of data. Often used to pass parameters to the kernel.

  3. Push Constant Buffer

  • Similar to a Uniform Buffer, but its values are recorded directly into the command buffer (via vkCmdPushConstants) instead of being written to a separate memory allocation.

Memory

Unlike OpenCL or OpenGL, in Vulkan device memory allocation must be handled by the application. In OpenCL and OpenGL, you only need to specify the size and access method of the desired memory, and it is allocated automatically. In Vulkan, however, you must enumerate the device's memory heaps, find a suitable memory area, and allocate from it yourself.

Device memory is classified as follows based on whether it can be accessed by the device and host:

  1. Device-local

  • Memory accessible only by the device.

  2. Device-local, Host-visible

  • Device memory that is also accessible by the host.

  3. Host-local, Host-visible

  • Host memory that is also accessible by the device.

Depending on the device, these memories may be separated, or one memory can deal with all the tasks. You can achieve higher performance if memory is allocated to the appropriate area that corresponds to the purpose of the buffer.

Information about the memory of a physical device can be obtained using the vkGetPhysicalDeviceMemoryProperties function, and device memory can be allocated with the vkAllocateMemory function using a memory type index obtained from it. Note that Vulkan limits the number of memory allocations: memory can be allocated at most VkPhysicalDeviceLimits::maxMemoryAllocationCount times.

If necessary, you may need to allocate one large block of memory and then bind portions of it to buffers at different offsets when creating them, as sketched below.
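
For example, a minimal sketch of such sub-allocation might look like the following (buffer_a, buffer_b, and memory_type_index are hypothetical; a real implementation must also confirm that the chosen memory type is allowed by the memoryTypeBits of both buffers):

VkMemoryRequirements req_a, req_b;
vkGetBufferMemoryRequirements(device, buffer_a, &req_a);
vkGetBufferMemoryRequirements(device, buffer_b, &req_b);

// place buffer_b right after buffer_a, rounded up to buffer_b's required alignment
VkDeviceSize offset_b = (req_a.size + req_b.alignment - 1) & ~(req_b.alignment - 1);

VkMemoryAllocateInfo alloc_info{};
alloc_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize = offset_b + req_b.size;
alloc_info.memoryTypeIndex = memory_type_index; // hypothetical: found via vkGetPhysicalDeviceMemoryProperties

VkDeviceMemory bulk_memory;
vkAllocateMemory(device, &alloc_info, nullptr, &bulk_memory);

// bind each buffer to its own region of the single allocation
vkBindBufferMemory(device, buffer_a, bulk_memory, /*offset=*/0);
vkBindBufferMemory(device, buffer_b, bulk_memory, /*offset=*/offset_b);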

Pipeline

A pipeline is a sequence of processes performed by a device. It serves as a document that describes various information, such as which resources will be used, what stages exist, and which shaders are used in those stages.

The compute pipeline omits several processes compared to the graphics pipeline. While the graphics pipeline involves many stages, such as vertex shader, geometry shader, pixel shader, and ray tracing, the compute pipeline just performs computations on input buffer values and writes the result to the output buffer. Hence, it has only one stage, the compute shader.

Therefore, once the compute shader is determined, a compute pipeline can be created using the vkCreateComputePipelines function.

Descriptor

A descriptor is an object that represents resources used by shaders. Descriptors allow shaders to bind the resources they use.

The descriptor-related objects are the Descriptor Pool, Descriptor Set Layout, and Descriptor Set:

  1. Descriptor Pool

  • A pool from which Descriptor Sets are allocated. It acts as a memory allocator.

  2. Descriptor Set Layout

  • Describes the structure of a Descriptor Set. It can be shared among multiple shaders using the same layout.

  3. Descriptor Set

  • An object that contains the actual binding information, such as which actual buffer each buffer in the set is mapped to and at which offset.

Additionally, there are limits on the resources that can be used when creating Descriptor Sets. The relevant values can be obtained from the VkPhysicalDeviceLimits structure (see the short sketch after the list below).

  1. VkPhysicalDeviceLimits::maxPushConstantsSize

  • The maximum size of the Push Constant Buffer.

  2. VkPhysicalDeviceLimits::maxPerStageDescriptorUniformBuffers

  • The maximum number of uniform buffers that can be used in a single stage.

  3. VkPhysicalDeviceLimits::maxPerStageDescriptorStorageBuffers

  • The maximum number of storage buffers that can be used in a single stage.

  4. VkPhysicalDeviceLimits::maxPerStageResources

  • The maximum number of all resources that can be used in a single stage.

  5. VkPhysicalDeviceLimits::maxDescriptorSetUniformBuffers

  • The maximum number of uniform buffers that can be used in a single Descriptor Set.

  6. VkPhysicalDeviceLimits::maxDescriptorSetStorageBuffers

  • The maximum number of storage buffers that can be used in a single Descriptor Set.
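
As a reference point, here is a minimal sketch of reading these limits before building descriptor set layouts (num_storage_buffers, num_uniform_buffers, and push_constant_size are hypothetical values describing your kernel):

VkPhysicalDeviceProperties properties;
vkGetPhysicalDeviceProperties(physical_device, &properties);
const VkPhysicalDeviceLimits& limits = properties.limits;

// make sure the kernel's resource usage stays within the device limits
assert(num_storage_buffers <= limits.maxPerStageDescriptorStorageBuffers);
assert(num_uniform_buffers <= limits.maxPerStageDescriptorUniformBuffers);
assert(push_constant_size <= limits.maxPushConstantsSize);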

Command Pool & Command Buffer

A command buffer is a buffer that records commands to be executed by the GPU. Commands are written to this buffer and then submitted to the queue all at once. A command pool is the allocator from which command buffers are allocated. Because commands recorded once into a command buffer (allocated from the command pool) can be submitted repeatedly, commands can be executed efficiently.

In a compute shader, command buffers record commands using the vkCmd~ series of functions between vkBeginCommandBuffer and vkEndCommandBuffer.

  • vkCmdBindDescriptorSets: Binds the descriptor set to be used. The bound buffer will be used from this point onwards.

  • vkCmdDispatch: Executes the compute shader.

  • vkCmdCopyBuffer: Copies buffers.

  • vkCmdPipelineBarrier: Inserts a barrier between pipelines. Memory barriers can also be inserted using this function.

  • vkCmdPushConstants: Writes values to the Push Constant Buffer.

  • vkCmdSetEvent: Raises an event.

  • vkCmdResetEvent: Resets an event.

  • vkCmdWaitEvents: Waits for an event to occur.

If you want to clear a command buffer and record it again, reset it with the vkResetCommandBuffer function before calling vkBeginCommandBuffer.

The constraints when executing a compute shader are as follows (a small validation sketch follows the list):

  1. VkPhysicalDeviceLimits::maxComputeSharedMemorySize

  • The maximum size of shared memory (local memory in OpenCL).

  2. VkPhysicalDeviceLimits::maxComputeWorkGroupCount

  • The maximum global workgroup count, i.e. the maximum number of workgroups that can be dispatched in each dimension.

  3. VkPhysicalDeviceLimits::maxComputeWorkGroupSize

  • The maximum local workgroup size in each dimension.

  4. VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations

  • The maximum total number of invocations in a local workgroup. The product of all dimensions of the local workgroup size must not exceed this value.
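
A minimal validation sketch for these limits (local_size_x/y/z and group_count_x/y/z are hypothetical values describing your shader and dispatch; properties comes from vkGetPhysicalDeviceProperties as shown earlier):

const VkPhysicalDeviceLimits& limits = properties.limits;

// local workgroup size declared in the shader
assert(local_size_x <= limits.maxComputeWorkGroupSize[0]);
assert(local_size_y <= limits.maxComputeWorkGroupSize[1]);
assert(local_size_z <= limits.maxComputeWorkGroupSize[2]);
assert(local_size_x * local_size_y * local_size_z <= limits.maxComputeWorkGroupInvocations);

// number of workgroups passed to vkCmdDispatch
assert(group_count_x <= limits.maxComputeWorkGroupCount[0]);
assert(group_count_y <= limits.maxComputeWorkGroupCount[1]);
assert(group_count_z <= limits.maxComputeWorkGroupCount[2]);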

Fences, Semaphores, Events, and Barriers

Fence

A fence is an object used for synchronization between the host and the command queue, supporting tasks such as waiting for the completion of submitted commands. If a fence is passed to the vkQueueSubmit function when submitting the commands recorded in a command buffer, you can wait until those commands have finished.

It is created using the vkCreateFence function, and you can instruct to wait until one or more fences are signaled using the vkWaitForFences function.

Semaphore

A semaphore is a synchronization object used to insert dependencies between command queues. It supports tasks such as waiting until a specific queue is finished or executing commands after a specific queue is finished. A semaphore is created using the vkCreateSemaphore function, and dependencies are set when submitting commands using the vkQueueSubmit function.
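
For example, a minimal sketch of chaining two submissions with a semaphore (cmd_producer and cmd_consumer are hypothetical command buffers recorded beforehand):

VkSemaphoreCreateInfo semaphore_create_info{};
semaphore_create_info.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;

VkSemaphore semaphore;
vkCreateSemaphore(device, &semaphore_create_info, nullptr, &semaphore);

// first submission signals the semaphore when it finishes
VkSubmitInfo first{};
first.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
first.commandBufferCount = 1;
first.pCommandBuffers = &cmd_producer;
first.signalSemaphoreCount = 1;
first.pSignalSemaphores = &semaphore;

// second submission waits on the semaphore before running its compute work
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
VkSubmitInfo second{};
second.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
second.commandBufferCount = 1;
second.pCommandBuffers = &cmd_consumer;
second.waitSemaphoreCount = 1;
second.pWaitSemaphores = &semaphore;
second.pWaitDstStageMask = &wait_stage;

vkQueueSubmit(queue, 1, &first, VK_NULL_HANDLE);
vkQueueSubmit(queue, 1, &second, VK_NULL_HANDLE);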

Event

An event is a synchronization object used to insert dependencies between commands. It can only be used between commands in the same queue. An event supports tasks such as waiting until a specific command is finished or executing commands after a specific command has executed. It is created using the vkCreateEvent function, and unlike other synchronization objects, it supports synchronization not only within the device but also between the device and the host. The host can signal using vkSetEvent and vkResetEvent, and the device can signal using vkCmdSetEvent and vkCmdResetEvent when recording commands into the command buffer.
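
For example, a minimal sketch of host-to-device synchronization with an event (the recorded compute work stalls until the host signals the event):

VkEventCreateInfo event_create_info{};
event_create_info.sType = VK_STRUCTURE_TYPE_EVENT_CREATE_INFO;

VkEvent event;
vkCreateEvent(device, &event_create_info, nullptr, &event);

// while recording: stall subsequent compute work until the event is signaled by the host
vkCmdWaitEvents(command_buffer, 1, &event,
                VK_PIPELINE_STAGE_HOST_BIT,            // source: the host signals
                VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // destination: compute work waits
                0, nullptr, 0, nullptr, 0, nullptr);
vkCmdDispatch(command_buffer, WORKGROUP_X, WORKGROUP_Y, WORKGROUP_Z);

// later, on the host side, after the command buffer has been submitted:
vkSetEvent(device, event);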

Barrier

A barrier is a synchronization mechanism used to insert dependencies between commands. It performs a similar role to an event, but it is used only within the device, between commands submitted to a queue. It is also used for memory synchronization, and it is recorded into the command buffer using the vkCmdPipelineBarrier function.
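
For example, a minimal sketch of a barrier between two dispatches in the same command buffer (intermediate_buffer is a hypothetical buffer written by the first dispatch and read by the second):

VkBufferMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;  // writes of the first dispatch
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;   // reads of the second dispatch
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer = intermediate_buffer;
barrier.offset = 0;
barrier.size = VK_WHOLE_SIZE;

vkCmdDispatch(command_buffer, X1, Y1, Z1);
vkCmdPipelineBarrier(command_buffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // producer stage
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // consumer stage
                     0, 0, nullptr, 1, &barrier, 0, nullptr);
vkCmdDispatch(command_buffer, X2, Y2, Z2);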

Execute Kernel with Vulkan

  • Template compute shader code

#version 450

// local workgroup size (example value; choose it to match your kernel)
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// NOTE: "input" and "output" are reserved words in GLSL, so different instance names are used.
layout(set=0, binding=0) readonly buffer Input {
    float input_tensor[];
} input_data;

layout(set=0, binding=1) writeonly buffer Output {
    float output_tensor[];
} output_data;

layout(set=0, binding=2) uniform UniformArgs {
 // ...
} uniform_args;

layout(push_constant) uniform PushArgs {
 // ...
} push_args;

void main() {
    // do something...
}

1. Create an instance.

VkApplicationInfo appInfo{};
appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
appInfo.pApplicationName = "Hello World App";
appInfo.applicationVersion = VK_MAKE_VERSION(0, 0, 1);
appInfo.pEngineName = "No Engine";
appInfo.engineVersion = VK_MAKE_VERSION(0, 0, 1);
appInfo.apiVersion = VK_API_VERSION_1_2;

VkInstanceCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
createInfo.pApplicationInfo = &appInfo;

// You can add the validation layer if needed, but it is not recommended for
// production builds since it can cause a performance drop.
const char* layers[1] = {
   "VK_LAYER_KHRONOS_validation"
};
VkValidationFeatureEnableEXT enable_features[] = {
  VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT,
  VK_VALIDATION_FEATURE_ENABLE_BEST_PRACTICES_EXT
};
VkValidationFeaturesEXT validation_features{};
validation_features.sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT;
validation_features.enabledValidationFeatureCount = 2;
validation_features.pEnabledValidationFeatures = enable_features;

createInfo.enabledLayerCount = 1;
createInfo.ppEnabledLayerNames = layers;
createInfo.pNext = &validation_features;
// end of validation layer

VkInstance instance;
vkCreateInstance(&createInfo, nullptr, &instance);

// If the validation layer is enabled, a debug callback can be registered after the instance is created.
// vkCreateDebugReportCallbackEXT is an extension function: it requires the VK_EXT_debug_report
// instance extension and must be loaded with vkGetInstanceProcAddr.
VkDebugReportCallbackCreateInfoEXT debug_callback_create_info{};
debug_callback_create_info.sType = VK_STRUCTURE_TYPE_DEBUG_REPORT_CALLBACK_CREATE_INFO_EXT;
debug_callback_create_info.flags = VK_DEBUG_REPORT_INFORMATION_BIT_EXT|VK_DEBUG_REPORT_WARNING_BIT_EXT|VK_DEBUG_REPORT_PERFORMANCE_WARNING_BIT_EXT|VK_DEBUG_REPORT_ERROR_BIT_EXT|VK_DEBUG_REPORT_DEBUG_BIT_EXT;
debug_callback_create_info.pfnCallback = DebugCallback;

VkDebugReportCallbackEXT debug_callback;
vkCreateDebugReportCallbackEXT(instance, &debug_callback_create_info, nullptr, &debug_callback);

2. Search for an appropriate physical device.

std::vector<VkPhysicalDevice> physical_devices;
uint32_t device_count = 0;
vkEnumeratePhysicalDevices(instance, &device_count, nullptr);

physical_devices.resize(device_count);
vkEnumeratePhysicalDevices(instance, &device_count, physical_devices.data());

VkPhysicalDevice selected_physical_device = VK_NULL_HANDLE;
for (auto physical_device : physical_devices) {
    VkPhysicalDeviceProperties device_properties;
    VkPhysicalDeviceFeatures device_features;
    VkPhysicalDeviceMemoryProperties device_memory_properties;
  std::vector<VkQueueFamilyProperties> queue_family_properties;
    uint32_t queue_family_property_count;

    vkGetPhysicalDeviceProperties(physical_device, &device_properties);
    vkGetPhysicalDeviceFeatures(physical_device, &device_features);
  vkGetPhysicalDeviceMemoryProperties(physical_device, &device_memory_properties);

  vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &queue_family_property_count, nullptr);
  queue_family_properties.resize(queue_family_property_count);
  vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &queue_family_property_count, queue_family_properties.data());

    // find a suitable device...
    // A suitable device must have a queue family and memory heap usable for compute.
    // A device with a larger memory heap and/or a discrete GPU may offer better performance.
    // It's all up to you which device to choose. As the simplest policy, pick the first device:
    if (selected_physical_device == VK_NULL_HANDLE) {
        selected_physical_device = physical_device;
    }
}

3. Once an appropriate physical device is found, create a logical device.

// find queue family for compute
std::optional<uint32_t> queue_family_index;
std::vector<VkQueueFamilyProperties> queue_family_properties;
uint32_t queue_family_property_count;

vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &queue_family_property_count, nullptr);
queue_family_properties.resize(queue_family_property_count);
vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &queue_family_property_count, queue_family_properties.data());

for (auto i = 0; i < queue_family_property_count; ++i) {
    auto& properties = queue_family_properties[i];

    if (properties.queueFlags & VK_QUEUE_COMPUTE_BIT) {
        queue_family_index = i;
        break;
    }
}

assert(queue_family_index.has_value());

// create logical device
float queuePriority = 1.0f;
VkDeviceQueueCreateInfo queueCreateInfo{};
queueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueCreateInfo.queueFamilyIndex = queue_family_index.value();
queueCreateInfo.queueCount = 1;
queueCreateInfo.pQueuePriorities = &queuePriority;

VkDeviceCreateInfo deviceCreateInfo{};
deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceCreateInfo.pQueueCreateInfos = &queueCreateInfo;
deviceCreateInfo.queueCreateInfoCount = 1;
deviceCreateInfo.enabledExtensionCount = 0;

// you can add extension if needed.

deviceCreateInfo.enabledLayerCount = 0;

VkDevice device;
vkCreateDevice(physical_device, &deviceCreateInfo, nullptr, &device);

4. Create a queue.

VkQueue queue;
vkGetDeviceQueue(device, queue_family_index.value(), 0, &queue);

5. Create input and output buffers.

VkBufferCreateInfo input_buffer_create_info{};
VkBufferCreateInfo output_buffer_create_info{};
VkBufferCreateInfo uniform_buffer_create_info{};

input_buffer_create_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
input_buffer_create_info.size = INPUT_SIZE;
input_buffer_create_info.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
input_buffer_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

output_buffer_create_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
output_buffer_create_info.size = OUTPUT_SIZE;
output_buffer_create_info.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
output_buffer_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

uniform_buffer_create_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
uniform_buffer_create_info.size = sizeof(UniformArgs);
uniform_buffer_create_info.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;
uniform_buffer_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

VkBuffer input_buffer;
VkBuffer output_buffer;
VkBuffer uniform_buffer;

vkCreateBuffer(device, &input_buffer_create_info, nullptr, &input_buffer);
vkCreateBuffer(device, &output_buffer_create_info, nullptr, &output_buffer);
vkCreateBuffer(device, &uniform_buffer_create_info, nullptr, &uniform_buffer);

6. Allocate device memory for the buffers.

// Find suitable memory heap
VkPhysicalDeviceMemoryProperties device_memory_properties;
vkGetPhysicalDeviceMemoryProperties(physical_device, &device_memory_properties);

// Pick the memory properties that match how the buffer will be used:
constexpr VkMemoryPropertyFlags required_properties =
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; // memory the host can access without explicit flushes
// or use VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT for memory that is only used on the device.
VkMemoryRequirements input_buffer_requirements;
VkMemoryRequirements output_buffer_requirements;
VkMemoryRequirements uniform_buffer_requirements;

vkGetBufferMemoryRequirements(device, input_buffer, &input_buffer_requirements);
vkGetBufferMemoryRequirements(device, output_buffer, &output_buffer_requirements);
vkGetBufferMemoryRequirements(device, uniform_buffer, &uniform_buffer_requirements);

auto input_memory_type_index = FindOptimalHeap(input_buffer_requirements.memoryTypeBits, required_properties);
auto output_memory_type_index = FindOptimalHeap(output_buffer_requirements.memoryTypeBits, required_properties);
auto uniform_memory_type_index = FindOptimalHeap(uniform_buffer_requirements.memoryTypeBits, required_properties);

assert(input_memory_type_index.has_value());
assert(output_memory_type_index.has_value());
assert(uniform_memory_type_index.has_value());

VkDeviceMemory input_buffer_memory;
VkDeviceMemory output_buffer_memory;
VkDeviceMemory uniform_buffer_memory;
VkMemoryAllocateInfo input_memory_allocate_info{};
VkMemoryAllocateInfo output_memory_allocate_info{};
VkMemoryAllocateInfo uniform_memory_allocate_info{};

input_memory_allocate_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
input_memory_allocate_info.allocationSize = input_buffer_requirements.size;
input_memory_allocate_info.memoryTypeIndex = input_memory_type_index.value();

output_memory_allocate_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
output_memory_allocate_info.allocationSize = output_buffer_requirements.size;
output_memory_allocate_info.memoryTypeIndex = output_memory_type_index.value();

uniform_memory_allocate_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
uniform_memory_allocate_info.allocationSize = uniform_buffer_requirements.size;
uniform_memory_allocate_info.memoryTypeIndex = uniform_memory_type_index.value();

vkAllocateMemory(device, &input_memory_allocate_info, nullptr, &input_buffer_memory);
vkAllocateMemory(device, &output_memory_allocate_info, nullptr, &output_buffer_memory);
vkAllocateMemory(device, &uniform_memory_allocate_info, nullptr, &uniform_buffer_memory);
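
The FindOptimalHeap helper used above is not a Vulkan function. A possible implementation (the name and signature are taken from this example; it would be defined before the calls above) simply scans the memory types reported by the physical device:

auto FindOptimalHeap = [&](uint32_t memory_type_bits,
                           VkMemoryPropertyFlags properties) -> std::optional<uint32_t> {
    for (uint32_t i = 0; i < device_memory_properties.memoryTypeCount; ++i) {
        const bool allowed = (memory_type_bits & (1u << i)) != 0;        // type allowed by the buffer
        const bool suitable = (device_memory_properties.memoryTypes[i].propertyFlags &
                               properties) == properties;               // has the requested properties
        if (allowed && suitable) {
            return i;
        }
    }
    return std::nullopt; // no matching memory type found
};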

7. Bind the device memory to the buffers.

vkBindBufferMemory(device, input_buffer, input_buffer_memory, /*offset=*/0);
vkBindBufferMemory(device, output_buffer, output_buffer_memory, /*offset=*/0);
vkBindBufferMemory(device, uniform_buffer, uniform_buffer_memory, /*offset=*/0);

8. If necessary, map the input buffer and input the required data.

void* input_ptr;
vkMapMemory(device, input_buffer_memory, /*offset=*/0, VK_WHOLE_SIZE, 0, &input_ptr);

// do something with input_ptr

vkUnmapMemory(device, input_buffer_memory);
input_ptr = nullptr;

void* uniform_ptr;
vkMapMemory(device, uniform_buffer_memory, /*offset=*/0, VK_WHOLE_SIZE, 0, &uniform_ptr);

// do something with uniform_ptr

vkUnmapMemory(device, uniform_buffer_memory);
uniform_ptr = nullptr;

9. Create descriptor sets.

VkDescriptorSetLayoutBinding bindings[] = {
    { /*binding=*/0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, /*descriptorCount=*/1, VK_SHADER_STAGE_COMPUTE_BIT, /*pImmutableSamplers=*/nullptr },
    { /*binding=*/1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, /*descriptorCount=*/1, VK_SHADER_STAGE_COMPUTE_BIT, /*pImmutableSamplers=*/nullptr },
    { /*binding=*/2, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, /*descriptorCount=*/1, VK_SHADER_STAGE_COMPUTE_BIT, /*pImmutableSamplers=*/nullptr }
};

VkDescriptorSetLayoutCreateInfo desc_set_layout_create_info{};
desc_set_layout_create_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
desc_set_layout_create_info.bindingCount = 3;
desc_set_layout_create_info.pBindings = bindings;

VkDescriptorSetLayout desc_set_layout;
vkCreateDescriptorSetLayout(device, &desc_set_layout_create_info, nullptr, &desc_set_layout);

VkDescriptorPoolSize desc_pool_size[] = {
    { VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 2 },
    { VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1 }
};

VkDescriptorPoolCreateInfo desc_pool_create_info{};
desc_pool_create_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
// required if descriptor sets will be freed individually with vkFreeDescriptorSets later
desc_pool_create_info.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;
desc_pool_create_info.maxSets = 1; // how many sets can be allocated
desc_pool_create_info.poolSizeCount = 2;
desc_pool_create_info.pPoolSizes = desc_pool_size;

VkDescriptorPool desc_pool;
vkCreateDescriptorPool(device, &desc_pool_create_info, nullptr, &desc_pool);

VkDescriptorSetAllocateInfo desc_set_allocate_info{};
desc_set_allocate_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
desc_set_allocate_info.descriptorPool = desc_pool;
desc_set_allocate_info.descriptorSetCount = 1;
desc_set_allocate_info.pSetLayouts = &desc_set_layout;

VkDescriptorSet desc_set;
vkAllocateDescriptorSets(device, &desc_set_allocate_info, &desc_set);

10. Connect the descriptor sets to the buffers.

VkDescriptorBufferInfo input_desc_buffer_info{};
VkDescriptorBufferInfo output_desc_buffer_info{};
VkDescriptorBufferInfo uniform_desc_buffer_info{};

input_desc_buffer_info.buffer = input_buffer;
input_desc_buffer_info.offset = 0;
input_desc_buffer_info.range = VK_WHOLE_SIZE;

output_desc_buffer_info.buffer = output_buffer;
output_desc_buffer_info.offset = 0;
output_desc_buffer_info.range = VK_WHOLE_SIZE;

uniform_desc_buffer_info.buffer = uniform_buffer;
uniform_desc_buffer_info.offset = 0;
uniform_desc_buffer_info.range = VK_WHOLE_SIZE;

VkWriteDescriptorSet write_desc_set[] = {
    { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, nullptr, desc_set, /*dstBinding=*/0, /*dstArrayElement=*/0, /*descriptorCount=*/1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, nullptr, &input_desc_buffer_info, nullptr },
    { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, nullptr, desc_set, /*dstBinding=*/1, /*dstArrayElement=*/0, /*descriptorCount=*/1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, nullptr, &output_desc_buffer_info, nullptr },
    { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, nullptr, desc_set, /*dstBinding=*/2, /*dstArrayElement=*/0, /*descriptorCount=*/1, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, nullptr, &uniform_desc_buffer_info, nullptr },
};

vkUpdateDescriptorSets(device, 3, write_desc_set, 0, nullptr);

11. Load the shader.

std::vector<char> shader_code;
// load shader code

VkShaderModuleCreateInfo shader_module_create_info{};
shader_module_create_info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
shader_module_create_info.pCode = reinterpret_cast<uint32_t*>(shader_code.data());
shader_module_create_info.codeSize = shader_code.size();

VkShaderModule shader_module;
vkCreateShaderModule(device, &shader_module_create_info, nullptr, &shader_module);

12. Create the pipeline.

VkPushConstantRange push_constant_range{};
push_constant_range.offset = 0;
push_constant_range.size = sizeof(PushArgs);
push_constant_range.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;

VkPipelineLayoutCreateInfo pipeline_layout_create_info{};
pipeline_layout_create_info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
pipeline_layout_create_info.setLayoutCount = 1;
pipeline_layout_create_info.pSetLayouts = &desc_set_layout;
pipeline_layout_create_info.pushConstantRangeCount = 1;
pipeline_layout_create_info.pPushConstantRanges = &push_constant_range;

VkPipelineLayout pipeline_layout;
vkCreatePipelineLayout(device, &pipeline_layout_create_info, nullptr, &pipeline_layout);

VkComputePipelineCreateInfo compute_pipeline_create_info{};
compute_pipeline_create_info.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;

compute_pipeline_create_info.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
compute_pipeline_create_info.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
compute_pipeline_create_info.stage.module = shader_module;
compute_pipeline_create_info.stage.pName = "main";

compute_pipeline_create_info.layout = pipeline_layout;

VkPipeline pipeline;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &compute_pipeline_create_info, nullptr, &pipeline);

13. Create a command pool.

VkCommandPoolCreateInfo command_pool_create_info{};
command_pool_create_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;

// If you plan to call vkResetCommandBuffer, the following flag is required.
command_pool_create_info.flags |= VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;

command_pool_create_info.queueFamilyIndex = queue_family_index.value();

VkCommandPool command_pool;
vkCreateCommandPool(device, &command_pool_create_info, nullptr, &command_pool);

14. Create a command buffer.

VkCommandBufferAllocateInfo command_buffer_allocate_info{};
command_buffer_allocate_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
command_buffer_allocate_info.commandPool = command_pool;
command_buffer_allocate_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
command_buffer_allocate_info.commandBufferCount = 1;

VkCommandBuffer command_buffer;
vkAllocateCommandBuffers(device, &command_buffer_allocate_info, &command_buffer);

15. Write commands in the created command buffer.

VkCommandBufferBeginInfo command_buffer_begin_info{};
command_buffer_begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;

// Choose one of the following flags:
// use this flag if the commands will only be submitted once.
command_buffer_begin_info.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

// use this flag if the commands recorded once will be submitted repeatedly.
command_buffer_begin_info.flags = VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT;

// reset the command buffer if it is being recorded again.
vkResetCommandBuffer(command_buffer, 0);

vkBeginCommandBuffer(command_buffer, &command_buffer_begin_info);

// recording commands
vkCmdBindPipeline(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline_layout, /*firstSet=*/0, 1, &desc_set, 0, nullptr);
vkCmdPushConstants(command_buffer, pipeline_layout, VK_SHADER_STAGE_COMPUTE_BIT, /*offset=*/0, sizeof(PushArgs), &push_args);
vkCmdDispatch(command_buffer, WORKGROUP_X, WORKGROUP_Y, WORKGROUP_Z);
// end of recording commands

vkEndCommandBuffer(command_buffer);

16. Submit the written commands to the queue.

VkFenceCreateInfo fence_create_info{};
fence_create_info.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;

VkFence fence;
vkCreateFence(device, &fence_create_info, nullptr, &fence);

VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &command_buffer;

vkQueueSubmit(queue, 1, &submit_info, fence);
vkWaitForFences(device, 1, &fence, /*waitAll=*/VK_TRUE, /*timeout=*/UINT64_MAX);

17. If necessary, map the output buffer and read back the result data.

  • If needed, repeat steps 8, 14, 15, 16, and 17 in order, or skip steps 14 and 15 if reusing commands.

void* output_ptr;
vkMapMemory(device, output_buffer_memory, /*offset=*/0, VK_WHOLE_SIZE, 0, &output_ptr);

// do something with output_ptr

vkUnmapMemory(device, output_buffer_memory);
output_ptr = nullptr;

18. Release all allocated resources.

vkFreeCommandBuffers(device, command_pool, 1, &command_buffer);
vkDestroyFence(device, fence, nullptr);
vkDestroyCommandPool(device, command_pool, nullptr);
vkDestroyPipeline(device, pipeline, nullptr);
vkDestroyPipelineLayout(device, pipeline_layout, nullptr);
vkDestroyShaderModule(device, shader_module, nullptr);
vkFreeDescriptorSets(device, desc_pool, 1, &desc_set);
vkDestroyDescriptorPool(device, desc_pool, nullptr);
vkDestroyDescriptorSetLayout(device, desc_set_layout, nullptr);
vkFreeMemory(device, uniform_buffer_memory, nullptr);
vkFreeMemory(device, output_buffer_memory, nullptr);
vkFreeMemory(device, input_buffer_memory, nullptr);
vkDestroyBuffer(device, uniform_buffer, nullptr);
vkDestroyBuffer(device, output_buffer, nullptr);
vkDestroyBuffer(device, input_buffer, nullptr);
vkDestroyDevice(device, nullptr);

// destroy the debug callback as well if it was created.
vkDestroyDebugReportCallbackEXT(instance, debug_callback, nullptr);

vkDestroyInstance(instance, nullptr);

We have explained the concepts and usage of Vulkan. We hope this will be helpful for those looking to utilize GPU acceleration with Vulkan. Our goal is to complete the development of GPU support features so that you can perform GPU acceleration through Optimium and Nadya in an abstracted and convenient way, without having to do the tasks yourself. In the next post, we will cover OpenCL, so please look forward to it!
