Hello, this is Jin-Hwan Shin, and I work on the runtime at ENERZAi. As introduced in earlier posts, our proprietary programming language, Nadya, currently supports only the CPU, and we are conducting research and development to extend it to the GPU. Unlike the CPU, where code can be compiled and executed immediately, the GPU requires a separate library to support code execution. Well-known libraries including DirectX, Metal, and Vulkan fall into this category.
In addition to graphics libraries such as DirectX and Vulkan, there are GPGPU libraries dedicated to computing, including CUDA and OpenCL. As GPUs are increasingly used for general-purpose computation, graphics libraries have also added GPGPU support through features such as DirectX Compute Shader (formerly DirectCompute), Metal Performance Shaders, and Vulkan Compute Shader.
In this post, we will explore the Vulkan Compute Shader, which performs computations using Vulkan, the open standard for Graphics. First, we will explain the concepts necessary for using Vulkan and then will demonstrate a brief usage example with code. We hope this will be helpful to many of you.
Vulkan Compute Shader
The Vulkan runtime executes compiled code on the GPU through the flow described above. The key concepts are organized below by keyword for your reference.
Instance
Unlike OpenGL, which relies on global state, Vulkan stores state per application. The object that holds this per-application state is VkInstance.
Physical Device & Logical Device
In Vulkan, devices are divided into physical devices (VkPhysicalDevice) and logical devices (VkDevice).
A physical device represents one Vulkan implementation (it can be considered as one GPU, similar to the concept of a Platform in OpenCL), and a logical device is an object instantiated from a physical device, each having unique resources and states. You can enumerate physical devices using the vkEnumeratePhysicalDevices function and retrieve information about a physical device using the vkGetPhysicalDeviceProperties function.
A queue is a conduit for processing commands. Multiple queues can exist, and queues that perform similar roles are grouped into queue families. One queue family can handle multiple types of commands, and multiple queue families can handle one type of command. The types of target commands are as follows:
Video Decode
Video Encode
Graphics
Compute
Transfer
Memory management
Information about queue families of a physical device can be obtained using the vkGetPhysicalDeviceQueueFamilyProperties function.
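To make the queue family idea concrete, here is a minimal sketch of how a compute-capable family might be chosen from the queried properties. It is not the selection logic used later in this post. So that it compiles without the Vulkan SDK, it uses a stand-in `QueueFamily` struct that mirrors only the `queueFlags` and `queueCount` fields of `VkQueueFamilyProperties`, and two constants carrying the values of `VK_QUEUE_GRAPHICS_BIT` and `VK_QUEUE_COMPUTE_BIT`. The function name `PickComputeQueueFamily` and the policy of preferring a family without graphics support are assumptions made for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-ins so the sketch compiles without the Vulkan SDK.
// The values match VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT.
constexpr std::uint32_t kQueueGraphicsBit = 0x1;
constexpr std::uint32_t kQueueComputeBit = 0x2;

// Mirrors the two fields of VkQueueFamilyProperties that are used here.
struct QueueFamily {
    std::uint32_t queueFlags;
    std::uint32_t queueCount;
};

// Returns the index of a compute-capable queue family, preferring a family
// without graphics support. Returns std::nullopt if no family supports compute.
std::optional<std::uint32_t> PickComputeQueueFamily(const std::vector<QueueFamily>& families) {
    std::optional<std::uint32_t> fallback;
    for (std::uint32_t i = 0; i < families.size(); ++i) {
        const QueueFamily& family = families[i];
        if (family.queueCount == 0 || (family.queueFlags & kQueueComputeBit) == 0) {
            continue;  // this family cannot run compute commands
        }
        if ((family.queueFlags & kQueueGraphicsBit) == 0) {
            return i;  // compute without graphics: take it immediately
        }
        if (!fallback.has_value()) {
            fallback = i;  // remember the first graphics + compute family
        }
    }
    return fallback;
}
```

With real Vulkan types, the same loop would run over the array filled by vkGetPhysicalDeviceQueueFamilyProperties. Taking the first compute-capable family, as the walkthrough later in this post does, is simpler and works just as well for a single-queue program.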
Once you find an appropriate physical device using the above functions, you can create a logical device from that physical device, and this task is performed using the vkCreateDevice function.
Buffer & Memory
Buffer
In Vulkan, buffers and memory are managed separately. A buffer is just a view of memory, so you need to bind memory to a buffer before using it.
VkBuffer objects can be created using the vkCreateBuffer function.
Memory can be bound to buffer objects using the vkBindBufferMemory function.
There are various types of buffers that can be used in shaders, but the following three types are most commonly used in compute shaders:
1. Storage Buffer
A buffer capable of reading and writing large amounts of data. Suitable for storing information such as tensors, weights, and biases.
2. Uniform Buffer
A buffer suitable for reading small amounts of data. Often used to pass parameters to the kernel.
3. Push Constant Buffer
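Holds a small amount of data, like the Uniform Buffer, but it is not backed by a buffer object or a descriptor: the values are written directly into the command buffer with vkCmdPushConstants, and the available size is limited by the device.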
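As a rough illustration of how small kernel parameters could be laid out for a push constant block, the sketch below defines a host-side struct and checks its size. The fields `element_count` and `scale` are hypothetical; the shader template in this post leaves `PushArgs` empty. The 128-byte figure is the minimum that the Vulkan specification guarantees for the push constant size limit, and `FitsPushConstants` is an illustrative helper name, not a Vulkan function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical kernel parameters; the fields are placeholders for illustration.
struct PushArgs {
    std::uint32_t element_count;
    float scale;
};

// Minimum push constant size that every Vulkan implementation must support.
constexpr std::size_t kGuaranteedPushConstantBytes = 128;

static_assert(std::is_trivially_copyable_v<PushArgs>,
              "push constant data is copied byte by byte");
static_assert(sizeof(PushArgs) % 4 == 0,
              "vkCmdPushConstants requires a size that is a multiple of 4");
static_assert(sizeof(PushArgs) <= kGuaranteedPushConstantBytes,
              "fits on every conformant device");

// Runtime check against the limit reported by a specific device.
inline bool FitsPushConstants(std::size_t size, std::uint32_t max_push_constants_size) {
    assert(max_push_constants_size >= kGuaranteedPushConstantBytes);
    return size % 4 == 0 && size <= max_push_constants_size;
}
```

Parameters that do not fit within the device limit are better placed in a uniform buffer instead.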
Memory
Unlike OpenCL or OpenGL, in Vulkan, device memory allocation must be handled by the application. In OpenCL and OpenGL, you only need to specify the size and access method of the desired memory, and it is automatically allocated. However, in Vulkan, you must enumerate the memory heap of the device, find the available memory area, and perform the complex task of allocating to that area.
Device memory is classified as follows based on whether it can be accessed by the device and host:
1. Device-local
Memory accessible only by the device.
2. Device-local, Host-visible
Device memory that is also accessible by the host.
3. Host-local, Host-visible
Host memory that is also accessible by the device.
Depending on the device, these memories may be separated, or one memory can deal with all the tasks. You can achieve higher performance if memory is allocated to the appropriate area that corresponds to the purpose of the buffer.
Information about the memory of a physical device can be obtained using the vkGetPhysicalDeviceMemoryProperties function, and device memory can be allocated using the vkAllocateMemory function with the memory type index obtained from this function. Additionally, in Vulkan, there is a limit on the maximum number of memory allocations, so memory can only be allocated up to VkPhysicalDeviceLimits::maxMemoryAllocationCount.
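The memory type lookup can be sketched as follows. The FindOptimalHeap helper used in the walkthrough later in this post is not shown, so this is only one plausible shape for such a helper, not the author's implementation. To stay compilable without the Vulkan SDK, `MemoryType` is a stand-in mirroring `VkMemoryType`, the constants carry the values of the corresponding `VK_MEMORY_PROPERTY_*` bits, and the name `FindMemoryTypeIndex` is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Stand-ins with the values of the VK_MEMORY_PROPERTY_* bits.
constexpr std::uint32_t kDeviceLocalBit = 0x1;   // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisibleBit = 0x2;   // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCoherentBit = 0x4;  // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Mirrors VkMemoryType.
struct MemoryType {
    std::uint32_t propertyFlags;
    std::uint32_t heapIndex;
};

// Returns the first memory type that the resource allows (its bit is set in
// memory_type_bits) and that has all of the required property flags.
std::optional<std::uint32_t> FindMemoryTypeIndex(const std::vector<MemoryType>& memory_types,
                                                 std::uint32_t memory_type_bits,
                                                 std::uint32_t required_properties) {
    assert(memory_types.size() <= 32);  // VK_MAX_MEMORY_TYPES is 32
    for (std::uint32_t i = 0; i < memory_types.size(); ++i) {
        const bool allowed = (memory_type_bits & (1u << i)) != 0;
        const bool suitable =
            (memory_types[i].propertyFlags & required_properties) == required_properties;
        if (allowed && suitable) {
            return i;
        }
    }
    return std::nullopt;
}
```

In a real program, `memory_types` would come from `VkPhysicalDeviceMemoryProperties::memoryTypes`, and `memory_type_bits` from `VkMemoryRequirements::memoryTypeBits` as returned by vkGetBufferMemoryRequirements.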
Because of this limit, it is often necessary to allocate one large block of memory and bind portions of it to individual buffers at different offsets.
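Binding several buffers into one allocation requires an offset for each buffer that respects its alignment requirement. The sketch below computes such offsets. `BufferRequirements` is a stand-in mirroring the `size` and `alignment` fields of `VkMemoryRequirements`, and `PlanSubAllocations` and `SubAllocationPlan` are illustrative names. It assumes that all buffers can share one memory type, which a real program would need to verify through `memoryTypeBits`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the size and alignment fields of VkMemoryRequirements.
struct BufferRequirements {
    std::uint64_t size;
    std::uint64_t alignment;
};

struct SubAllocationPlan {
    std::vector<std::uint64_t> offsets;  // one offset per buffer, in input order
    std::uint64_t total_size;            // size of the single allocation
};

// Rounds value up to the next multiple of alignment.
constexpr std::uint64_t AlignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Packs the buffers back to back, aligning each start offset as required.
SubAllocationPlan PlanSubAllocations(const std::vector<BufferRequirements>& requirements) {
    SubAllocationPlan plan{};
    std::uint64_t cursor = 0;
    for (const BufferRequirements& r : requirements) {
        assert(r.alignment > 0);
        cursor = AlignUp(cursor, r.alignment);
        plan.offsets.push_back(cursor);
        cursor += r.size;
    }
    plan.total_size = cursor;
    return plan;
}
```

Here `total_size` would be the `allocationSize` passed to vkAllocateMemory, and each offset would be the `memoryOffset` argument of vkBindBufferMemory for the corresponding buffer.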
Pipeline
A pipeline is a sequence of processes performed by a device. It serves as a document that describes various information, such as which resources will be used, what stages exist, and which shaders are used in those stages.
The compute pipeline omits several processes compared to the graphics pipeline. While the graphics pipeline involves many stages, such as vertex shader, geometry shader, pixel shader, and ray tracing, the compute pipeline just performs computations on input buffer values and writes the result to the output buffer. Hence, it has only one stage, the compute shader.
Therefore, once the compute shader is determined, a compute pipeline can be created using the vkCreateComputePipelines function.
Descriptor
A descriptor is an object that represents resources used by shaders. Descriptors allow shaders to bind the resources they use.
Descriptors are classified into Descriptor Pool, Descriptor Set Layout, and Descriptor Set as follows:
1. Descriptor Pool
A pool for allocating Descriptor Sets. It acts as a memory allocator.
2. Descriptor Set Layout
Describes the structure of the Descriptor Set. It can be shared among multiple shaders using the same layout.
3. Descriptor Set
An object that contains actual information. It contains information such as which actual buffer each buffer in the set is mapped to and the offset.
Additionally, there are limitations on the resources that can be used when creating Descriptor Sets. Related information can be obtained through the VkPhysicalDeviceLimits structure.
1. VkPhysicalDeviceLimits::maxPushConstantsSize
The maximum size of the Push Constant Buffer.
2. VkPhysicalDeviceLimits::maxPerStageDescriptorUniformBuffers
The maximum number of uniform buffers that can be used in a single stage.
3. VkPhysicalDeviceLimits::maxPerStageDescriptorStorageBuffers
The maximum number of storage buffers that can be used in a single stage.
4. VkPhysicalDeviceLimits::maxPerStageResources
The maximum number of all resources that can be used in a single stage.
5. VkPhysicalDeviceLimits::maxDescriptorSetUniformBuffers
The maximum number of uniform buffers that can be used in a single Descriptor Set.
6. VkPhysicalDeviceLimits::maxDescriptorSetStorageBuffers
The maximum number of storage buffers that can be used in a single Descriptor Set.
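A check against these limits could look like the sketch below. It assumes a compute pipeline that uses a single descriptor set, so the per-stage and per-set counts coincide. `DescriptorLimits` is a stand-in holding only the relevant `VkPhysicalDeviceLimits` fields, and `BufferBinding`, `BufferKind`, and `FitsDescriptorLimits` are illustrative names rather than Vulkan API.

```cpp
#include <cstdint>
#include <vector>

enum class BufferKind { Uniform, Storage };

// One binding of the descriptor set layout.
struct BufferBinding {
    BufferKind kind;
    std::uint32_t descriptor_count;
};

// Stand-in for the relevant fields of VkPhysicalDeviceLimits.
struct DescriptorLimits {
    std::uint32_t maxPerStageDescriptorUniformBuffers;
    std::uint32_t maxPerStageDescriptorStorageBuffers;
    std::uint32_t maxPerStageResources;
    std::uint32_t maxDescriptorSetUniformBuffers;
    std::uint32_t maxDescriptorSetStorageBuffers;
};

// Returns true if the bindings of one set, used by one compute stage,
// stay within every limit listed above.
bool FitsDescriptorLimits(const std::vector<BufferBinding>& bindings,
                          const DescriptorLimits& limits) {
    std::uint64_t uniform_count = 0;
    std::uint64_t storage_count = 0;
    for (const BufferBinding& binding : bindings) {
        if (binding.kind == BufferKind::Uniform) {
            uniform_count += binding.descriptor_count;
        } else {
            storage_count += binding.descriptor_count;
        }
    }
    return uniform_count <= limits.maxPerStageDescriptorUniformBuffers &&
           storage_count <= limits.maxPerStageDescriptorStorageBuffers &&
           uniform_count + storage_count <= limits.maxPerStageResources &&
           uniform_count <= limits.maxDescriptorSetUniformBuffers &&
           storage_count <= limits.maxDescriptorSetStorageBuffers;
}
```

A kernel with two storage buffers and one uniform buffer, like the template later in this post, passes this check on any device whose limits are at least that large.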
Command Pool & Command Buffer
A command buffer is a buffer that records commands to be executed by the GPU. Commands are written to this buffer and then submitted to the queue all at once. A command pool is the memory allocator from which command buffers are allocated. Once commands have been recorded into a command buffer, they can be submitted repeatedly without being recorded again, which makes execution efficient.
In a compute shader, command buffers record commands using the vkCmd* family of functions between vkBeginCommandBuffer and vkEndCommandBuffer.
vkCmdBindDescriptorSets: Binds the descriptor set to be used. The bound buffer will be used from this point onwards.
vkCmdDispatch: Executes the compute shader.
vkCmdCopyBuffer: Copies buffers.
vkCmdPipelineBarrier: Inserts a barrier between pipelines. Memory barriers can also be inserted using this function.
vkCmdPushConstants: Writes values to the Push Constant Buffer.
vkCmdSetEvent: Raises an event.
vkCmdResetEvent: Resets an event.
vkCmdWaitEvents: Waits for an event to occur.
If you want to clear the commands in a command buffer and reuse it, call the vkResetCommandBuffer function before vkBeginCommandBuffer.
The constraints when executing a compute shader are as follows:
1. VkPhysicalDeviceLimits::maxComputeSharedMemorySize
The maximum size of the shared memory (local memory in OpenCL).
2. VkPhysicalDeviceLimits::maxComputeWorkGroupCount
The maximum number of workgroups that can be dispatched in each dimension (the arguments to vkCmdDispatch).
3. VkPhysicalDeviceLimits::maxComputeWorkGroupSize
The maximum local workgroup size in each dimension.
4. VkPhysicalDeviceLimits::maxComputeWorkGroupInvocations
The maximum number of invocations in a single local workgroup. The product of all dimension values of the local workgroup size must not exceed this value.
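These constraints translate into a small calculation when choosing the arguments for vkCmdDispatch. The sketch below rounds the problem size up to whole workgroups and rejects configurations that exceed the limits. `ComputeLimits` is a stand-in for the three `VkPhysicalDeviceLimits` fields involved, and `PlanDispatch` is an illustrative name. Shared memory usage is not checked here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Stand-in for the relevant fields of VkPhysicalDeviceLimits.
struct ComputeLimits {
    std::array<std::uint32_t, 3> maxComputeWorkGroupCount;
    std::array<std::uint32_t, 3> maxComputeWorkGroupSize;
    std::uint32_t maxComputeWorkGroupInvocations;
};

// Computes the workgroup counts needed to cover global_size with workgroups
// of local_size. Returns std::nullopt if any device limit would be exceeded.
std::optional<std::array<std::uint32_t, 3>> PlanDispatch(
    const std::array<std::uint64_t, 3>& global_size,
    const std::array<std::uint32_t, 3>& local_size,
    const ComputeLimits& limits) {
    std::uint64_t invocations = 1;
    std::array<std::uint32_t, 3> group_count{};
    for (std::size_t d = 0; d < 3; ++d) {
        assert(local_size[d] > 0);
        if (local_size[d] > limits.maxComputeWorkGroupSize[d]) {
            return std::nullopt;
        }
        invocations *= local_size[d];
        // Round up so that a partially filled last workgroup is still launched.
        const std::uint64_t groups = (global_size[d] + local_size[d] - 1) / local_size[d];
        if (groups > limits.maxComputeWorkGroupCount[d]) {
            return std::nullopt;
        }
        group_count[d] = static_cast<std::uint32_t>(groups);
    }
    if (invocations > limits.maxComputeWorkGroupInvocations) {
        return std::nullopt;
    }
    return group_count;
}
```

Because the count is rounded up, the shader should compare its global invocation index against the real element count and return early for the padding invocations.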
Fences, Semaphores, Events, and Barriers
Fence
A fence is an object used for synchronization between the host and the command queue, supporting tasks such as waiting for the completion of enqueued commands. If a fence is passed to the vkQueueSubmit function when the commands recorded in a command buffer are submitted to the queue, the host can wait until those commands have finished.
A fence is created using the vkCreateFence function, and the host can wait until one or more fences are signaled using the vkWaitForFences function.
Semaphore
A semaphore is a synchronization object used to insert dependencies between command queues. It supports tasks such as waiting until a specific queue is finished or executing commands after a specific queue is finished. A semaphore is created using the vkCreateSemaphore function, and dependencies are set when submitting commands using the vkQueueSubmit function.
Event
An event is a synchronization object used to insert dependencies between commands. It can only be used between commands in the same queue. An event supports tasks such as waiting until a specific command is finished or executing commands after a specific command is executed. It is created using the vkCreateEvent function, and unlike other synchronization objects, it supports synchronization not only within the device but also between device ↔ host. The host can signal using vkSetEvent and vkResetEvent, and the device can signal using vkCmdSetEvent and vkCmdResetEvent when recording commands in the command buffer.
Barrier
A barrier is a synchronization object used to insert dependencies between commands. It performs a similar role to an event but can only be used within the device and works between queues. Additionally, it is used for memory synchronization and is recorded in the command buffer using the vkCmdPipelineBarrier function.
Execute Kernel with Vulkan
Template compute shader code
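#version 450
// Local workgroup size; 64 is a placeholder value.
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(set=0, binding=0) readonly buffer Input {
    float input_tensor[];
} input_buf; // "input" and "output" are reserved words in GLSL
layout(set=0, binding=1) writeonly buffer Output {
    float output_tensor[];
} output_buf;
layout(set=0, binding=2) uniform UniformArgs {
    // Declare uniform parameters here (a block needs at least one member).
} uniform_args;
layout(push_constant) uniform PushArgs {
    // Declare push constant parameters here (a block needs at least one member).
} push_args;
void main() {
    // Kernel body goes here.
}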
1. Create an instance.
VkApplicationInfo appInfo{};
appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
appInfo.pApplicationName = "Hello World App";
appInfo.applicationVersion = VK_MAKE_VERSION(0, 0, 1);
appInfo.pEngineName = "No Engine";
appInfo.engineVersion = VK_MAKE_VERSION(0, 0, 1);
appInfo.apiVersion = VK_API_VERSION_1_2;
VkInstanceCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
createInfo.pApplicationInfo = &appInfo;
const char* layers[1] = {
    "VK_LAYER_KHRONOS_validation"
};
VkValidationFeatureEnableEXT enable_features[] = {
    VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT,
    VK_VALIDATION_FEATURE_ENABLE_BEST_PRACTICES_EXT
};
VkValidationFeaturesEXT validation_features{};
validation_features.sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT;
validation_features.enabledValidationFeatureCount = 2;
validation_features.pEnabledValidationFeatures = enable_features;
createInfo.enabledLayerCount = 1;
createInfo.ppEnabledLayerNames = layers;
createInfo.pNext = &validation_features;
// The instance must exist before the debug callback can be registered on it.
VkInstance instance;
vkCreateInstance(&createInfo, nullptr, &instance);
VkDebugReportCallbackCreateInfoEXT debug_callback_create_info{};
debug_callback_create_info.sType = VK_STRUCTURE_TYPE_DEBUG_REPORT_CALLBACK_CREATE_INFO_EXT;
debug_callback_create_info.flags = VK_DEBUG_REPORT_INFORMATION_BIT_EXT|VK_DEBUG_REPORT_WARNING_BIT_EXT|VK_DEBUG_REPORT_PERFORMANCE_WARNING_BIT_EXT|VK_DEBUG_REPORT_ERROR_BIT_EXT|VK_DEBUG_REPORT_DEBUG_BIT_EXT;
debug_callback_create_info.pfnCallback = DebugCallback;
VkDebugReportCallbackEXT debug_callback;
// This function belongs to the VK_EXT_debug_report extension, which must be enabled on the instance; extension functions are normally loaded with vkGetInstanceProcAddr.
vkCreateDebugReportCallbackEXT(instance, &debug_callback_create_info, nullptr, &debug_callback);
2. Search for an appropriate physical device.
std::vector<VkPhysicalDevice> physical_devices;
uint32_t device_count = 0;
vkEnumeratePhysicalDevices(instance, &device_count, nullptr);
physical_devices.resize(device_count);
vkEnumeratePhysicalDevices(instance, &device_count, physical_devices.data());
VkPhysicalDevice selected_physical_device = VK_NULL_HANDLE;
for (auto physical_device : physical_devices) {
VkPhysicalDeviceProperties device_properties;
VkPhysicalDeviceFeatures device_features;
VkPhysicalDeviceMemoryProperties device_memory_properties;
std::vector<VkQueueFamilyProperties> queue_family_properties;
uint32_t queue_family_property_count;
vkGetPhysicalDeviceProperties(physical_device, &device_properties);
vkGetPhysicalDeviceFeatures(physical_device, &device_features);
vkGetPhysicalDeviceMemoryProperties(physical_device, &device_memory_properties);
vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &queue_family_property_count, nullptr);
queue_family_properties.resize(queue_family_property_count);
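vkGetPhysicalDeviceQueueFamilyProperties(physical_device, &queue_family_property_count, queue_family_properties.data());
// Check the queried properties against your requirements here. As a placeholder policy, this keeps the first device found.
selected_physical_device = physical_device;
break;
}
assert(selected_physical_device != VK_NULL_HANDLE);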
3. Once an appropriate physical device is found, create a logical device.
std::optional<uint32_t> queue_family_index;
std::vector<VkQueueFamilyProperties> queue_family_properties;
uint32_t queue_family_property_count;
vkGetPhysicalDeviceQueueFamilyProperties(selected_physical_device, &queue_family_property_count, nullptr);
queue_family_properties.resize(queue_family_property_count);
vkGetPhysicalDeviceQueueFamilyProperties(selected_physical_device, &queue_family_property_count, queue_family_properties.data());
for (uint32_t i = 0; i < queue_family_property_count; ++i) {
auto& properties = queue_family_properties[i];
if (properties.queueFlags & VK_QUEUE_COMPUTE_BIT) {
queue_family_index = i;
break;
}
}
assert(queue_family_index.has_value());
float queuePriority = 1.0f;
VkDeviceQueueCreateInfo queueCreateInfo{};
queueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueCreateInfo.queueFamilyIndex = queue_family_index.value();
queueCreateInfo.queueCount = 1;
queueCreateInfo.pQueuePriorities = &queuePriority;
VkDeviceCreateInfo deviceCreateInfo{};
deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceCreateInfo.pQueueCreateInfos = &queueCreateInfo;
deviceCreateInfo.queueCreateInfoCount = 1;
deviceCreateInfo.enabledExtensionCount = 0;
deviceCreateInfo.enabledLayerCount = 0;
VkDevice device;
vkCreateDevice(selected_physical_device, &deviceCreateInfo, nullptr, &device);
4. Retrieve the queue from the logical device.
VkQueue queue;
vkGetDeviceQueue(device, queue_family_index.value(), 0, &queue);
5. Create input and output buffers.
VkBufferCreateInfo input_buffer_create_info{};
VkBufferCreateInfo output_buffer_create_info{};
VkBufferCreateInfo uniform_buffer_create_info{};
input_buffer_create_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
input_buffer_create_info.size = INPUT_SIZE;
input_buffer_create_info.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
input_buffer_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
output_buffer_create_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
output_buffer_create_info.size = OUTPUT_SIZE;
output_buffer_create_info.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
output_buffer_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
uniform_buffer_create_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
uniform_buffer_create_info.size = sizeof(UniformArgs);
uniform_buffer_create_info.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;
uniform_buffer_create_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
VkBuffer input_buffer;
VkBuffer output_buffer;
VkBuffer uniform_buffer;
vkCreateBuffer(device, &input_buffer_create_info, nullptr, &input_buffer);
vkCreateBuffer(device, &output_buffer_create_info, nullptr, &output_buffer);
vkCreateBuffer(device, &uniform_buffer_create_info, nullptr, &uniform_buffer);
6. Allocate device memory for the buffers.
VkPhysicalDeviceMemoryProperties device_memory_properties;
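vkGetPhysicalDeviceMemoryProperties(selected_physical_device, &device_memory_properties);
// Host-visible, coherent memory so that the buffers can be mapped and written by the host.
// Use VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT instead for buffers that the host never maps.
constexpr VkMemoryPropertyFlags required_properties =
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;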
VkMemoryRequirements input_buffer_requirements;
VkMemoryRequirements output_buffer_requirements;
VkMemoryRequirements uniform_buffer_requirements;
vkGetBufferMemoryRequirements(device, input_buffer, &input_buffer_requirements);
vkGetBufferMemoryRequirements(device, output_buffer, &output_buffer_requirements);
vkGetBufferMemoryRequirements(device, uniform_buffer, &uniform_buffer_requirements);
auto input_memory_type_index = FindOptimalHeap(input_buffer_requirements.memoryTypeBits, required_properties);
auto output_memory_type_index = FindOptimalHeap(output_buffer_requirements.memoryTypeBits, required_properties);
auto uniform_memory_type_index = FindOptimalHeap(uniform_buffer_requirements.memoryTypeBits, required_properties);
assert(input_memory_type_index.has_value());
assert(output_memory_type_index.has_value());
assert(uniform_memory_type_index.has_value());
VkDeviceMemory input_buffer_memory;
VkDeviceMemory output_buffer_memory;
VkDeviceMemory uniform_buffer_memory;
VkMemoryAllocateInfo input_memory_allocate_info{};
VkMemoryAllocateInfo output_memory_allocate_info{};
VkMemoryAllocateInfo uniform_memory_allocate_info{};
input_memory_allocate_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
input_memory_allocate_info.allocationSize = input_buffer_requirements.size;
input_memory_allocate_info.memoryTypeIndex = input_memory_type_index.value();
output_memory_allocate_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
output_memory_allocate_info.allocationSize = output_buffer_requirements.size;
output_memory_allocate_info.memoryTypeIndex = output_memory_type_index.value();
uniform_memory_allocate_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
uniform_memory_allocate_info.allocationSize = uniform_buffer_requirements.size;
uniform_memory_allocate_info.memoryTypeIndex = uniform_memory_type_index.value();
vkAllocateMemory(device, &input_memory_allocate_info, nullptr, &input_buffer_memory);
vkAllocateMemory(device, &output_memory_allocate_info, nullptr, &output_buffer_memory);
vkAllocateMemory(device, &uniform_memory_allocate_info, nullptr, &uniform_buffer_memory);
7. Bind the device memory to the buffers.
vkBindBufferMemory(device, input_buffer, input_buffer_memory, 0);
vkBindBufferMemory(device, output_buffer, output_buffer_memory, 0);
vkBindBufferMemory(device, uniform_buffer, uniform_buffer_memory, 0);
8. If necessary, map the input buffer and input the required data.
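void* input_ptr;
vkMapMemory(device, input_buffer_memory, 0, VK_WHOLE_SIZE, 0, &input_ptr);
// Write the input data through input_ptr here (for example with std::memcpy) before unmapping.
vkUnmapMemory(device, input_buffer_memory);
input_ptr = nullptr;
void* uniform_ptr;
vkMapMemory(device, uniform_buffer_memory, 0, VK_WHOLE_SIZE, 0, &uniform_ptr);
// Write the uniform parameters through uniform_ptr here before unmapping.
vkUnmapMemory(device, uniform_buffer_memory);
uniform_ptr = nullptr;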
9. Create descriptor sets.
VkDescriptorSetLayoutBinding bindings[] = {
{ 0, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, nullptr },
{ 1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, nullptr },
{ 2, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1, VK_SHADER_STAGE_COMPUTE_BIT, nullptr }
};
VkDescriptorSetLayoutCreateInfo desc_set_layout_create_info{};
desc_set_layout_create_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
desc_set_layout_create_info.bindingCount = 3;
desc_set_layout_create_info.pBindings = bindings;
VkDescriptorSetLayout desc_set_layout;
vkCreateDescriptorSetLayout(device, &desc_set_layout_create_info, nullptr, &desc_set_layout);
VkDescriptorPoolSize desc_pool_size[] = {
{ VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 2 },
{ VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1 }
};
VkDescriptorPoolCreateInfo desc_pool_create_info{};
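desc_pool_create_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
// Required because the descriptor set is freed individually with vkFreeDescriptorSets during cleanup.
desc_pool_create_info.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;
desc_pool_create_info.maxSets = 1;
desc_pool_create_info.poolSizeCount = 2;
desc_pool_create_info.pPoolSizes = desc_pool_size;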
VkDescriptorPool desc_pool;
vkCreateDescriptorPool(device, &desc_pool_create_info, nullptr, &desc_pool);
VkDescriptorSetAllocateInfo desc_set_allocate_info{};
desc_set_allocate_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
desc_set_allocate_info.descriptorPool = desc_pool;
desc_set_allocate_info.descriptorSetCount = 1;
desc_set_allocate_info.pSetLayouts = &desc_set_layout;
VkDescriptorSet desc_set;
vkAllocateDescriptorSets(device, &desc_set_allocate_info, &desc_set);
10. Connect the descriptor sets to the buffers.
VkDescriptorBufferInfo input_desc_buffer_info{};
VkDescriptorBufferInfo output_desc_buffer_info{};
VkDescriptorBufferInfo uniform_desc_buffer_info{};
input_desc_buffer_info.buffer = input_buffer;
input_desc_buffer_info.offset = 0;
input_desc_buffer_info.range = VK_WHOLE_SIZE;
output_desc_buffer_info.buffer = output_buffer;
output_desc_buffer_info.offset = 0;
output_desc_buffer_info.range = VK_WHOLE_SIZE;
uniform_desc_buffer_info.buffer = uniform_buffer;
uniform_desc_buffer_info.offset = 0;
uniform_desc_buffer_info.range = VK_WHOLE_SIZE;
VkWriteDescriptorSet write_desc_set[] = {
{ VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, nullptr, desc_set, 0, 0, 1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, nullptr, &input_desc_buffer_info, nullptr },
{ VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, nullptr, desc_set, 1, 0, 1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, nullptr, &output_desc_buffer_info, nullptr },
{ VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET, nullptr, desc_set, 2, 0, 1, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, nullptr, &uniform_desc_buffer_info, nullptr },
};
vkUpdateDescriptorSets(device, 3, write_desc_set, 0, nullptr);
11. Load the shader.
std::vector<char> shader_code; // fill with the compiled SPIR-V binary of the compute shader before use
VkShaderModuleCreateInfo shader_module_create_info{};
shader_module_create_info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
shader_module_create_info.pCode = reinterpret_cast<uint32_t*>(shader_code.data());
shader_module_create_info.codeSize = shader_code.size();
VkShaderModule shader_module;
vkCreateShaderModule(device, &shader_module_create_info, nullptr, &shader_module);
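The step above leaves open how the shader binary is obtained. One common approach is to compile the GLSL source to SPIR-V ahead of time (for example with glslangValidator or glslc) and read the resulting file at startup. The following is a minimal sketch of such a loader. The function name `LoadSpirv` is an assumption, and it assumes the file was produced with the same byte order as the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a SPIR-V binary into 32-bit words, which is the unit that
// VkShaderModuleCreateInfo::pCode expects.
std::vector<std::uint32_t> LoadSpirv(const std::string& path) {
    // Open at the end of the file so that tellg() reports the file size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) {
        throw std::runtime_error("cannot open " + path);
    }
    const std::streamsize size = file.tellg();
    if (size <= 0 || size % 4 != 0) {
        throw std::runtime_error("SPIR-V size must be a positive multiple of 4: " + path);
    }
    std::vector<std::uint32_t> words(static_cast<std::size_t>(size) / 4);
    file.seekg(0);
    file.read(reinterpret_cast<char*>(words.data()), size);
    if (!file) {
        throw std::runtime_error("failed to read " + path);
    }
    // Every SPIR-V module starts with this magic number.
    if (words[0] != 0x07230203u) {
        throw std::runtime_error("not a SPIR-V module: " + path);
    }
    return words;
}
```

With this loader, `pCode` would be `words.data()` and `codeSize` would be `words.size() * sizeof(std::uint32_t)`, since `codeSize` is measured in bytes.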
12. Create the pipeline.
VkPushConstantRange push_constant_range{};
push_constant_range.offset = 0;
push_constant_range.size = sizeof(PushArgs);
push_constant_range.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
VkPipelineLayoutCreateInfo pipeline_layout_create_info{};
pipeline_layout_create_info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
pipeline_layout_create_info.setLayoutCount = 1;
pipeline_layout_create_info.pSetLayouts = &desc_set_layout;
pipeline_layout_create_info.pushConstantRangeCount = 1;
pipeline_layout_create_info.pPushConstantRanges = &push_constant_range;
VkPipelineLayout pipeline_layout;
vkCreatePipelineLayout(device, &pipeline_layout_create_info, nullptr, &pipeline_layout);
VkComputePipelineCreateInfo compute_pipeline_create_info{};
compute_pipeline_create_info.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
compute_pipeline_create_info.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
compute_pipeline_create_info.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
compute_pipeline_create_info.stage.module = shader_module;
compute_pipeline_create_info.stage.pName = "main";
compute_pipeline_create_info.layout = pipeline_layout;
VkPipeline pipeline;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &compute_pipeline_create_info, nullptr, &pipeline);
13. Create a command pool.
VkCommandPoolCreateInfo command_pool_create_info{};
command_pool_create_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
command_pool_create_info.flags |= VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
command_pool_create_info.queueFamilyIndex = queue_family_index.value();
VkCommandPool command_pool;
vkCreateCommandPool(device, &command_pool_create_info, nullptr, &command_pool);
14. Create a command buffer.
VkCommandBufferAllocateInfo command_buffer_allocate_info{};
command_buffer_allocate_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
command_buffer_allocate_info.commandPool = command_pool;
command_buffer_allocate_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
command_buffer_allocate_info.commandBufferCount = 1;
VkCommandBuffer command_buffer;
vkAllocateCommandBuffers(device, &command_buffer_allocate_info, &command_buffer);
15. Write commands in the created command buffer.
VkCommandBufferBeginInfo command_buffer_begin_info{};
command_buffer_begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
// Use VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT instead if the commands are recorded again before every submission.
command_buffer_begin_info.flags = VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT;
vkResetCommandBuffer(command_buffer, 0);
vkBeginCommandBuffer(command_buffer, &command_buffer_begin_info);
vkCmdBindPipeline(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline_layout, 0, 1, &desc_set, 0, nullptr);
vkCmdPushConstants(command_buffer, pipeline_layout, VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(PushArgs), &push_args);
vkCmdDispatch(command_buffer, WORKGROUP_X, WORKGROUP_Y, WORKGROUP_Z);
vkEndCommandBuffer(command_buffer);
16. Submit the written commands to the queue.
VkFenceCreateInfo fence_create_info{};
fence_create_info.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
VkFence fence;
vkCreateFence(device, &fence_create_info, nullptr, &fence);
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &command_buffer;
vkQueueSubmit(queue, 1, &submit_info, fence);
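// The timeout is given in nanoseconds; a timeout of 0 would return immediately without waiting.
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);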
17. If necessary, map the output buffer and print the result data.
void* output_ptr;
vkMapMemory(device, output_buffer_memory, 0, VK_WHOLE_SIZE, 0, &output_ptr);
// Read the results through output_ptr here before unmapping.
vkUnmapMemory(device, output_buffer_memory);
output_ptr = nullptr;
If needed, repeat steps 8, 14, 15, 16, and 17 in order, or skip steps 14 and 15 if reusing commands. Reset the fence with vkResetFences before each resubmission.
18. Release all allocated resources.
vkFreeCommandBuffers(device, command_pool, 1, &command_buffer);
vkDestroyFence(device, fence, nullptr);
vkDestroyCommandPool(device, command_pool, nullptr);
vkDestroyPipeline(device, pipeline, nullptr);
vkDestroyPipelineLayout(device, pipeline_layout, nullptr);
vkDestroyShaderModule(device, shader_module, nullptr);
vkFreeDescriptorSets(device, desc_pool, 1, &desc_set);
vkDestroyDescriptorPool(device, desc_pool, nullptr);
vkDestroyDescriptorSetLayout(device, desc_set_layout, nullptr);
vkFreeMemory(device, uniform_buffer_memory, nullptr);
vkFreeMemory(device, output_buffer_memory, nullptr);
vkFreeMemory(device, input_buffer_memory, nullptr);
vkDestroyBuffer(device, uniform_buffer, nullptr);
vkDestroyBuffer(device, output_buffer, nullptr);
vkDestroyBuffer(device, input_buffer, nullptr);
vkDestroyDevice(device, nullptr);
vkDestroyDebugReportCallbackEXT(instance, debug_callback, nullptr);
vkDestroyInstance(instance, nullptr);
We have explained the concepts and usage of Vulkan. We hope this will be helpful for those looking to utilize GPU acceleration with Vulkan. Our goal is to complete the development of GPU support features so that you can perform GPU acceleration through Optimium and Nadya in an abstracted and convenient way, without having to do the tasks yourself. In the next post, we will cover OpenCL, so please look forward to it!