A glance at Compute Shaders

Description

"In a previous contribution ([computing flock of boids using GPU](https:\/\/thegodotbarn.com\/contributions\/snippet\/206\/basic-flocking-gpu-version)), we've seen how we can use compute shader to calculate a **vast amount of data using parallelism and fast memory access**. It can go further and delegate an **entire pipeline of data computation to the GPU that will be drawn without even using CPU**, leaving computation time for other matters.\r\n\r\nThe following article will focus on how to create such a pipeline through the practical example of a flock of boids.\r\n\r\n> [!IMPORTANT]\r\n> What will **be discussed here** :\r\n> * Manage compute shader in Godot through GDScript\r\n> * Link shaders between them so all the computations can be done directly in GPU\r\n> \r\n> What will **not be discussed here** :\r\n> * GLSL grammar and how to code compile shader\r\n> * Specificities about the data structure, byte buffer alignment, etc.\r\n\r\n## The Flock\r\n\r\nLet say we want to animate a set of objects that interact between themselves. It requires to implement an algorithm that makes great use of distances and vectors computations. Using a scripting language like GDScript and its interpreter or even a language like C# and its virtual machine, the overhead tied to their nature will quickly limit us. We'll be able to compute the state of at most a couple of hundreds of objects, leaving us with nothing for the rest.\r\nBesides, scripting tools are not meant to make heavy calculus. Of course, we could use native languages and build libraries or modules to achieve better result.\r\n\r\nOr, we could directly use the GPU. Enters the **compute shaders**.\r\n\r\nA compute shader is a function that will be run on GPU, a specified amount of time (we talk about **invocations**). Each invocation can be identified. For example, we can use those invocation identification to determine for which object of the flock we want to compute the state. 
This lets us compute the data in **parallel** rather than sequentially, which speeds up the whole process.

> [!NOTE]
> Invocations are distributed over the cores of the GPU. For the sake of simplicity, we won't talk here about workgroups and their layout. These are organized in three dimensions, along the X, Y and Z axes, and can be subdivided into local workgroups. More complex parallel algorithms rely on this organization, which is not our case here. This will be the subject of a future article.
> For now, we will just use the basic invocation case, on only one axis (X), with just a little performance spice by grouping invocations into packs of 16 local workgroups.

The algorithm for simulating a flock of birds/fishes consists of a big 'for loop' over each member of the flock to determine its interaction with the rest of the flock. So, to parallelize this big 'for loop', we will turn its body into a compute shader that will be invoked for each member of the flock.

> [!TIP]
> Details about the underlying flocking algorithm can be found on [Craig Reynolds' page](https://www.red3d.com/cwr/boids/).

{#the_compute_shader}
## The Compute Shader

In Godot, any kind of shader is a **resource**. A compute shader can be loaded either from a file or from a simple string stored in the script. In either case, it will need a `RenderingDevice` to be **compiled into SPIR-V** (Standard Portable Intermediate Representation) and then turned into a shader that we can use.

The `RenderingDevice` can be either a local one (see it as dedicated to the execution of our shader) or the main one, used by the `RenderingServer`.
The latter will be needed if we want our shader to share data with other shaders without using the CPU as a go-between to transfer data.

Let's explore the two cases:

```
# Load from file
func import_shader_from_file(path : String) -> RDShaderSPIRV:
	var shader_file := load(path)
	assert(shader_file != null and shader_file is RDShaderFile)
	var spirv := (shader_file as RDShaderFile).get_spirv()
	assert(spirv.compile_error_compute.is_empty())
	return spirv

# Load from string
func import_shader_from_string(code : String, rd : RenderingDevice) -> RDShaderSPIRV:
	var shader_source := RDShaderSource.new()
	shader_source.language = RenderingDevice.SHADER_LANGUAGE_GLSL
	shader_source.source_compute = code
	var spirv := rd.shader_compile_spirv_from_source(shader_source)
	assert(spirv.compile_error_compute.is_empty())
	return spirv
```

> [!NOTE]
> You might have noticed that the shader language could also be HLSL, which can be handy for our friends coming from Unity :)

Once we have the SPIR-V resource, we just need to create (or register) the shader in the `RenderingDevice`:

```
# 'rd' is a variable containing a reference to the chosen instance of RenderingDevice:
var rd := RenderingServer.get_rendering_device() # fetching the main rendering device
# var rd := RenderingServer.create_local_rendering_device() # creating a local rendering device, separated from the main rendering pipeline.

var spirv := import_shader_from_string(our_shader_code, rd)

var shader_rid : RID = rd.shader_create_from_spirv(spirv)
assert(shader_rid.is_valid())
```

We retrieve a **resource identifier that points to the linked shader**.

## Feeding the Shader

As with any program, a **compute shader manipulates data**. This data can be local to the shader (non-permanent) or global and shared amongst the invocations.
Such global data are called uniforms (or sometimes storage buffers). They can be **sent, updated or read from the GPU**.

> [!IMPORTANT]
> Updating or retrieving data from the GPU is a costly operation. It should be done carefully, at strategic moments: for example, once at startup when setting the initial data, or sparingly, by updating a few key values when needed. Retrieving a whole buffer (or texture) each frame could cause slowdowns and consume a vast amount of bandwidth between the CPU and the GPU.

Global data have to be declared (and their space reserved in memory), then bound to one or more shaders. That's what allows us to transfer data from one shader to another seamlessly (or even broadcast the data to a set of shaders).

In our example (flock of boids), we only use two types of data: **uniform buffers** and **storage buffers**. The difference between the two is tied to their **size, structure and access permissions** (qualifiers).
* **Uniform buffers** can't be modified by the shader (read only). They are constants passed to the shader as parameters and usually consist of a well-defined set of data. The size of a uniform buffer is limited.
* **Storage buffers** are large sets of data. They can be modified (if the right qualifier is set). They are usually employed as output data. Their size is not constrained (other than by the amount of VRAM on the graphics card).

Such data can be **declared independently from shaders**. They are just **chunks of VRAM**.
In Godot, they can be declared using the following:

{#register_buffer}
```
func register_buffer(rendering_device : RenderingDevice, type : RenderingDevice.UniformType, size : int, bytes : PackedByteArray = PackedByteArray()) -> RID:
	var buffer_rid : RID
	if type == RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER:
		buffer_rid = rendering_device.storage_buffer_create(size, bytes)
	else:
		buffer_rid = rendering_device.uniform_buffer_create(size, bytes)
	assert(buffer_rid.is_valid())
	return buffer_rid
```

> [!NOTE]
> The size does matter. The GPU needs to know the amount of data to allocate. But the initial content can be omitted.

> [!NOTE]
> At this point, the data structure doesn't matter.

Again, we retrieve an RID. RIDs are identifiers used by Godot to retrieve the resource (and, by extension, identifiers used by the graphics card itself to retrieve any kind of data).
{#getting_multimeshinstance_buffer}
Sometimes, such **storage already exists** and its RID can be retrieved so we can access the data. This is the case for the **transformation buffer of a `MultiMeshInstance3D`**. Its RID can be retrieved using the `RenderingServer` function `multimesh_get_buffer_rd_rid` as follows:

```
# output_mesh_instance is a reference to a MultiMeshInstance3D node.
var multimesh_rid := output_mesh_instance.multimesh.get_rid()
var multimesh_buffer_rid := RenderingServer.multimesh_get_buffer_rd_rid(multimesh_rid)
```

{#declaring_the_uniforms}
In our shader, we define **binding points**. They will shape the **structure of the data** we want to manipulate.
If you pay attention to the beginning of our example compute shader, you'll see declarations such as:

```glsl
struct Boid {
	vec3 position[2];
	vec3 speed;
};

layout(set = 0, binding = 0, std430) restrict buffer Flock {
	Boid boids[];
} flock;

layout(binding = 1, std140) uniform Behavior {
	float attraction_distance;
	float repulsion_distance;
	float separation;
	float alignment;
	float cohesion;
	float purpose;
	float acceleration;
	float max_velocity;
};
```

The first one is a storage buffer. It stores the state of each boid in the flock, according to the declared structure, in an unbounded array.
The second one is a uniform buffer. It stores parameters about the flock behavior. They are constants (but can be updated by a CPU call).
Each one is associated with a `binding` number. The number is used to link the buffers we previously reserved to those structures.

```
func create_uniform(binding : int, type : RenderingDevice.UniformType, buffer_rid : RID) -> RDUniform:
	var uniform := RDUniform.new()
	uniform.binding = binding
	uniform.uniform_type = type
	uniform.add_id(buffer_rid)
	return uniform
```

Once we have declared all the buffers and obtained a **list of uniforms**, we can **link them to the shader** as follows:

```
var my_buffer_id := register_buffer(rd, RenderingDevice.UNIFORM_TYPE_UNIFORM_BUFFER, my_buffer_size)
var my_uniform := create_uniform(binding_number_in_the_shader, RenderingDevice.UNIFORM_TYPE_UNIFORM_BUFFER, my_buffer_id)

var uniforms := [ my_uniform, another_uniform ] # make an array with all the needed uniforms

# Use the 'shader_rid' we previously obtained from the shader compilation.
var uniform_set := rd.uniform_set_create(uniforms, shader_rid, 0)
```

`shader_rid` refers to the [shader identifier we obtained previously](#the_compute_shader).

> [!NOTE]
> Once
again, the uniform set is referenced by an RID. This identifier should be kept for later use, such as resource cleanup when the program ends or the compute shader is no longer in use.

## The Pipeline

We are almost done. The last thing we need before **invoking the compute shader** is to create its **computation pipeline**. It will wrap up the shader and the [previously declared uniforms together](#declaring_the_uniforms).

```
var pipeline := rd.compute_pipeline_create(shader_rid)
```

Once again, we use the `shader_rid` we've obtained [previously](#the_compute_shader).

We are now ready to **invoke our compute shader**. To do so, we simply have to do the following:
```
# 'boid_count' is assumed to be a member variable holding the number of boids.
func run_shader(rd : RenderingDevice, pipeline : RID, uniform_set : RID) -> void:
	var list := rd.compute_list_begin()
	rd.compute_list_bind_compute_pipeline(list, pipeline)
	rd.compute_list_bind_uniform_set(list, uniform_set, 0)
	rd.compute_list_dispatch(list, boid_count, 1, 1)
	rd.compute_list_end()
```

In this particular example, we will dispatch `boid_count` workgroups along the X axis (one for each object in our flock). _Remember: the workgroup layout will be the subject of a future article._

> [!TIP]
> You can see that we need to keep the `pipeline` and `uniform_set` RIDs at hand in order to perform the invocations. This is the case for almost all the RIDs we have collected through this adventure.

Here, we have only invoked one compute shader, but you can easily imagine a **_compute list_ that binds and dispatches a list of shaders** (hence the name _compute list_).
In more complex use cases, this is required as a single compute shader might not suffice (computation using several passes).

Depending on the nature of the `RenderingDevice`, the invocation process might differ.
Using the main `RenderingDevice`, we proceed as follows:
```
RenderingServer.call_on_render_thread(run_shader.bind(RenderingServer.get_rendering_device(), pipeline, uniform_set))
```
But with a local `RenderingDevice`, we have to manually submit the compute list.
```
run_shader(rd, pipeline, uniform_set)
rd.submit()
rd.sync()
```

## Updating the Buffers

It is often necessary to **update some information** to drive the computation results. To do so, one just has to call the `buffer_update` function of the hosting `RenderingDevice`. A good example is the `_update_behavior_values` function of [the flock of boids script](https://thegodotbarn.com/contributions/snippet/206/basic-flocking-gpu-version).

```
func _update_behavior_values() -> void:
	if need_behavior_update:
		var flock_behavior := PackedFloat32Array()
		flock_behavior.resize(8)
		flock_behavior[0] = behavior_attraction_distance
		flock_behavior[1] = behavior_repulsion_distance
		flock_behavior[2] = behavior_separation
		flock_behavior[3] = behavior_alignment
		flock_behavior[4] = behavior_cohesion
		flock_behavior[5] = behavior_purpose
		flock_behavior[6] = capacity_acceleration
		flock_behavior[7] = capacity_max_velocity
		var flock_behavior_bytes := flock_behavior.to_byte_array()
		rd.buffer_update(flock_behavior_buffer, 0, flock_behavior_bytes.size(), flock_behavior_bytes)
		need_behavior_update = false
```

Here, `flock_behavior_buffer` refers to a buffer we have [previously registered](#register_buffer).

## Linking The Shaders

In the case of our example, we want the **compute shader** to process data in order to **set the transformations
of a `MultiMeshInstance3D`**. As [mentioned before](#getting_multimeshinstance_buffer), the buffer where the `MultiMeshInstance3D` stores all its transformations is accessible. We can retrieve its RID and, just as we've done before, bind it to our compute shader, as follows:
```
var multimesh_rid := output_mesh_instance.multimesh.get_rid()
var multimesh_buffer_rid := RenderingServer.multimesh_get_buffer_rd_rid(multimesh_rid)
assert(multimesh_buffer_rid.is_valid())
var multimesh_input_uniform := create_uniform(the_binding_number, RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER, multimesh_buffer_rid)
```

The obtained uniform shall then be added to the array of uniforms we use to declare the uniform set.

> [!WARNING]
> The `MultiMeshInstance3D` is managed by the main `RenderingDevice`. This means that the compute shader must be run on the same `RenderingDevice` and not on a local one.

And that's it!

## Conclusion

In this article, we've taken a glimpse at how to use compute shaders within Godot. In light of the information presented here, I encourage you to study and dissect what has been done in [the flock of boids](https://thegodotbarn.com/contributions/snippet/206/basic-flocking-gpu-version) contribution.

For any questions, don't hesitate to reach me in the comment section or on our [Discord server](https://discord.gg/KqHTFEJXw9)."
