Vulkan Foliage Rendering Using GPU Instancing
I was watching Acerola's video on foliage rendering and I liked the idea of rendering millions of grass blades. It was a great opportunity to play around with GPU instancing and indirect draw.
What I'll Be Using
The features and extensions used in this approach are:
- VK_EXT_buffer_device_address: This extension allows using pointers in GLSL and passing them in push constants or buffers. Together with VK_EXT_descriptor_indexing, it makes dealing with buffers and textures so much easier and nicer IMHO.
- multiDrawIndirect: This feature allows multiple draw calls in a single indirect buffer. I make use of it to draw multiple LODs of the grass.
How It Works
Basically, there is a compute pass that generates data about the grass blades, stores it in a buffer, does frustum culling and LOD selection, and fills the indirect commands. A graphics pass then draws the blades.
Compute Pass
Our grass blade is defined by the following:
struct GrassBlade {
    vec4 pos_bend;  // xyz = position, w = bend factor
    vec4 anim_size; // x = width, y = height, z = pitch, w = sin_term
};
We want to generate all of this data, starting with the position.
Position (displacement from the center)
We start by generating random positions inside a rectangular area defined by its center, width, and height, using the following approach:
vec3 rand(uvec3 v) {
v = v * 1664525u + 1013904223u;
v.x += v.y * v.z;
v.y += v.z * v.x;
v.z += v.x * v.y;
v ^= v >> 16u;
v.x += v.y * v.z;
v.y += v.z * v.x;
v.z += v.x * v.y;
return vec3(v) * (1.0 / float(0xffffffffu));
}
uvec2 i = gl_GlobalInvocationID.xy;
vec3 pos = pc.aria_center.xyz;
vec3 rand_val = rand(uvec3(i, 378294));
pos.x += (pc.aria.x / 2) - rand_val.x * pc.aria.x;
pos.z += (pc.aria.y / 2) - rand_val.y * pc.aria.y;
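To make the placement step easier to check off the GPU, here is a minimal CPU sketch of the same hash and offset logic in C++. The names rand3 and blade_position and the width/depth parameters are mine, not from the shader, and this is only meant as a reference under those assumptions.

```cpp
#include <cassert>
#include <cstdint>

struct Vec3 { float x, y, z; };

// CPU port of the shader's rand(): hashes three uints into three floats in [0, 1].
// Unsigned overflow wraps in C++, matching GLSL uint arithmetic.
inline Vec3 rand3(uint32_t x, uint32_t y, uint32_t z) {
    x = x * 1664525u + 1013904223u;
    y = y * 1664525u + 1013904223u;
    z = z * 1664525u + 1013904223u;
    x += y * z; y += z * x; z += x * y;
    x ^= x >> 16u; y ^= y >> 16u; z ^= z >> 16u;
    x += y * z; y += z * x; z += x * y;
    const float inv = 1.0f / float(0xffffffffu);
    return {float(x) * inv, float(y) * inv, float(z) * inv};
}

// Place blade (ix, iy) inside a rectangle centred at `center`,
// `width` along x and `depth` along z (hypothetical parameter names).
inline Vec3 blade_position(uint32_t ix, uint32_t iy, Vec3 center,
                           float width, float depth) {
    assert(width > 0.0f && depth > 0.0f);
    const Vec3 r = rand3(ix, iy, 378294u);
    Vec3 pos = center;
    pos.x += width * 0.5f - r.x * width;   // lands in [center.x - w/2, center.x + w/2]
    pos.z += depth * 0.5f - r.y * depth;   // lands in [center.z - d/2, center.z + d/2]
    return pos;
}
```

Because the hash only depends on the invocation ID and a constant seed, every blade lands in the same spot each frame, so nothing has to be stored between frames to keep the field stable.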
Height & Width
Then it generates UV coordinates for each grass blade, which are used to sample a texture:
vec2 bottom_left_corner = pc.aria_center.xz - (pc.aria.xy / 2.0);
vec2 upper_right_corner = pc.aria_center.xz + (pc.aria.xy / 2.0);
// Offset by the corner so uv runs from 0 to 1 across the area.
vec2 uv = (pos.xz - bottom_left_corner) / (upper_right_corner - bottom_left_corner);
After that, using the UV coordinates, we can sample a simplex noise texture to get a height multiplier or a width multiplier. We can also add some terms for user control. Using simplex noise for the height makes sense because tall grass tends to clump together in real life.
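As a rough illustration of this step, the sketch below normalizes a blade position into UVs and turns a noise sample into a height multiplier. It assumes the UVs should span 0 to 1 across the grass rectangle; area_uv, height_multiplier, base_height and variation are hypothetical names standing in for whatever user controls you expose.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Map a world-space (x, z) position inside the grass rectangle to [0, 1] UVs.
inline Vec2 area_uv(Vec2 pos_xz, Vec2 center_xz, Vec2 size) {
    assert(size.x > 0.0f && size.y > 0.0f);
    const Vec2 bottom_left{center_xz.x - size.x * 0.5f,
                           center_xz.y - size.y * 0.5f};
    return {(pos_xz.x - bottom_left.x) / size.x,
            (pos_xz.y - bottom_left.y) / size.y};
}

// Turn a noise sample in [0, 1] into a height multiplier.
// base_height and variation are hypothetical user-control terms.
inline float height_multiplier(float noise, float base_height, float variation) {
    return base_height + noise * variation;
}
```

With base_height = 1 and variation = 2, for example, blades range from 1x to 3x the mesh height depending on the noise value under them.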
Bend Term
This term defines how bendable a grass blade is. It is used later as a multiplier in the vertex shader to do the animation.
Animation Term
This term is used by the animation function to animate the grass. It's the same for all the vertices of a blade, so calculating it here saves us time and lets us avoid passing UVs to the vertex shader. That is only two floats, but for millions of blades it adds up to tens or even hundreds of megabytes. It's calculated as follows:
float wind(vec2 uv, float time, float base_freq,
           float freq_scale, float strength) {
    float noise_factor = length(pcg2d(uvec2(uv * 104234.f)));
    float freq = base_freq + sin(time) * freq_scale;
    vec2 uv_displaced = uv + strength;
    vec2 uv_scaled = uv_displaced * freq;
    float sin_term = uv_scaled.x + uv_scaled.y + noise_factor;
    return sin_term;
}
In the vertex shader this value is used as a parameter for the sin function, which results in a kind of wind-like wave. You can see for yourself here in Shadertoy.
It's called sin_term because I'll pass it to the sin function later in the vertex shader.
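To show how the per-blade phase and the per-vertex sin fit together, here is a small CPU sketch in C++. The shader's pcg2d is not shown in this post, so the hash below is a PCG-style stand-in that I assume returns values normalized to [0, 1]; wind_phase and sway are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// PCG-style 2D integer hash, used here as a stand-in for the shader's pcg2d.
inline void pcg2d(uint32_t& x, uint32_t& y) {
    x = x * 1664525u + 1013904223u;
    y = y * 1664525u + 1013904223u;
    x += y * 1664525u;
    y += x * 1664525u;
    x ^= x >> 16u;
    y ^= y >> 16u;
    x += y * 1664525u;
    y += x * 1664525u;
    x ^= x >> 16u;
    y ^= y >> 16u;
}

// Per-blade phase: computed once in the compute pass, fed to sin() per vertex.
inline float wind_phase(float u, float v, float time, float base_freq,
                        float freq_scale, float strength) {
    assert(u >= 0.0f && v >= 0.0f);  // negative values cannot be cast to uint
    uint32_t hx = uint32_t(u * 104234.0f), hy = uint32_t(v * 104234.0f);
    pcg2d(hx, hy);
    const float inv = 1.0f / 4294967296.0f;       // normalize hash to [0, 1]
    const float nx = float(hx) * inv, ny = float(hy) * inv;
    const float noise_factor = std::sqrt(nx * nx + ny * ny);
    const float freq = base_freq + std::sin(time) * freq_scale;
    return (u + strength) * freq + (v + strength) * freq + noise_factor;
}

// Per-vertex sway in [-1, 1]: the phase scrolls with time and vertex height.
inline float sway(float phase, float time, float bend_factor, float vertex_y) {
    return std::sin(phase + time + bend_factor * vertex_y);
}
```

The normalization matters: if the noise factor were left as a raw 32-bit hash, it would swamp the UV and time terms in float precision and the wave would not move.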
Pitch
Defines the angle of rotation around the UP axis, which is always {0, 1, 0} in our case. The grass will always point upwards, so all we need is an angle to construct a rotation matrix in the vertex shader. You can make this random, always face the camera, or be controlled by the user, whatever suits your needs.
Frustum culling
For the frustum culling, we start by generating a bounding sphere around each blade. The radius of the sphere is determined by the max of the height and width. Then we transform the sphere center to camera space. At this point the distance from the camera is just the length of the center vector, since in camera space the camera is at (0, 0, 0). We use this distance to apply the cutoff and choose the LOD.
float radius = blade_height >= blade_width ? blade_height : blade_width;
vec4 center = vec4(pos, 1.0);
center.x += (blade_width / 2);
center.y += (blade_height / 2);
center = pc.per_frame.view * center;
// The camera sits at the origin in view space, so the length is the distance.
float dist_from_cam = length(center.xyz);
const float cutoff_dist = 800;
const float low_lod_dist = 200;
bool visible = (dist_from_cam < cutoff_dist);
bool low_lod = (dist_from_cam > low_lod_dist);
Then, for the frustum culling itself, we are more or less doing the projection by hand and then checking whether the sphere is in range. I only cull on the x and y axes, since for the z axis we already have a cutoff, but adding that is also trivial. I learned this way of culling on Arseny Kapoulkine's stream; they explain it much better, but basically we extract the left or right plane (we need just one of them) and the top or bottom plane. Then on the GPU we calculate the dot product between the sphere center and the planes, taking the abs of the x component of the sphere center to cull on both sides at the same time by exploiting the frustum's symmetry.
visible = visible &&
    center.z * frustum[1] + abs(center.x) * frustum[0] < radius;
visible = visible &&
    center.z * frustum[3] + abs(center.y) * frustum[2] < radius;
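For readers who want to see where the four frustum values could come from, here is a CPU sketch in C++. It assumes a symmetric perspective frustum and a view space with the camera at the origin looking down -z, and it builds the coefficients from the field of view instead of extracting them from the projection matrix. The mapping frustum[0] = x_cos, frustum[1] = x_sin, frustum[2] = y_cos, frustum[3] = y_sin is my assumption, not something taken from the shader.

```cpp
#include <cassert>
#include <cmath>

struct Frustum { float x_cos, x_sin, y_cos, y_sin; };

// Build the side-plane coefficients of a symmetric perspective frustum.
// Assumes view space with the camera at the origin looking down -z.
inline Frustum make_frustum(float fov_y_radians, float aspect) {
    assert(fov_y_radians > 0.0f && aspect > 0.0f);
    const float half_y = fov_y_radians * 0.5f;
    const float half_x = std::atan(std::tan(half_y) * aspect);
    return {std::cos(half_x), std::sin(half_x),
            std::cos(half_y), std::sin(half_y)};
}

// Sphere test against the side planes only. abs() folds the left plane onto
// the right one and the bottom plane onto the top one, so two tests cover four planes.
inline bool sphere_visible(const Frustum& f, float cx, float cy, float cz,
                           float radius) {
    const bool in_x = cz * f.x_sin + std::fabs(cx) * f.x_cos < radius;
    const bool in_y = cz * f.y_sin + std::fabs(cy) * f.y_cos < radius;
    return in_x && in_y;
}
```

Each expression is the signed distance from the sphere center to a plane through the camera, so a sphere is kept whenever it is inside the plane or sticks out by less than its radius.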
After that, we write the blade data at the corresponding index in the blades data buffer:
uint buf_index = pc.blades_number.x * i.y + i.x;
pc.grass.data[buf_index].pos_bend.xyz = pos;
pc.grass.data[buf_index].pos_bend.w = bend_factor;
pc.grass.data[buf_index].anim_size.x = blade_width;
pc.grass.data[buf_index].anim_size.y = blade_height;
pc.grass.data[buf_index].anim_size.z = pitch;
pc.grass.data[buf_index].anim_size.w = sin_term;
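To make the buffer layout and indexing concrete, here is a small C++ mirror of the blade record with the row-major index and the resulting buffer size. The struct matches the two vec4 fields used above; blade_index and blade_buffer_bytes are illustrative helper names, and the grid dimensions are whatever you dispatch the compute shader with.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Vec4 { float x, y, z, w; };

// CPU mirror of the shader-side blade record: two vec4s, 32 bytes.
struct GrassBlade {
    Vec4 pos_bend;   // xyz = position, w = bend factor
    Vec4 anim_size;  // x = width, y = height, z = pitch, w = sin_term
};
static_assert(sizeof(GrassBlade) == 32, "two tightly packed vec4s");

// Row-major slot of invocation (ix, iy) in a grid that is blades_x wide.
inline uint32_t blade_index(uint32_t ix, uint32_t iy, uint32_t blades_x) {
    assert(ix < blades_x);
    return blades_x * iy + ix;
}

// Bytes needed to store every blade of a blades_x by blades_y grid.
inline std::size_t blade_buffer_bytes(uint32_t blades_x, uint32_t blades_y) {
    return std::size_t(blades_x) * blades_y * sizeof(GrassBlade);
}
```

At 32 bytes per blade, a million blades take 32 MB, which is why keeping the record down to two vec4s is worth the effort.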
Draw Command Buffers
After generating the data we can fill the command buffer. Each thread atomically increments the instance count in the commands buffer. The value returned by the atomic add then serves as an index into another buffer that stores the indices of the visible blades, which lets us access the values using gl_InstanceIndex. We could also copy the buffer and compact it using a prefix sum scan, like Acerola does in his video, but this approach is better memory-wise and probably performance-wise too, although I didn't measure performance. In summary, the vertex shader uses gl_InstanceIndex to index into a buffer that contains the indices of the visible grass blades.
The compute shader uses two indirect draw commands, one for high LOD and the other for low LOD (we can add as many LOD levels as we want). We check whether the blade is low LOD or high LOD and then increment instance_count in the respective command.
bool low_lod = dist_from_cam > low_lod_dist;
if (visible) {
    uint cmd_index = uint(low_lod);
    uint index_in_visible = atomicAdd(
        pc.cmds.data[cmd_index].instance_count,
        1
    );
}
After that, we use index_in_visible to index into the visible blade indices buffer and store the index of the current grass blade.
DrawIndices indices = low_lod ? pc.visible_low_lod : pc.visible;
indices.i[index_in_visible] = buf_index;
Now we have our indices in a contiguous buffer that we can index into using gl_InstanceIndex, with gl_DrawID determining which indices buffer to read from. We're ready to draw the grass blades.
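The compaction idea can be sketched on the CPU with std::atomic, which plays the role of the shader's atomicAdd. This is a minimal illustration, not the shader itself: BladeInfo, VisibleLists and compact_visible are made-up names, and the loop runs on one thread even though the counter would tolerate many.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct BladeInfo { bool visible; bool low_lod; };

struct VisibleLists {
    uint32_t instance_count[2];        // what each indirect draw command would carry
    std::vector<uint32_t> indices[2];  // blade indices per LOD: [0] high, [1] low
};

// For every visible blade, reserve a slot in its LOD's list with fetch_add
// and store the blade's index there.
inline VisibleLists compact_visible(const std::vector<BladeInfo>& blades) {
    std::atomic<uint32_t> counters[2];
    counters[0] = 0u;
    counters[1] = 0u;

    VisibleLists out;
    out.indices[0].resize(blades.size());  // worst case: everything visible
    out.indices[1].resize(blades.size());

    for (uint32_t i = 0; i < blades.size(); ++i) {
        if (!blades[i].visible) continue;
        const uint32_t lod = blades[i].low_lod ? 1u : 0u;
        const uint32_t slot = counters[lod].fetch_add(1u);  // returns the old value
        assert(slot < out.indices[lod].size());
        out.indices[lod][slot] = i;
    }

    for (int k = 0; k < 2; ++k) {
        out.instance_count[k] = counters[k].load();
        out.indices[k].resize(out.instance_count[k]);
    }
    return out;
}
```

On the GPU the order of slots within a list is not deterministic, since invocations race for the counter, but that does not matter here because every blade is drawn independently.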
Rendering
In the vertex shader we start by pulling the blade data for the current instance.
DrawIndices visible = gl_DrawID == 1 ? pc.visible_low_lod : pc.visible;
uint i_visible = visible.i[gl_InstanceIndex];
GrassBlade blade = pc.grass.data[i_visible];
After that, we construct a rotation matrix and apply the height and width multipliers.
float sin_pitch = sin(blade.anim_size.z);
float cos_pitch = cos(blade.anim_size.z);
float height_multiplier = blade.anim_size.y;
float width_multiplier = blade.anim_size.x;
mat3 rotation = {
{cos_pitch, 0, -sin_pitch},
{0, 1, 0},
{sin_pitch, 0, cos_pitch},
};
vec3 v_pos = rotation *
vec3(v.x * width_multiplier, v.y * height_multiplier, v.z);
Then we use the sin term to animate the grass blade with a sin function that we scroll with time and with the height of the vertex, because naturally the tip of the grass blade sways more than the base.
float sin_term = blade.anim_size.w;
float bend = sin(sin_term + pc.time + (blade.pos_bend.w * pos.y));
pos.z += bend * pos.y;
Here is how it looks:
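The scale, rotate and bend steps of the vertex shader can be sketched on the CPU like this. The rotation matches the shader's matrix when its braces are read as columns, which is how GLSL treats them; place_vertex and apply_bend are illustrative names, and the sketch leaves out the blade's world position and the final projection.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Scale a blade vertex by the width and height multipliers, then rotate it
// about the up axis (0, 1, 0) by `pitch` radians.
inline Vec3 place_vertex(Vec3 v, float width, float height, float pitch) {
    assert(width > 0.0f && height > 0.0f);
    const float s = std::sin(pitch), c = std::cos(pitch);
    const float x = v.x * width, y = v.y * height, z = v.z;
    return {c * x + s * z, y, -s * x + c * z};
}

// Push the vertex along z. The offset is multiplied by the vertex height,
// so the base (y = 0) stays put and the tip moves the most.
inline Vec3 apply_bend(Vec3 p, float sin_term, float time, float bend_factor) {
    const float bend = std::sin(sin_term + time + bend_factor * p.y);
    p.z += bend * p.y;
    return p;
}
```

Calling apply_bend with an increasing time value on a tip vertex shows the sway, while a base vertex returns unchanged.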
For the color, I opted for a simple gradient that gets brighter as it gets higher. I plan to improve this, for example by using normals for the grass and adding some specular lighting, as it could look really nice, like Ghost of Tsushima's grass.
Optimization
The simplest thing I thought of was just to reduce the amount of work the vertex shader does, since it will run millions of times. A low-hanging fruit was multiplying the projection and view matrices on the CPU and having the result ready for the vertex shader. The next thing we can do is optimize the grass blade mesh. I used meshoptimizer by Arseny Kapoulkine and did several optimizations; the one with the most impact was converting the grass blade from a triangle list to a triangle strip. That reduced the number of vertex shader invocations drastically and almost cut the vertex shader work in half, and the shape of a grass blade can be represented well as a strip.
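A quick way to see why the strip helps is to count the vertices submitted per blade. The sketch below assumes a non-indexed triangle list as the baseline; with an indexed list and a warm vertex cache the gap is smaller, which may be why the measured saving is closer to half. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Vertices submitted for a blade made of `triangles` triangles.
inline uint32_t list_vertices(uint32_t triangles) {
    return 3u * triangles;   // non-indexed list: three vertices per triangle
}

inline uint32_t strip_vertices(uint32_t triangles) {
    assert(triangles > 0u);
    return triangles + 2u;   // strip: three for the first, one per extra triangle
}
```

For a blade built from 7 triangles that is 21 vertices as a list versus 9 as a strip, and the difference is multiplied by every visible instance.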
Very rough numbers
Note: I got the numbers using Vulkan's timestamp queries.
On my RX 5600 XT using the open source drivers (RADV) on Linux at 1080p resolution, my GPU can process about 6'770'688 grass blades. With all of them visible, the compute shader takes about 4.7ms and drawing them takes around 8ms. That's about the limit; any more than that and it drops below 60fps. Increasing the area covered by the grass to 1000 by 1000, we can consider up to 19'066'880 grass blades, with the compute shader taking about 2.5ms and drawing them taking about 6ms.