Between tiled and clustered lighting methods, I finally chose clustered lighting for our project. The detailed reasons will be described in later articles.
Lights vs Clusters
The first step of a clustered lighting algorithm is separating the camera frustum into clusters. How to separate the frustum actually affects two things: the efficiency of computing and the memory consumption because we need to store the light list of every cluster.
After projection, the camera frustum is actually the screen. So in the X and Y direction, we separate the frustum into screen aligned square tiles just like the tiled lighting. As to separation in the Z direction, it’s related to how we will do the light-cluster intersection. In tiled lighting, the tile size is fixed, but we can’t make any range assumption in Z direction because the depth may be varying from near clip to far clip in any value.
No matter how you split the frustum with clusters, the exact shape of a cluster is a frustum. And for punctual lights, spot light has a spot shape and point light has a sphere shape. So we are doing frustum-cone and frustum-sphere intersections. So do accurate intersection or approximate it? On accurate frustum-sphere intersection can be done with separate plane intersections, involves 6 plane-sphere intersections per cluster. When it comes to frustum-spot intersection, the computation becomes even more complex. If we do approximation, we can approximate the cluster frustum with some simpler shapes, AABB or sphere. For a arbitrarily rotated 3d frustum with arbitrary extent, AABB is much better under best case and sphere is much better under worst case. But if the frustum is a cube, AABB is slightly better under best case and sphere is still much better under worst case. Sphere bounding is more stable under this situation, So we try to separate clusters as cubic as possible and use a sphere bounding to do the final intersection . Now it’s finally a sphere-cone and sphere-sphere intersection problem. Sphere-sphere is quite simple. To intersect a sphere against a cone, we enlarge the cone by the size of the radius of the sphere. Then the intersection problem is transformed into a “Is a point inside a cone” problem which is very easy to do (Figure 1).
To make the sphere bound of one cluster as close as possible to the cluster shape, we should make the cluster as cubic as possible. In the X and Y directions, the clusters are aligned to screen space tiles, which means in the world/view space, the further the cluster from the near plane, the bigger a cluster’s size is. So we divide the frustum in the Z direction with an exponential factor. (Figure 2)
Store the light-cluster information
After the calculation is done, we need to store the results in memory for later shading usage. The granularity of a cluster is smaller than a tile, and amount of the clusters is also much larger than tiles. A smaller granularity results in a more accurate intersecting result, but the amount of light list of every cluster may vary a lot. The large amount of clusters also needs more storage space for lights lists. So a start offset index buffer with a light link list buffer is a pretty flexible solution for this.
The start offset buffer is a 1D array which is the same size as the cluster count. For every cluster, it has a light list stored in the light link list buffer. The value in the start offset buffer is the index of the first element of the light list corresponding to this cluster. If there are no lights intersecting with this cluster, its value is 0XFFFFFFFF. After finding the first index of a light list, the value contains two parts: the light index value and the next index value of the next node in the list. These two values are packed into a 32-bit uint because they are both positive integers. The light index value is the index of the actual lights rendering data in the lights rendering data compute buffer or constant buffer depending on your choice.
We know that on GPU, hundreds or thousands of works are done simultaneously. Synchronization is a hard job. So we make use of atomic operations to confirm that the right value of the index was written to the start offset index buffer. So the start offset buffer is a byte address buffer. The light link list buffer is a counter buffer so we can get the actual index number that’s currently in use. Once a compute thread find an intersection between a light and a cluster, we can get its position in the start offset index buffer by its thread ID. Then we increment the light link list buffer to get the element we are going to write the light index and the linked node index into. The incremented index is written into the start offset index buffer by use InterlockedExchange() and output the last written index in the start offset index buffer. Now we’re gonna pack the light index of the intersected light and the last written index and store it in the light link list buffer.
void LightCulling(uint3 globalIdx : SV_DispatchThreadID)
float4 centerRadius = CalculateClutserBound(globalIdx);
uint index = GetClusterIndex(globalIdx);
for(uint i = 0; i < _PunctualLightsCount; i++)
uint currentOffset = _LightLinkListBuffer.IncrementCounter();
_StartOffsetIndexBuffer.InterlockedExchange(index, currentOffset, prevOffset);
uint lightLinkNode = (i << 24) & (prevOffset & 0xFFFFFF);
_LightLinkListBuffer[currentOffset] = lightLinkNode;
Let’s give it an example. Assuming the screen resolution is 1920 x 1080, we separate the camera frustum into 32 px aligned tiles and 128 slices in the Z direction. Then we got 261,120(60 x 34 x 128) clusters totally. Quite a lot, huh? Then the light fetching process is just like the image below (Figure 3).
During final shading, for every pixel, we need to fetch the light list to its corresponding cluster id which is easy to get from screen space position and depth no matter forward or deferred. Loop the light list and do whatever you want with the corresponding light data in the light data buffer.
Just like it’s described in the last section, for a 1080p resolution which is commonly used in current gen, the cluster count is 261,120. The start offset index buffer is the same size. Every element of the start offset buffer is a 32-bit UINT. So the final buffer size is nearly 1MB.
What about the light link list buffer? The actual size we allocate for the light list buffer heavily depends on the lights count in the game scene, to be more accurate, the visible lights count during every frame. To make the reallocation of the buffer as few as possible, it may be the best to always deal with the worst case – allocating a light list of the amount of all the lights in current scene for every cluster. This means that the light link list buffer’s size is lights-count times the size of the start offset index buffer.
If we have 256 lights in the current scene, giving the worst case, the light list buffer must have 256 lights list for every cluster. The final buffer size will be nearly 256MB. It’s a pretty big size for a single buffer. However, the worst case can barely happen in the actual gameplay. The worst case means that: first, inside the current view, every light in the scene is visible and, second, for every cluster in the frustum, it’s intersected with all the lights. The first condition barely happens because of all kinds of culling methods in game engines, frustum culling, occlusion culling, etc. The second can only happen when you have a lot of lights that have a very large extent. So the actual size of the light link list buffer heavily depends on the lighting condition. Walk through your scene, find the maximum visible lights count and allocate your buffer with that size is feasible. Giving the second condition of the worst case, the size can be even smaller because the light link list buffer is only incremented if there is the light pass the intersection. So the storage is tightly aligned and no useless values in between each node. But this needs more test because the intersecting result varies a lot according to camera view changes.
Data precision issues
Still assuming a 1080p resolution, 32-pixel tile size
If the total lights count is much larger than 64, you maybe should rearrange the bits for the two index parts. If the 32 bits cannot satisfy your lighting complexity, then you have to go with 64-bit data types, and of course, the memory consumption doubles.
For now, 8-bit light index and 24-bit next node index works fine with our project.
Extends lights to other primitives
Other data that can be represented by easy-to-intersect shapes can also be added to this culling process on GPU. For example, decals, reflection probes, etc. As to the storage, a specific index link list for every kind of data can be easy to implement but memory consumption is data type count multiplied. Or we can still separate the node bits into more parts but the logical complexity increases.
-  https://bartwronski.com/2017/04/13/cull-that-cone/
-  Assigning arbitrarily shaped convex light volumes using conservative rasterization
-  Clustered Deferred and Forward Shading