## Saturday, December 5, 2015

### Real-time Raytracing part 3.1

In part 3, I've shown some examples on how to tune algorithms on the GPU. Here I would like to address how we can apply those rules to optimize path tracing. As usual, I will show code samples for CUDA, but this doesn't mean it can't be applied on any other graphics programming language.

Fortunately for us, there is an opensource framework for optimized GPU based ray traversal. The framework is from the research paper Understanding the Efficiency of Ray Traversal on GPUs. In 2012 they added Kepler and Fermi Addendum, which added some changes to the framework to optimize traversal for those architectures. In this post I will be looking at the version for the kepler architecture. The framework is quite heavy on magic numbers and hard to read code. In this post I'm going to dissect some of the more important parts and how they improve performance.

Data structures

The first thing to notice is what kind of spatial data structure they are using. From the name of the class: "SplitBVHBuilder" you can already see that the builder is based on this paper. It comes down to a simple binned BVH builder, including triangle splits to optimize traversal. For more information on BVH techniques, you can read part 2 of this series.

More important for this post is how you store this BVH on the GPU and access it without creating any memory latency. If we look at the ray traversal, the following code shows how the nodes are read from memory:

    const float4 n0xy = tex1Dfetch(t_nodesA, nodeAddr + 0); // (c0.lo.x, c0.hi.x, c0.lo.y, c0.hi.y)
const float4 n1xy = tex1Dfetch(t_nodesA, nodeAddr + 1); // (c1.lo.x, c1.hi.x, c1.lo.y, c1.hi.y)
const float4 nz   = tex1Dfetch(t_nodesA, nodeAddr + 2); // (c0.lo.z, c0.hi.z, c1.lo.z, c1.hi.z)
float4 tmp  = tex1Dfetch(t_nodesA, nodeAddr + 3); // child_index0, child_index1
int2  cnodes= *(int2*)&tmp;


Recalling from part 3, I showed you a similar statement for memory alignment using the fact that arrays in global memory didn't have data fragmentation. Reading memory from a texture using this principle is more natural, because textures ensure no data fragmentation. As seen in the comments above, the information gathered contains two BVH bounding boxes and two child indices. Storing the data this way ensures you have four 128 bit reads and one 64 bit read, since the last two values are unused.

The other data structure I would like to point out is the way they store triangles. The nodeaddr variable can store either positive or negative values. The positive indicate an index in the BVH and negative values store indices to the triangle array. The actual index is found by using the bit negate function ($\sim$) on the index. From this address they fetch the triangle using similar code as above:

    // Tris in TEX (good to fetch as a single batch)
const float4 v00 = tex1Dfetch(t_trisA, triAddr + 0);
const float4 v11 = tex1Dfetch(t_trisA, triAddr + 1);
const float4 v22 = tex1Dfetch(t_trisA, triAddr + 2);


In which you can store the data required for a ray-triangle test and execute it. The thing to note is: BVHs can store more than one triangle per node, especially SBVHs. In order to store this efficiently in one array, they inserted breaks in the array. The code for checking all triangles:

for (int triAddr = ~leafAddr;; triAddr += 3)
{
// Read triangle data (see above)

// End marker (negative zero) => all triangles processed.
if (__float_as_int(v00.x) == 0x80000000)
break;

// Ray-triangle intersection (see part 1)
}


The for loop determines the triangle address and loops endlessly increasing the triangle address. If the first address contains a stop value (they chose for negative zero, but any distinguished value will do), stop the for loop. Simple as that, the array can now contain more than one triangle per node.

Traversal algorithm

The traversal algorithm is as follows: beginning this algorithm assumes the ray hit the BVH containing the model we wish to test for. If it didn't, it's not that bad, after two ray-box intersections the algorithm stops.

• Starting from the first node address, we read the first two child bounding boxes using the code above
• Using the optimized intersection test for the GPU from the paper, we test ray-box intersections.
• If the ray intersects one of them, we change the node address and continue the algorithm.
• If we intersect both children, store one of the children on the stack and continue traversing the other.
• If we intersect none, pull a child index from the stack. If the stack is empty, stop the algorithm.
While this algorithm is pretty easy to implement, optimizing it for the GPU is quite a hassle. In order to minimize divergence, they thought of two neat tricks to alter this algorithm. The first trick is persistent threads. This concept originated from the following thought: A warp will execute until all threads inside it are done processing. In a path tracer like ours this can lead to horrific situations: 31 threads waiting for just the one to be done tracing.

The idea to combat this is quite simple: if there are less than X threads active, fetch new work for the idle threads. Using older graphics hardware, the implementation would take some time for all threads to communicate their status using shuffle functions. Since kepler architecture there are some warp voting functions introduced. From the CUDA toolkit:
__all(predicate):
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.
__any(predicate):
Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.
__ballot(predicate):
Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

Which comes in very useful for this task. The code that remains to execute this test:
if( __popc(__ballot(true)) < DYNAMIC_FETCH_THRESHOLD )
break;

Where __popc is a population count, counting the number of bits equal to 1 from the provided value.

The other trick in the algorithm to minimize divergence is part of triangle checking. It comes down to timing: when to check for triangles. In theory, the threads in the warp could be divergent: one thread checking ray-triangle intersections and the others traversing the BVH. While this is not completely preventable, we could help the algorithm postpone checking triangles until all threads have at least one triangle to check.

By creating an array (or in this case a variable) to store a triangle address, we can continue traversing the BVH until other threads have also found a triangle. The code in the source uses a direct compiled statement to check this, since it wouldn't be compiled in an optimal way. The code that you would need to check if all threads have a triangle is simple:
    if(!__any(leafAddr >= 0))
break;
Where leafAddr indicates a triangle address. The statement simply checks if there is a thread for which the address is larger than or equal to 0. If there are none, it means all threads have found a triangle to check and we can stop traversing the BVH.

I hope to have cleared up some of the magic parts of the source code. Of course, if you have any more questions about it, feel free to post them below.