Saturday, November 21, 2015

Real-time Raytracing part 3

In part 2, I showed that you can speed up your ray tracing algorithm by means of a spatial data structure. In this post I would like to address optimizing GPU code using CUDA. Below are some general tips to boost parallel performance. Be aware that these points are not always relevant; I will explain why, and under which conditions they can optimize your algorithm:
  • Try to avoid divergent statements and recursive algorithms. 
  • Keep track of memory alignment for reads and writes of large datasets. 

Divergence

Why is this so important? Most GPU performance issues can be attributed to divergence. To explain divergence, I'm going to show a simple example. Say we have an array in which we want to multiply the numbers at even positions by 2, and divide the numbers at uneven positions by 2. A simple parallel implementation, using a grid-stride loop, gives us the following:
__device__ void divergent(int* array, int nNumbers)
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;

  // Grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th element
  for (int i = idx; i < nNumbers; i += blockDim.x * gridDim.x)
  {
    if (i % 2 == 0)   // even position
      array[i] *= 2;
    else              // uneven position
      array[i] /= 2;
  }
}
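
For reference, this is roughly how such a grid-stride function could be launched from the host. The wrapper kernel and the launch configuration below are my own assumptions, sketched for completeness rather than taken from the original code:
// Hypothetical wrapper kernel and host-side launch (a sketch, not from this post)
__global__ void divergentKernel(int* array, int nNumbers)
{
  divergent(array, nNumbers); // call the __device__ function above
}

void runDivergent(int* d_array, int nNumbers)
{
  int threadsPerBlock = 256; // 8 warps per block (assumed)
  int blocks = 64;           // arbitrary; the grid-stride loop covers the rest
  divergentKernel<<<blocks, threadsPerBlock>>>(d_array, nNumbers);
  cudaDeviceSynchronize();
}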

CUDA can run thousands of threads in parallel, but it schedules them in groups of 32 threads that execute in lockstep, called warps. These warps also exist in other GPU languages like OpenCL and compute shaders, where they are usually referred to as wavefronts. Here is, step by step, what happens while executing one warp of the code above:

  1. The 32 threads each calculate their respective idx.
  2. The starting point of the for loop is the same as idx.
  3. The first conditional statement: the 16 threads for which the condition is true (the even positions) execute the multiplication, while the other 16 threads are idling.
  4. The second conditional statement: the same as the first, only the other way around.
  5. Increase the loop index, and repeat from 3 until the end of the loop.

In total we've just executed the same number of statements as a serial variant of this algorithm would, but the serial variant has no idle time. If you sum the time spent by all threads, the parallel variant actually costs more total execution time than the serial one, simply because half of the threads are idling all the time!

This is where the term divergence comes from: divergent branches block full parallel execution within warps or wavefronts. In the example above we have at most 50% divergence, meaning half of the threads follow a different execution path than the other half. So, how do we get the algorithm above to execute in parallel without divergence? Fairly simple:
__device__ void nonDivergent(int* array, int nNumbers)
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;

  idx *= 2; // all even positions
  for (int i = idx; i < nNumbers; i += blockDim.x * gridDim.x * 2)
    array[i] *= 2;

  idx++; // all uneven positions
  for (int i = idx; i < nNumbers; i += blockDim.x * gridDim.x * 2)
    array[i] /= 2;
}

By removing the divergent statements, we get code that leaves no thread idling. Of course, this was a simple example, and many branching statements are impossible to eliminate this way.

Recapping point 1: try to avoid divergent statements and recursive algorithms. It's incredibly easy to write a recursive call that causes divergence in the parallel execution. Keep in mind that the warning about recursion is only a rule of thumb: if you can guarantee non-divergent recursive calls, there is nothing wrong with recursion!
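
To make the recursion point more concrete: when a recursive traversal does diverge, GPU ray tracers typically rewrite it as a loop with an explicit per-thread stack, so every thread runs the same loop structure and only the data differs. Below is a minimal sketch of that pattern; the Node layout and function are my own illustration, not code from this series:
// Sketch of an iterative traversal with an explicit stack (illustrative only)
struct Node { int left; int right; bool isLeaf; };

__device__ void traverseIterative(const Node* nodes, int root)
{
  int stack[64];           // explicit stack replaces the recursive call stack
  int stackPtr = 0;
  stack[stackPtr++] = root;

  while (stackPtr > 0)     // every thread executes the same loop structure
  {
    Node node = nodes[stack[--stackPtr]];
    if (node.isLeaf)
      continue;            // per-leaf work (e.g. an intersection test) goes here
    stack[stackPtr++] = node.left;
    stack[stackPtr++] = node.right;
  }
}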

Memory alignment

To explain memory alignment, I'm going to use an example from ray tracing. In order to optimize my path tracer, I had to split the GPU code into different kernels; I will detail that process in the next post. For now, we have to save the ray state in a data structure that is as small as possible, to avoid high memory latency. This is what I came up with:
struct RayState
{
  Ray ray;         // contains position and direction (both 3D vectors)
  float minDist;   // distance to the closest intersection found so far
  int triangleID;  // index of the intersected triangle
};

The total size of this structure is a simple sum: the number of floats and integers times their size in bytes. I'm assuming that an integer and a float have the same size, because on most platforms (and on the GPU) they do: 4 bytes, or 32 bits. In total we have 8 such values, so the structure is 32 bytes, or 256 bits.
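
A quick compile-time check of these assumptions (the static_asserts below are my addition, and they presume the Ray struct holds six tightly packed floats):
// Sketch: verify the assumed sizes at compile time
static_assert(sizeof(float) == 4, "float is assumed to be 4 bytes");
static_assert(sizeof(int) == 4, "int is assumed to be 4 bytes");
static_assert(sizeof(RayState) == 32, "8 values of 4 bytes = 32 bytes (256 bits)");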

If you use this structure to load and store data, there is a small problem: it will be read 4 bytes at a time, since we did not tell CUDA there is any alignment whatsoever. You can profile this behavior with Nsight, which shows the following in the memory transactions:

[Nsight screenshot: loads issued as 32-bit memory transactions]

Our choice has led CUDA to read at the lowest alignment, which is 32 bits. This can be a problem, if you look at this graph:

[Graph: memory access speed per transaction size]

This was measured on a Tesla C2050, which is a bit outdated by now, but by no means a bad statistic to look at. The graph shows a clear trend: the fewer bits read per transaction, the lower the memory access speed. This only holds when the number of threads per multiprocessor is low, which is the case in our optimized path tracer, but again, more on this in the next post.

This is where the magic happens. As described in the endless CUDA documentation (the CUDA programming guide), we can declare aligned memory on the device, which allows us to read our list of structures as an array of wider elements, as long as we know the memory size of a single element. We can read our structure as if it were an array of float4 elements (128 bits each). The following piece of code does just that:
// Interpret the ray buffer as a float4 array, and unpack into the structure
float4 r1 = ((float4*)rays)[rayidx * 2];
float4 r2 = ((float4*)rays)[rayidx * 2 + 1];
float3 raypos = make_float3(r1.x, r1.y, r1.z);
float3 raydir = make_float3(r1.w, r2.x, r2.y);
float2 data = make_float2(r2.z, r2.w); // minDist and the triangleID bits

I read two float4 elements from the array by converting the index. The trick is simple: we know that the total size of the structure is 256 bits, and we are reading 128-bit elements, so the index has to be multiplied by two to get the correct float4 index. A profiling session now shows the following data:

[Nsight screenshot: loads issued as 128-bit memory transactions]

This means the data is now read in as aligned 128-bit loads. Mission successful! Recapping point 2: keep track of memory alignment for reads and writes of large datasets. We've successfully read aligned data on the GPU. However, there is room for more memory optimization by using data locality. Since this has already become quite a long post, I will postpone that for another time.
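
For completeness, this is roughly what the aligned declaration and the matching write look like. The __align__(16) qualifier comes from the CUDA programming guide; the code below is my own sketch, reusing the variable names from the read fragment above, not the final path tracer code:
// Sketch: 16-byte alignment lets the compiler issue 128-bit transactions
struct __align__(16) RayState
{
  Ray ray;         // position and direction, six floats
  float minDist;
  int triangleID;
};

// Writing the ray state back as two 128-bit stores, mirroring the read above
float4* out = (float4*)rays;
out[rayidx * 2]     = make_float4(raypos.x, raypos.y, raypos.z, raydir.x);
out[rayidx * 2 + 1] = make_float4(raydir.y, raydir.z, minDist, __int_as_float(triangleID));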

Sources

CUDA programming guide

http://on-demand.gputechconf.com/gtc-xpress/2011/presentations/cuda_webinars_GlobalMemory.pdf