Level-of-Detail Independent Voxel-Based Surface Approximations was the subject of my master thesis. I wrote a small dissemination that explains the basics of my thesis on this page.

This image shows the final result of my thesis work. The models above are voxel models with 4096 (2^12) voxels in every axis. If they were all filled, I would have to store 4096^3 = 68719476736 voxels in total. There has been a lot of research into compressing the huge amount of data this requires, I mentioned some examples on the thesis page.

Using a Sparse Voxel Octree (SVO) storing scalar field values, the six models above can be stored in 12GB of memory total. Using my multiresolution method we can store visually comparable models in only 2GB of memory total.

Here is a small video showing the current state of the voxel path tracer:

For my master thesis, I've been working on a project for some time now. In this post I'll share a few debug screenshots of the current state of the project.

The project can load triangle meshes of (almost) any format, and convert them to large resolution sparse voxel octrees (SVO). In the picture above, you can see the voxelized Lucy model, which is about 500MB in triangle format.

Converted to an SVO, storing only which voxels are on or off, we can store data for up to 4096^3 voxels in only ~40MB. That's a boolean for ~68 billion voxels. But since we only store voxels that are on, this number changes drastically, thus we are able to compress this pretty well.

If you want to store more data for every voxel, the amount of memory is going to skyrocket. In the example above, I store a scalar field for every voxel (that's 8 floating point numbers per voxel). All that data nets nearly 870MB for the highest resolution.

I was able to convert and draw a model up to 8192^3 resolution, including all simplifications of a dragon model:

Where you can see that the voxel resolution is even larger than the millions of triangles stored in the model.

For my master thesis I will be working on creating level of detail simplifications. The screenshots below show some results of a linear approximation calculated for a group of voxels:

The first model at 512^3 resolution, and the second at 2048^3.

These images are rendered using raymarching to find the intersection point of a ray and a scalar field. I wrote the raytracer in CUDA, which allows for an interactive framerate even with gigantic data sets by using parallelization on the GPU.

Let's talk game engines. For a current project, I'm writing a game engine completely from scratch. Before I started I wrote down some key functionalities the engine should be able to handle. After writing that down, the proces of creating the engine boils down to optimizing the functionality in terms of performance and memory requirements.

Before we dive into engine compositions, let's look at two big AAA game engines used to create games which are used by a large number of developers today.

Unreal Engine 4 is mostly known because of the impressive graphics that can be created within a few mouseclicks. The engine is open source, and is an all-round engine that supports nearly everything you can think of when creating a game. Below you will find some amazing pictures rendered using Unreal Engine 4.

Unity is an all-round game engine designed to ease the development process by working with C# as its core language. Unity is a lot more oriented towards an entity component system than Unreal Engine 4.

These engines took years to develop, and can practically support any use case. For my engine I'm looking to only support a very narrow subset of these use cases.

The thing both engines have in common is that they are both entity component systems, which is what the remainder of this post is all about. Engine compositions
Mainly, there are two compositions for game engines. The first one is Object Oriented Programming (OOP), and the second type is an Entity Component System (ECS). There are already a lot of resources available to learn about both. I compiled a small list of examples, and will revisit the topic briefly.

Object Oriented Programming
The main reason to use OOP for games, was that the concept is easy to grasp. You create a hierarchy for all objects, and code reuse is introduced by polymorphism. In the diagram below you can see an example of OOP used in a game engine.

Image from [1]

It's clear to see that the EvilTree can't fit in the hierarchy, because it would require inheritance from both static and dynamic entities. While this is possible in some languages, it can lead to difficulties known as the diamond problem.

Entity Component System
The entity component system is designed to solve the problem stated above. The structure above would look as follows in an ECS:

Image from [1]

Where you can see that all the objects are derived from different components. The components can be reused for any object, this makes it easy to add new entities.

The next step for Entity Component Systems
Now that we are up to speed with game engine compositions, let's look at the path forward.

Data Oriented Design (DOD)
This paradigm is the design principle behind the ECS system. The basic principle is instead of looking at code, look at data. This concept is derived from the fact that most applications are memory bound instead of processing power.

The main point for DOD in game engines focusses on using the ECS in a way that avoids cache misses. To illustrate this, we look at the table below.

Image from [3]

In this ECS from [3], we can see on the left all the entities in the game, and in the table all of the components they exist of. When viewed from a higher scale, the only thing we changed is instead of looking at the diagram from left to right, we now look at it from top to bottom. So in our code we define our components as a list of Position components, Movement components, and so on.

For every entity we add, we simply add one of each component to all component arrays. This is why there are holes in the diagram.

All the way at the top of this article, I said that the engine is a balance between performance and memory. We can now clearly see why this is the case: this layout consumes a lot of memory to increase the performance. The "Script only" object creates 5 empty components which will never be used.

The three principles
We can define three points to weigh all our engine needs.

Cache efficiency

Memory efficiency

Parallel efficiency

Now we have to define how to determine the efficiency of all of these points.

Measuring cache efficiency is tricky at best, and this is very dependent on optimization. So instead of a ratio, we define the following: An engine becomes more efficient if we require fewer arrays available at the same time, and the smaller the size of a single item in the array the better.

The ratio for memory efficiency is defined as the total used components divided by the total created components.

We can determine the parallel efficiency by looking at the percentage of work that can be executed in parallel. For a complete calculation of the speedup, we can look at Amdahl's law.

ECS revisited
Now that we have the three principles, we can discuss the efficiency of the ECS system. If we assume that we are using the ECS from [3]:

We can see that this would lead to a good cache efficiency, but not the best: A physics component can't operate without a position component, this means we need multiple arrays simultaneously when updating entities. But the big advantage here is that all the arrays are separated, and can be queried individually per component.

The memory efficiency is bad. The ratio for this particular example is already 19/30 = 63%. That means that 37% of our memory is just thrown away for the sake of efficiency.

The parallel efficiency leads to a tricky scenario: how can you execute ECS in parallel? In this particular case, we pretty much can't. The physics system updates the position component, and only after that, we can render the component using the new position.

In order to process everything in parallel, we'd have to look from left to right (per object) again, and processing the objects in parallel would require ALL data available in caches, which in turn would lead to a lot of cache misses..

In conclusion, we can say that we traded off space for time. We require more memory, but less processing time due to cache efficiency to process all our objects.

As for parallel efficiency, from a DOD composition such as this, we can run some tasks in parallel. If we look at the implementation from [2], every entity has a signature:

Image from [2]

In this case, the signature tells us that we require AI and Enemy components. In the implementation, we have systems that update all entities containing specific signatures. To create a simple parallel processing ECS, we can simply say that every set of systems that don't have any component in common can execute at the same time.

In our example, with a physics system and rendering system, this would not work, since they both rely on a position component.

A solution presented in Vittorio Romeo's presentation is to store all the components in one mega-array:

Image from [2]

While this could be the solution, I very much disliked the amount of work required to implement this, and keep up with all the overhead of adding and removing components.

Setting up the engine
In my engine I opted for an ideal solution, within the constraints of C++ (static types, known at compile time). With the power of C++ 11, we can reach a lot using variadic templates, and I built a lot of my implementation with it.

I based my system on the implementation of [2]. I use signatures to define a set of components, and I also define systems by providing a signature. But instead of storing components separately in one array per component type, I took a step back in 'engine progression'. I define one array for every possible signature.

If you read that carefully, you should already be thinking, that would require a lot of arrays! If we define up to 64 different components, and generate one array for every possible signature, we would have 1.8446744073e+19 arrays. Possibly requiring more memory than the complete application that we're building.

So instead of creating one array for all possible signatures, let's create one array for every signature used. The tricky part is finding all possible signatures at compile time. And we can't do that in C++, since we have no such library as reflection from C#, where we could query all that.

Text templates
We will use C# to create our signatures before the C++ compile time, so we know all the types at compile time. For this we will use something commonly used in web development: text templates. If it's your first time working with these in visual studio, I recommend the syntax highlight plugin.

So in text templates, we can write code that writes code, quite nifty. A small example that generates a list of all our components:

Where I define a function "GetComponents()" in a DLL CodeGeneratorFunctions, to search all our files for component signatures. I was pleasantly surprised with the runtime, as I thought it would take years to search all those files, but it actually took less than a couple of seconds.

Similar to this example, I wrote text templates for all signatures and systems.

The engine
Back to engine talk, now that we have our component composition, let's determine how efficient this engine could be in theory.

Cache efficiency. At first glance, the cache efficiency would suffer in comparison with the original ECS from [3], but it's hard to tell. We have a larger single item size in the array, but we only require one array simultaneously. Below I explain a small optimization that reduces the individual item size.

Our memory efficiency is great. Since we can basically store only the necessary components for every object, we waste 0% of memory, with a little overhead of creating a lot of arrays in larger systems.

The parallel efficiency is similar to the ordinary ECS. We can only execute systems in parallel when they contain unique component sets.

So in theory, we got an advantage in terms of memory. But we are still questioning if the cache efficiency changed. Looking back at the mega-array structure, we can see that we improved a little, we can store our velocity and position together in one array, so we don't have to acces two separate arrays and have cache misses.

Optimizations

The main disadvantage of this structure is when you use larger objects. If we have an object storing a lot of components, the item size of the array is going to be large. And a large item size means cache misses. We can make our engine more cache friendly by using hot/cold data separation.

As example, we consider objects storing data for physics (position, velocity) and rendering (huge models of 300MB). Loading only two of these objects, is a guaranteed cache miss, since the two objects are 300MB apart in the memory lane. The solution is to store a reference to the model, rather than storing it completely, and only call the model when it's required for rendering. This way we can update the physics without cache misses (a pointer is only 8 bytes).

The hot/cold data separation separates hot data (used multiple times) versus cold data (used sparsely). If you want to read more, I suggest reading the article on gameprogrammingpatterns about this topic.

To improve up on the parallel execution of the system, we can schedule all the work of one system in parallel. Since one system will be handling multiple arrays of objects, the first option is to execute all of these in parallel, and the second option is to execute all of the items in one array in parallel. Since the arrays have different lengths, the first option would be a bad choice (one thread would take very long, while the others would be waiting). Thus linearly processing all arrays, and execute all items inside that array is the best option. Further optimizations can be done by using SIMD instructions, but that is outside the scope for this article.

Future work
I'd like to see if I can increase parallel efficiency by introducing tech from Naughty Dog's engine. They have a great presentation about it available online. Instead of introducing parallel execution per system, I'd like to see if we can create a set of instructions per signature and create a fiber to execute that.

When I'm done creating the engine (which is never, because it's a game engine) I will compile some benchmarks compared to a 'normal' ECS implementation. I also hope to release some source code from this engine.

Update September 2016: CUDA 8.0 is available, Visual studio update 3 is supported. Update June 2016: CUDA 8.0 RC is available. Visual studio 2015 is supported, but update 2 is not yet included.

In this small post I will explain how you can use CUDA 7.5 in Visual Studio 2015. I don't claim to have full support for VS2015, merely using the VS2015 editor and compiling a VS2013 project. You will still need to have Visual Studio 2013 installed, with the CUDA toolkit extension.

On a separate note: Nsight does have support for VS2015, so no hacks required there!

What you need:

Any C++ project using CUDA.

CUDA 7.5 toolkit with VS2013 support installed (I only tested it with 7.5, but I can imagine other versions working as well)

In the project properties, make sure the project compiles for "v120", which is VS2013.

To actually load the project in VS2015 having support for compiling CUDA code, we have to copy some files. Visual Studio uses targets to include several extensions for loading project files. We simply have to copy the VS2013 support to the VS2015 folder:

The CUDA 7.5 extension files for VS2013 are located in:

In all of the previous posts, I've talked about optimizing ray tracing. I've mentioned path tracing, but never spoke of path tracing specific optimizations. This post will be a small survey of implementations and research about path tracing optimizations.

Foundations
Path tracing is great for simulating global illumination by tracing rays, and of course, these rays have to follow "realistic paths". As observed a long time ago (1760) by Johann Lambert, intensity of light from an ideal diffuse surface is directly proportional to the cosine of the angle $\theta$ between the direction of the incident light and the surface normal. This is called Lambert's cosine law in optics.

In path tracing, we mainly use the cosine law to calculate the contribution of light from the next ray. When tracing from a diffuse surface, the direction of the next ray is usually calculated by generating a random direction in the hemisphere of the surface. When tracing from a mirror, you can imagine there is only one direction on the hemisphere that is correct: the perfect reflection. Metal-like surfaces contain both elements, called glossy reflections. This image shows what I'm talking about in 2D:

If you would order them by calculation cost for the path tracing algorithm, the reflection would be the easiest, there is only one solution, one ray to calculate. The glossy one is harder to calculate, since you have to accumulate rays with different weights, but still simpler to fully calculate than the diffuse option. Besides these simple examples, materials can contain many functions to bend incoming rays. The collection of these functions describe the Bidirectional Reflectance Distribution Function (BRDF). Which in turn, you can even extend to allow for rays extending through the surface with a Bidirectional Scattering Distribution Function (BSDF).

Cosine weighted sampling

The intensity of light from an ideal diffuse surface is directly proportional to the cosine of the angle $\theta$ between the direction of the incident light and the surface normal.

This leads to an interesting optimization: cosine weighted sampling. Since the incoming and outgoing light is measured by a cosine, the outgoing rays nearly parallel to the surface don't add any (useful) information to the final solution. I could write this whole post about this optimization, but that would be a waste of time, since there are already many good examples out there. This post from Rory Driscoll shows results as well as the maths behind it.

This image shows the cosine weighted sampling on the top and the uniform sampling on the bottom. This was taken from the pathtracing blog

Sampling optimization
As mentioned above, when you've found a ray-triangle collision, the direction of the new ray is usually 'bruteforce' calculated by taking a random direction on the hemisphere. For diffuse surfaces, this is no problem, since every outgoing direction weights the same. If you would apply the same to perfectly reflecting surfaces, chances are you will most likely never find the correct outgoing ray.

By knowing the material properties you can determine how to sample the outgoing ray. This sampling strategy can be a very easy optimization if you're willing to end up with an approximation. If you look at the 3 examples from the image above: one variable for the degree of reflectiveness can produce these examples.

The problems with this reflectiveness variable for path tracing are the weights shown in the glossy example. Solving the Monte Carlo problem of path tracing requires a probability distribution function (PDF) over these samples. The outer rays in the glossy examples are shown smaller, because they affect the outcome less than the rays from the perfect reflection.

This optimization is as simple or as hard as you want to make it. The simpler you keep the sampling, the more approximate your solution is going to be. The best solution would be to sample your chosen BRDF. The cosine weighted sampling is shown to be perfect for diffuse only materials (diffuse materials have perfect lambertian reflectance). I found a sampling method for the Cook-Torrance BRDF in this paper: Microfacet Models for Refraction through Rough Surfaces.

Direct and indirect lighting

Direct lighting is easy to calculate. This is shown in almost all released 3D games out there. Simply calculating a dot product of the surface normal and light direction would result in perfect lambertian shading.

In path tracing we can apply an optimization to allow for direct light sampling. It's called next event estimation. This is where the math and statistics get nasty. To allow for direct light sampling, we have to apply a different probability distribution function. We have to sample the light and divide the contribution of direct light by the area of the light:

Area sampling. Image taken from [2]

Whereas the indirect illumination should be sampled by the original method using hemispheres. With one exception, all of the rays hitting the light should be discarded (since the light gets sampled by the other PDF from above).

Hemisphere sampling without direct light sampling (dashed lines). Image taken from [2]

These examples show only sampling of a single light, which is easy to maintain. Imagine a scene with thousands of lights and one sun. If you apply the same theory of the two PDFs from above, you would end up sampling more inefficient than the brute force sampling. The solution is rather simple, we add weights to each light. But these have to be based on different criteria: only adding weights based on distance could influence the effects of the sun (large distance, huge brightness) versus normal lights (small distance, small brightness). Other small optimizations could be added based on needs: you don't have to continually sample the sun during night time and so on. If you like to know the mathematics and details behind these two PDFs, I suggest you read the presentations from the references below.

Multiple importance sampling
In larger applications, thousands of material types and BRDFs are going to be used. Having a different sampling strategy for each of them is going to be time consuming and frankly undo-able. We already have two different PDFs to sample all of these materials, the direct and indirect sampling. The problem that arises from all of this shown in images:

These are the two sampling methods shown from above, the BSDF sampling is the hemisphere indirect sampling and the light source sampling is the direct sampling using the area of the light. The problem is the glossy reflection on the four plates. You can see that indirect sampling works if the light is large enough to have some collisions with rays, and the direct light source sampling works fine for small area lights.

Multiple Importance Sampling is a technique that can combine different sampling techniques to find a low variance estimate of the integral. By combining both sampling methods we can get the following result:

Images taken from [4]

Bidirectional path tracing
This leads us to more recent techniques in path tracing, namely bidirectional path tracing. This technique is based on the inverse of path tracing, instead of finding the light by tracing rays from the camera, find the camera by tracing from the light. These are the shooting rays, and the original camera rays the gathering rays. The bidirectional part indicates that these two types of rays can be combined.

This technique wasn't meant for real time path tracing applications which require responsiveness. Since for every gathering ray you shoot, the amount of calculation is doubled by also tracing shooting rays. This nearly doubles the time per frame, and thus decreases responsiveness. While you lose responsiveness, this algorithm increases the convergence rate for path tracing a lot, so in the end it's worth it.

As extension to bidirectional path tracing, there is an algorithm called Metropolis light transport. This algorithm allows finding more difficult light paths by extending existing paths. An example: A shooting and gathering ray have been found by the bidirectional raytracing method. Every bounce in these rays is recorded as a node. The metropolis light transport can add extra nodes on these rays to modify the existing ray to a new one. These nodes are placed in areas with high gradients. You can find more on this technique in this paper: Metropolis Light Transport.

Another extension I found, but have not fully explored is: Light Transport Simulation with Vertex Connection and Merging. The basic idea is to reuse existing paths by connecting vertices. This can increase the amount of generated paths per pixel:

Vertex merging: reusing paths to increase paths per pixel. Image taken from [6]

There is an opensource implementation of VCM called smallVCM. This implementation also contains multiple importance sampling and normal bidirectional path tracing.

[3] Walter, Bruce, et al. "Microfacet models for refraction through rough surfaces."Proceedings of the 18th Eurographics conference on Rendering Techniques. Eurographics Association, 2007.

[4] Veach, Eric. Robust monte carlo methods for light transport simulation. Diss. Stanford University, 1997.

[5] Veach, Eric, and Leonidas J. Guibas. "Metropolis light transport." Proceedings of the 24th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1997.

[6] Georgiev, Iliyan, et al. "Light transport simulation with vertex connection and merging." ACM Trans. Graph. 31.6 (2012): 192.

[7] Lehtinen, Jaakko, et al. "Gradient-domain metropolis light transport." ACM Transactions on Graphics (TOG) 32.4 (2013): 95.

[8] Manzi, Marco, et al. "Gradient-Domain Bidirectional Path Tracing." (2015). [9] Kettunen, Markus, et al. "Gradient-domain path tracing." ACM Transactions on Graphics (TOG) 34.4 (2015): 123.

In part 3, I've shown some examples on how to tune algorithms on the GPU. Here I would like to address how we can apply those rules to optimize path tracing. As usual, I will show code samples for CUDA, but this doesn't mean it can't be applied on any other graphics programming language.

Fortunately for us, there is an opensource framework for optimized GPU based ray traversal. The framework is from the research paper Understanding the Efficiency of Ray Traversal on GPUs. In 2012 they added Kepler and Fermi Addendum, which added some changes to the framework to optimize traversal for those architectures. In this post I will be looking at the version for the kepler architecture. The framework is quite heavy on magic numbers and hard to read code. In this post I'm going to dissect some of the more important parts and how they improve performance.

Data structures

The first thing to notice is what kind of spatial data structure they are using. From the name of the class: "SplitBVHBuilder" you can already see that the builder is based on this paper. It comes down to a simple binned BVH builder, including triangle splits to optimize traversal. For more information on BVH techniques, you can read part 2 of this series.

More important for this post is how you store this BVH on the GPU and access it without creating any memory latency. If we look at the ray traversal, the following code shows how the nodes are read from memory:

Recalling from part 3, I showed you a similar statement for memory alignment using the fact that arrays in global memory didn't have data fragmentation. Reading memory from a texture using this principle is more natural, because textures ensure no data fragmentation. As seen in the comments above, the information gathered contains two BVH bounding boxes and two child indices. Storing the data this way ensures you have four 128 bit reads and one 64 bit read, since the last two values are unused.

The other data structure I would like to point out is the way they store triangles. The nodeaddr variable can store either positive or negative values. The positive indicate an index in the BVH and negative values store indices to the triangle array. The actual index is found by using the bit negate function ($\sim$) on the index. From this address they fetch the triangle using similar code as above:

// Tris in TEX (good to fetch as a single batch)
const float4 v00 = tex1Dfetch(t_trisA, triAddr + 0);
const float4 v11 = tex1Dfetch(t_trisA, triAddr + 1);
const float4 v22 = tex1Dfetch(t_trisA, triAddr + 2);

In which you can store the data required for a ray-triangle test and execute it. The thing to note is: BVHs can store more than one triangle per node, especially SBVHs. In order to store this efficiently in one array, they inserted breaks in the array. The code for checking all triangles:

for (int triAddr = ~leafAddr;; triAddr += 3)
{
// Read triangle data (see above)
// End marker (negative zero) => all triangles processed.
if (__float_as_int(v00.x) == 0x80000000)
break;
// Ray-triangle intersection (see part 1)
}

The for loop determines the triangle address and loops endlessly increasing the triangle address. If the first address contains a stop value (they chose for negative zero, but any distinguished value will do), stop the for loop. Simple as that, the array can now contain more than one triangle per node.

Traversal algorithm

The traversal algorithm is as follows: beginning this algorithm assumes the ray hit the BVH containing the model we wish to test for. If it didn't, it's not that bad, after two ray-box intersections the algorithm stops.

Starting from the first node address, we read the first two child bounding boxes using the code above

Using the optimized intersection test for the GPU from the paper, we test ray-box intersections.

If the ray intersects one of them, we change the node address and continue the algorithm.

If we intersect both children, store one of the children on the stack and continue traversing the other.

If we intersect none, pull a child index from the stack. If the stack is empty, stop the algorithm.

While this algorithm is pretty easy to implement, optimizing it for the GPU is quite a hassle. In order to minimize divergence, they thought of two neat tricks to alter this algorithm. The first trick is persistent threads. This concept originated from the following thought: A warp will execute until all threads inside it are done processing. In a path tracer like ours this can lead to horrific situations: 31 threads waiting for just the one to be done tracing.

The idea to combat this is quite simple: if there are less than X threads active, fetch new work for the idle threads. Using older graphics hardware, the implementation would take some time for all threads to communicate their status using shuffle functions. Since kepler architecture there are some warp voting functions introduced. From the CUDA toolkit:

__all(predicate):

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.

__any(predicate):

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.

__ballot(predicate):

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

Which comes in very useful for this task. The code that remains to execute this test:

Where __popc is a population count, counting the number of bits equal to 1 from the provided value.

The other trick in the algorithm to minimize divergence is part of triangle checking. It comes down to timing: when to check for triangles. In theory, the threads in the warp could be divergent: one thread checking ray-triangle intersections and the others traversing the BVH. While this is not completely preventable, we could help the algorithm postpone checking triangles until all threads have at least one triangle to check.

By creating an array (or in this case a variable) to store a triangle address, we can continue traversing the BVH until other threads have also found a triangle. The code in the source uses a direct compiled statement to check this, since it wouldn't be compiled in an optimal way. The code that you would need to check if all threads have a triangle is simple:

if(!__any(leafAddr >= 0))
break;

Where leafAddr indicates a triangle address. The statement simply checks if there is a thread for which the address is larger than or equal to 0. If there are none, it means all threads have found a triangle to check and we can stop traversing the BVH.

I hope to have cleared up some of the magic parts of the source code. Of course, if you have any more questions about it, feel free to post them below.

In part 2, I showed you that you can speed up your ray tracing algorithm by means of a spatial data structure. In this post I would like to address optimizing GPU code using CUDA. Below, I give some general tips to boost parallel performance. Be aware that these points will not always be relevant. I will explain why, and under which conditions these points can optimize your algorithm:

Try to avoid divergent statements and recursive algorithms.

Keep track of memory alignment for reads and writes of large datasets.

Divergence

Why is this so important? Most of the performance issues can be devoted to divergence. To explain divergence, I'm going to show a simple example. Say we want to multiply all even numbers by 2, and divide all uneven number by 2. A simple parallel reduction creates us the following:

__device__ void divergent(int* array, int nNumbers)
{
int idx = threadIdx.x + blockDim.x * blockIdx.x;
for (int i = idx; i < nNumbers; i += blockDim.x*gridDim.x)
{
if (array[i] % 2 == 0)
array[i] *= 2;
else
array[i] /= 2;
}
}

CUDA can run thousands of threads in parallel, but it comes with a small trick called warps. These warps also exist in other GPU specific languages like OpenCL or compute shaders, but they are referred to as wavefronts. CUDA launches a set of 32 threads as one simultaneous bunch called a warp. Here is step by step what happens while executing one warp:

The 32 threads each calculate their respective idx.

The starting point of the for loop is the same as idx.

The first conditional statement: all of the threads for which this statement is true (16 threads) will execute, the other threads (16) are idling.

The second conditional statement: same as the first, only the other way around.

Increase the loop index, and repeat from 3 until the end of the loop.

In total we've just executed the same amount of statements as in a serial variant of this algorithm. However, the serial variant does not have any idle time. If you would sum the total time spent executing this algorithm for both variants, the serial algorithm would take less total time than this parallel variant. Simply because half of the threads are idling all the time!

This is where the term divergence originated from: divergent statements blocking full parallel execution in warps or wavefronts. In the example above, we have at maximum 50% divergence, meaning half of the threads follow a different path than linear code execution. So, how do we get the algorithm from above to execute in parallel without divergence? Fairly simple:

__device__ void divergent(int* array, int nNumbers)
{
int idx = threadIdx.x + blockDim.x * blockIdx.x;
idx *= 2; // All even numbers
for (int i = idx; i < nNumbers; i += blockDim.x*gridDim.x*2)
array[i] *= 2;
idx++; // All uneven numbers
for (int i = idx; i < nNumbers; i += blockDim.x*gridDim.x*2)
array[i] /= 2;
}

By removing the divergent statements, we get code that leaves no thread idling. Of course this was a simple example, and there are many branching statements which are impossible to solve the same way as above.

Recapping on point 1: Try to avoid divergent statements and recursive algorithms. It's incredibly easy to make a recursive call which causes divergence in the parallel execution. Keep in mind that the recursive algorithms are just a point of view: if you can guarantee non-divergent recursive calls, there is nothing wrong with recursive calls!

Memory alignment

To explain memory alignment I'm going to use an example from raytracing. In order to optimize my path tracer, I had to split the GPU code in different kernels for execution. I will detail the process in the next post. For now, we have to save the ray state in a datastructure as small as possible to avoid high memory latency. This is what I came up with:

struct RayState
{
Ray ray; // contains position and direction (both 3D vectors)
float minDist;
int triangleID;
};

The total size of this structure is a simple sum: the amount of floats and integers times their size in bytes. I'm assuming that an integer and float have the same size, because in most compilers they do. With this assumption, in total we have 8 times the size of a float, which is usually 32 bits or 4 bytes.

If you would use this structure to load and save data, there is a small problem: this structure will read in 4 bytes a time, since we did not tell CUDA there is any alignment whatsoever. You can profile this behavior with NSight, which shows the following in memory transactions:

Our tactical choice lead CUDA to read in by the lowest alignment, which is 32 bits. This can be a problem, if you look at this graph:

This was measured for a Tesla C2050, which is a bit outdated by now, but by no means a bad statistic to look at. There is a clear trend showing from this graph: the lower the amount of bits read per transaction, the slower memory access speed. This statement is only true if the amount of threads per multiprocessor is low. This is true in our optimized path tracer, but again, more on this in the next post.

This is where the magic happens. As described in the endless documentation for CUDA (CUDA programming guide), we can declare aligned memory on the device. Which allows us to read our list of structures as an array, if you know the memory size of a single element. We can read our structure as if it is an array of float4 elements (128 bits). The following piece of code does just this:

// Interpret as float4 array, and transform to structure
float4 r1 = ((float4*)rays)[rayidx * 2];
float4 r2 = ((float4*)rays)[rayidx * 2 + 1];
float3 raypos = make_float3(r1.x, r1.y, r1.z);
float3 raydir = make_float3(r1.w, r2.x, r2.y);
float2 data = make_float2(r2.z, r2.w);

I read in two float4 elements from the array by converting the index. This is a simple trick: we know the total size of the structure, which is 256 bits. We are trying to read in 128 bit elements, so the index has to be multiplied by two to get a correct index for a float4. A profiling session now shows the following data:

Which means the data is now read in as aligned 128 bit reads. Mission successful! Recapping on point 2: Keep track of memory alignment for reads and writes of large datasets. We've successfully read in aligned data on the GPU. However, there is room for more memory optimization by using data locality. Since this has already become quite a long post, I will postpone this for another time.