Parallel Reduction using C++ AMP
Reduction in this sample computes the sum of the elements in a vector. This post walks through different implementations of reduction and the optimizations that lead from one implementation to the next. The sample contains 8 different implementations of reduction, including a sequential CPU version and 3 categories of C++ AMP versions.
Please note that the goal of this post is to demonstrate various optimization techniques useful in C++ AMP. As the focus is on educational value, even the best performing implementation below could be improved further, at the cost of its comprehensibility. Reduction would also be much more useful if the input data were the result of other computation already on the GPU; in the examples below the cost of copying data from host memory can easily outweigh the performance gained in the computation.
main – Entry point
Here the input data is generated (as a pattern of a repeated consecutive sequence) and the expected sum is computed mathematically. Each implementation of reduction is invoked and its result is compared with the expected sum to verify its correctness.
sequential_reduction
The first implementation is a CPU implementation of reduction (in our case, a sum) using the STL function std::accumulate.
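For reference, a minimal sketch of this baseline could look as follows (the function name and types are illustrative rather than taken verbatim from the sample):

#include <numeric>
#include <vector>

// Single-threaded CPU baseline: fold all elements into a running sum.
float sequential_reduction(const std::vector<float>& source)
{
    return std::accumulate(source.begin(), source.end(), 0.0f);
}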
reduction_simple_1
This is an implementation of reduction using C++ AMP’s simple (non-tiled) model. The input data is stored in a concurrency::array and passed to the kernel. The kernel is invoked in a loop, and the loop control variable specifies the number of GPU threads to spawn: it starts at half the input data size and is halved with every iteration. Each thread calculates the sum of the element indexed by its thread id and another element at an offset (specified by the loop variable) relative to the thread id, like below:
a[idx] = a[idx] + a[idx + s];
As the algorithm does not require the input vector to have a power-of-two number of elements, a “tail” may be cut off after each iteration. Before that happens, the last thread in the dispatch reduces these elements into a separate tail sum.
After all kernel invocations, the sum is stored in the 0th element of the array. To minimize copy-out time, only the necessary partial results are copied out: an array_view of a section of the array is created and only that section is copied to CPU memory. In this case only the 0th element is copied out, which, after adding the tail sum, constitutes the final result.
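To make the flow concrete, here is a simplified sketch of how the host-side loop and copy-out could look; the identifiers are illustrative, and the tail handling described above is omitted for brevity (so this fragment assumes a power-of-two element count):

#include <amp.h>
#include <vector>
using namespace concurrency;

float reduction_simple_1_sketch(const std::vector<float>& source)
{
    unsigned element_count = static_cast<unsigned>(source.size());
    array<float, 1> a(element_count, source.begin());

    // Halve the number of threads with every pass.
    for (unsigned s = element_count / 2; s > 0; s /= 2)
    {
        parallel_for_each(extent<1>(s), [=, &a](index<1> idx) restrict(amp)
        {
            a[idx] = a[idx] + a[idx + s];
        });
    }

    // Copy out only the 0th element, which now holds the result.
    std::vector<float> result(1);
    copy(a.section(0, 1), result.begin());
    return result[0];
}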
reduction_simple_2
The observation on the previous algorithm is that each thread performs two reads and one write, and there are ≈log2N rounds of the reduction, for ≈3N memory accesses in total. The improvement in this implementation is to reduce more elements per thread, e.g. read eight values, sum them and write one result. In that case there are only ≈log8N rounds of reduction (note the different base) and ≈1.28N memory accesses in total. The heart of the kernel is then the following:
float sum = 0.f;
for(unsigned i = 0; i < window_width; i++)
{
    sum += a[idx + i * s];
}
a[idx] = sum;
The one drawback, though, is the more complex computation of the tail sum, which is effectively executed serially and leads to notable thread divergence.
Ideally, the size of the problem passed to the reduction algorithm should be a power of the window_width reduced in every kernel, making the tail computation unnecessary. This is not as unrealistic as it sounds, as there is a systematic way to approach this requirement for any problem size: the reduction starts with the largest window width for the biggest possible subproblem and recurses with smaller windows for the remaining part (possibly all the way down to reduction_simple_1). Finally, all subproblem results may be reduced on the CPU. However, this approach requires some fine-tuning for the best performance and it has been omitted from the sample for the sake of simplicity.
After all kernel invocations, the partial sums are stored in the first K elements of the array. As previously, only the interesting part is copied out and summed on the CPU, which, together with the tail sum, yields the final result.
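For illustration, a simplified end-to-end sketch of this variant could look as follows, assuming the element count is a power of window_width so that no tail ever appears (identifiers are again illustrative):

#include <amp.h>
#include <numeric>
#include <vector>
using namespace concurrency;

float reduction_simple_2_sketch(const std::vector<float>& source)
{
    const unsigned window_width = 8;
    unsigned element_count = static_cast<unsigned>(source.size());
    array<float, 1> a(element_count, source.begin());

    unsigned live = element_count;          // elements still holding live data
    while (live >= window_width)
    {
        unsigned s = live / window_width;   // thread count and stride for this pass
        parallel_for_each(extent<1>(s), [=, &a](index<1> idx) restrict(amp)
        {
            float sum = 0.f;
            for (unsigned i = 0; i < window_width; i++)
            {
                sum += a[idx + i * s];
            }
            a[idx] = sum;
        });
        live = s;                           // this pass leaves s partial sums
    }

    // Copy out only the live prefix and finish the reduction on the CPU.
    std::vector<float> partial(live);
    copy(a.section(0, live), partial.begin());
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}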
reduction_tiled_1
In this implementation, tiling is used to schedule GPU threads and tile_static memory is used to communicate between threads in a tile. The main purpose of this is to demonstrate some hardware-related caveats in using these features, which are addressed one by one in the subsequent implementations.
The algorithm allocates two concurrency::arrays used alternately as the input and the output of the kernel. Although the problem size decreases after each reduction step, it is preferable to keep unused storage in the arrays rather than reallocate them after each iteration. Additionally, two array_views named av_src and av_dst are created, pointing to the arrays. Since they merely point, swapping them is a quick operation, and the kernel can invariably use av_src as the input and av_dst as the output, unaware of which array it actually writes to.
The kernel is called in a loop where in every iteration the number of threads is reduced by a factor of _tile_size. During execution, all threads in a tile cooperatively read elements from global memory into a tile_static array once and operate on the data in tile_static memory afterwards. Each thread copies the element indexed by its global thread id into tile_static memory and waits on a barrier until all threads in the tile finish copying. From then on, only the threads whose local thread id is a multiple of twice the stride participate in the reduction, with the stride doubling in every iteration. Below is the code from the kernel:
for(int s = 1; s < _tile_size; s *= 2)
{
    if (tid % (2*s) == 0)
    {
        tile_data[tid] += tile_data[tid + s];
    }
    tidx.barrier.wait();
}
In the first iteration every 2nd thread in the tile reduces by summing the element indexed by its tile-local thread id and its neighbor (stride 1). In the second iteration every 4th thread sums the element indexed by its tile-local thread id and its 2nd neighbor (stride 2). In the third iteration every 8th thread sums the element indexed by its tile-local thread id and its 4th neighbor (stride 4), and so on.
After this loop, the 0th thread in the tile will write the tile’s partial result to GPU global memory using the tile id.
The process is repeated until the number of partial results becomes less than or equal to the tile size. The partial results are then copied out to a vector and the final reduction is done with std::accumulate on the CPU.
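Putting the pieces together, the kernel body for this variant could be sketched roughly as follows; av_src and av_dst are the array_views described above, thread_count stands for the number of threads dispatched in the current pass, and the value of _tile_size is only a placeholder:

static const int _tile_size = 512;  // assumed value; must be a compile-time constant

parallel_for_each(extent<1>(thread_count).tile<_tile_size>(),
    [=](tiled_index<_tile_size> tidx) restrict(amp)
{
    // Each thread loads one element from global memory into tile_static storage.
    tile_static float tile_data[_tile_size];
    int tid = tidx.local[0];
    tile_data[tid] = av_src[tidx.global[0]];
    tidx.barrier.wait();

    // Tree reduction within the tile (the loop shown above).
    for (int s = 1; s < _tile_size; s *= 2)
    {
        if (tid % (2 * s) == 0)
        {
            tile_data[tid] += tile_data[tid + s];
        }
        tidx.barrier.wait();
    }

    // The 0th thread writes the tile's partial result, indexed by the tile id.
    if (tid == 0)
    {
        av_dst[tidx.tile[0]] = tile_data[0];
    }
});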
reduction_tiled_2
This implementation is similar to reduction_tiled_1 but minimizes idle threads in a tile. In the kernel of reduction_tiled_1, the number of active threads is halved in each iteration and the active threads are dispersed across the entire tile. In this implementation the active threads are kept consecutive within a tile, which reduces divergence within the GPU scheduling unit (aka warp or wavefront) and improves resource utilization. Additionally, the local thread id is no longer used directly as an index to access elements; instead an index is calculated from the local thread id and the stride (the loop control variable). Apart from this change in the kernel, the kernel invocation loop, the copying out of results and the final summation are the same as in reduction_tiled_1.
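Under these assumptions, the inner loop could be rewritten along the following lines; this is a sketch of the commonly used indexing scheme for this optimization, not necessarily the exact code of the sample:

for (int s = 1; s < _tile_size; s *= 2)
{
    // Map consecutive threads onto the active pairs so that the working
    // threads stay contiguous within the tile.
    int index = 2 * s * tid;
    if (index < _tile_size)
    {
        tile_data[index] += tile_data[index + s];
    }
    tidx.barrier.wait();
}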
reduction_tiled_3
This kernel improves performance by reducing shared memory bank conflicts. This is achieved by having consecutive threads access consecutive tile_static memory locations, which map to different banks of shared memory on the GPU. Consecutive tile_static memory locations are accessed by consecutive threads as shown below:
for(int s = _tile_size / 2; s > 0; s /= 2)
{
    if (tid < s)
    {
        tile_data[tid] += tile_data[tid + s];
    }
    tidx.barrier.wait();
}
reduction_tiled_4
In the above kernels, as the innermost loop progresses the number of idle threads increases. One way to address this is to spawn fewer GPU threads; with fewer threads, each thread has to operate on more data. In this implementation, the number of GPU threads spawned is half that of reduction_tiled_3. Instead of each thread reading only one element into shared memory, this kernel reads two elements from GPU global memory, sums them and stores the sum in shared memory. All threads then calculate the partial sum within the tile as in reduction_tiled_3.
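The loading step could then be sketched roughly like this, reusing the identifiers from the earlier kernels (the index arithmetic is illustrative):

tile_static float tile_data[_tile_size];
int tid = tidx.local[0];

// Each thread performs its first addition while loading: two reads from
// global memory, one write to tile_static memory. Every tile now covers
// 2 * _tile_size input elements.
int base = tidx.tile[0] * (_tile_size * 2) + tid;
tile_data[tid] = av_src[base] + av_src[base + _tile_size];
tidx.barrier.wait();

// The in-tile reduction then proceeds exactly as in reduction_tiled_3.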
reduction_cascade
This kernel is an improvement and an extension of reduction_tiled_4. Instead of performing only one extra addition per thread while halving the number of threads, we take the idea to a higher degree. This can be further tuned to balance the number of threads against the number of elements each thread has to process, to achieve good performance for a specific scenario.
In this implementation there are no repeated kernel invocations. A predefined number of tiles is scheduled. Each thread in each tile repeatedly reads elements from the input structure, with a stride equal to the number of tiles times the tile size, and reduces them into its corresponding tile_static memory location. Once the whole input array is processed, the tile result is computed in the same way as in reduction_tiled_4 and written to an output array, which effectively contains the partial result for every tile. The final result is obtained by reducing this structure on the CPU.
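A rough sketch of this strategy is shown below; _tile_count, element_count, av_src and av_partial are illustrative names, and the actual sample may differ in details:

static const int _tile_size  = 512;   // assumed values; both must be
static const int _tile_count = 128;   // compile-time constants

parallel_for_each(extent<1>(_tile_count * _tile_size).tile<_tile_size>(),
    [=](tiled_index<_tile_size> tidx) restrict(amp)
{
    tile_static float tile_data[_tile_size];
    int tid = tidx.local[0];

    // Each thread strides through the whole input with a stride of
    // tile_count * tile_size, reducing into its own tile_static slot.
    tile_data[tid] = 0.f;
    for (int i = tidx.global[0]; i < element_count; i += _tile_count * _tile_size)
    {
        tile_data[tid] += av_src[i];
    }
    tidx.barrier.wait();

    // Reduce within the tile as in reduction_tiled_4 ...
    for (int s = _tile_size / 2; s > 0; s /= 2)
    {
        if (tid < s)
        {
            tile_data[tid] += tile_data[tid + s];
        }
        tidx.barrier.wait();
    }

    // ... and write one partial result per tile; av_partial is then reduced on the CPU.
    if (tid == 0)
    {
        av_partial[tidx.tile[0]] = tile_data[0];
    }
});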
Download the sample
Please download the attached sample project of the reduction discussed here, run it on your hardware, and try to understand what the code does and learn from it. You will need, as always, Visual Studio 11.