Transitive Closure Sample in C++ AMP

 

In a directed graph, determining transitive closure or connectivity between the nodes is an important problem. Transitive closure is required in reachability analysis of networks representing distributed and parallel systems. In this blog post I am sharing a transitive closure sample implementation using C++ AMP. The sample is based on the research paper: All-Pairs Shortest-Paths for Large Graphs on the GPU (pdf).

main – Entry point

This is a driver function which will generate input data randomly, then compute the transitive closure on the CPU using the generated data. Then it initiates the computation on the GPU using the same generated data and eventually it compares the results between the CPU and the GPU computations.

verify_transitive_closure – Verify result

This function does two verifications:

  • Verifies the CPU results against the GPU results for connectivity between nodes i.e. whether they are connected or not connected, and if connected directly or indirectly.
  • Using the GPU results, for any indirectly connected nodes, it recursively verifies if the path between the nodes in question actually exist.

 

transitiveclosure_cpu_naive - C++ Serial

This algorithm has three nested for loops, where the outer loop index is used as an intermediate node which may connect two nodes in graph. The inner two loops run through the graph and for any unconnected node, check if they are connected via the intermediate node (specified by the outer for loop). If the two nodes are connected via intermediate node, update the connected nodes with the intermediate node index and mark the two nodes as being connected indirectly.If there are N nodes in a graph, connectivity information between nodes is stored in 2D array (of size NxN), for example, the connectivity information between node i and j are stored in an array graph[i, j] .

transitiveclosure_amp_tiled - C++ AMP Tiled Implementation

This is a three stage algorithm which is best described in the research paper linked at the start of this blog post. A key point mentioned in the paper and implemented in this sample is reduced memory copy operations. In the sample, the copy-in to the GPU global memory happens only at the beginning and the copy-out happens only at the end of function. There is no copy between different invocations of parallel_for_each; once an array is allocated in GPU global memory and data is copied there, then subsequent kernel invocations reuse this array without additional copies.

Download the sample

Please download the attached project of the Transitive Closure that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.


TransitiveClosure.zip

Comments

  • Anonymous
    August 13, 2012
    The comment has been removed

  • Anonymous
    August 13, 2012
    Hi Bernd Thank you for your question, please allow me to offer some generic comments, while addressing the specific question you asked.    1. Our samples are aimed to help folks get started with the API, they are definitely not aimed to be the fastest best possible implementation of an algorithm. This sample simply translates the algorithm from the paper it references to C++ AMP.    2. When measuring C++ AMP code, there are a bunch of extra considerations to take into account (beyond just using the timer), and the sample doesn’t do those. I don’t know if you did when you modified the code. Please start here and the links it points to: blogs.msdn.com/.../data-warm-up-when-measuring-performance-with-c-amp.aspx    3. To get speed up from offloading computations to the GPU, it typically takes really large amounts of data. For example, for this sample try changing the num_vertices from (1 << 6) to (1 << 10). Since 64 seems to be too small a number, you should see benefits with 1024. Cheers Daniel

  • Anonymous
    August 14, 2012
    Hello Daniel, I recompiled TransitiveClosure using (1<<10) vertices and I also repeated the computation on the GPU, so that we can treat the first run as the "warm-up" phase as suggested by the article you were referring to. The results are now much better (see below). The 2nd GPU run is almost twice as fast as the "warm-up" run. Now, the only remaining question I have is about the 2x ATI Radeon 4850 Crossfire machine. Why does AMP C++ not support that graphics card? Is there a list of GPUs and graphic cards that AMP C++ support? Thanks,

  • Bernd Using device : NVIDIA GeForce GTX 590 Generating input data..Done Running the Transitive Closure Sample... Number of Vertices = 1024 Iterations = 1 Computation on CPU...Elapsed: 703.089ms Computation on GPU...Elapsed: 113.645ms Computation on GPU...Elapsed: 61.5971ms Using device : NVIDIA GeForce GTX 570 Generating input data..Done Running the Transitive Closure Sample... Number of Vertices = 1024 Iterations = 1 Computation on CPU...Elapsed: 808.335ms Computation on GPU...Elapsed: 107.151ms Computation on GPU...Elapsed: 64.7328ms