C++ AMP: N-Body Simulation Sample

My name is Bharath, and I am an SDET on the C++ AMP team.

I worked on the NBody demo and wanted to share this project with you. You may have seen our PM (Daniel Moth) demonstrate it at the AMD Fusion Developer Summit and at the Microsoft Build conference (for the NBody demo, watch 0:03:00 through 0:08:00).

I ported the source code from the Microsoft DirectX SDK sample (you can download the full DirectX SDK). You can find more information on N-body simulation on Wikipedia.

You can download the project with all sources from the zip file attached to this blog post. Refer to the README.txt for known issues. To build this project you need Visual Studio 11.

In later posts, I will walk you through this code and explain the different implementations used in this demo. Stay tuned.

 

NBody.zip

Comments

  • Anonymous
    October 03, 2011
    The comment has been removed

  • Anonymous
    October 04, 2011
    Hi Tim, thanks for correcting the SSE4 implementation. I have updated the sample with the corrected version.

  • Anonymous
    November 07, 2011
    When I run the NBody C++ AMP sample on a DirectX 11 capable 4-core GPU card in Win 7, the GFlops and FPS performance is significantly worse in "GPU Multi Device" than in "AMP Tiled" mode. I stepped through the source in VS11 Developer Preview: the multi-GPU function is being called instead of the single-core one, and it goes through a simple for loop over 4 GPU devices (_ndevices is equal to 4). However, should it be using PPL parallel_for_each instead of a simple for loop to iterate through the 4 cores? Can someone confirm this? How can the NBody sample utilize multi-core GPU capabilities?

    void nbody::amp_multi_gpu(particle* render_data, int num_bodies)
    {
        int size = ((int)((num_bodies/TILE_SIZE)/_ndevices)*TILE_SIZE);
        for (int i = 0; i < _ndevices; i++)
        {
            tiling_implementation((*_pold[i]), (*_pnew[i]), i*size, size, num_bodies);
        }
        for (int i = 0; i < _ndevices; i++)
        {
            index<1> begin(i*size);
            extent<1> end(size);
            array_view<particle, 1> wrSrc = (*_pnew[i]).section(grid<1>(begin, end));
            for (int j = 0; j < size; j++)
            {
                render_data[j+(i*size)] = (wrSrc.data())[j];
            }
        }
        for (int i = 0; i < _ndevices; i++)
        {
            copy(render_data, (*_pold[i]));
        }
    }

    Thanks.

  • Anonymous
    November 07, 2011
    The multi-device option is only useful when you have more than one GPU on your system. So when you say "4-core GPU card", do you really mean 4 discrete GPU cards? If not, then this option will not result in a speedup and will instead cause a slowdown. If you do indeed have 4 cards, are they all the same? This sample assumes that all cards have exactly the same specification, since it statically splits the data equally between all cards. Also, for that much horsepower, you need to increase the number of particles so you can saturate your system; you can do that through MAX_GPU_PARTICLES in NBodyGravityCS11.cpp. Regardless of the above points, you are right, I could have used parallel_for in this sample. I'll revisit this code for the Beta release (whenever that is). Thanks for the feedback.
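
    The static split mentioned above can be illustrated in plain C++. This is only a sketch: it reads the size expression in the question as ((num_bodies / TILE_SIZE) / _ndevices) * TILE_SIZE, the helper names bodies_per_device and first_body are hypothetical, and the TILE_SIZE value of 256 is an assumption (the sample's real value is not shown in this thread).

```cpp
#include <cassert>

// Assumed tile size for illustration only; the sample's TILE_SIZE may differ.
constexpr int TILE_SIZE = 256;

// Bodies handled by each device when the work is split evenly and rounded
// down to a whole number of tiles (integer division on purpose).
int bodies_per_device(int num_bodies, int ndevices) {
    assert(num_bodies >= 0 && ndevices > 0);
    return ((num_bodies / TILE_SIZE) / ndevices) * TILE_SIZE;
}

// Index of the first body owned by the given device (hypothetical helper).
int first_body(int device, int num_bodies, int ndevices) {
    return device * bodies_per_device(num_bodies, ndevices);
}
```

    With these assumptions, 20480 bodies on 4 devices gives 5120 bodies per device. Because of the integer division, a count that is not a multiple of TILE_SIZE times the device count leaves some bodies outside every slice in this sketch (20000 bodies on 4 devices covers only 4 * 4864 = 19456).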

  • Anonymous
    March 01, 2012
    I tried to compile the NBody project in blogs.msdn.com/.../c-amp-sample-projects-for-download.aspx. VS 11 Beta reports missing d3dx11.h and some other DirectX files. When will the projects under the URL be updated? For now, what should be done to get the project compiled? Thanks.

  • Anonymous
    March 01, 2012
    Hi P. So, to build the NBody demo you need to install the DirectX SDK. The compiler error you are seeing is due to a missing DirectX SDK header file. From README.txt, software requirements: install the June 2010 DirectX SDK from MSDN (www.microsoft.com/.../details.aspx) and install Visual Studio 11 from http://msdn.microsoft.com.

  • Anonymous
    March 02, 2012
    Thanks Bharath. I do have the June 2010 DirectX SDK installed on my computer. I had to add the SDK installation path to the NBody project file in order for the compiler to find the header files. This step is not needed when building the project with VS 11 Developer Preview. Also, the changes in C++ AMP break the build process. I would appreciate it if you could post an updated NBody project for VS 11 Beta. Thank you again.

  • Anonymous
    March 02, 2012
    Hi P. So, after installing the DirectX SDK, you should restart Visual Studio. This is necessary because the DirectX SDK defines environment variables which are used in the new project. This wasn't necessary for VS 11 Developer Preview because there weren't significant changes affecting our dependencies. As for your second comment, can you post the messages from the VS output window? This will give us a better picture of what you are experiencing. Since you mention C++ AMP breaking changes, I think you may have downloaded the project before I updated it. Please try downloading it again. Thanks, Bharath

  • Anonymous
    March 02, 2012
    Thanks.  The updated project works fine.

  • Anonymous
    March 18, 2012
    The sample has been updated for the multi-GPU scenario to use parallel_for instead of a serial for loop. This enables parallel kernel invocation on different GPUs and also parallel copy in/out of data.
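
    The idea behind that change can be sketched in portable C++. This is an illustration only, not the sample's code: it uses std::thread in place of PPL's parallel_for so that it compiles without Visual Studio, and for_each_device_parallel is a hypothetical helper name.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Calls work(i) once for each device index i in [0, ndevices), each call on
// its own thread, and returns only after every call has finished.
void for_each_device_parallel(int ndevices, const std::function<void(int)>& work) {
    if (ndevices <= 0) {
        return; // nothing to dispatch
    }
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(ndevices));
    for (int i = 0; i < ndevices; ++i) {
        // Capture i by value so each thread sees its own device index.
        workers.emplace_back([&work, i] { work(i); });
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

    A serial loop waits for each device's kernel launch and copy before starting the next one; dispatching per device lets those calls overlap. Under Visual Studio the same shape would presumably be written with concurrency::parallel_for from <ppl.h>, with the per-device body unchanged.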

  • Anonymous
    March 27, 2012
    When I run the app, it tells me it's using the "reference" driver and performance will be slow. (It's correct on that point; it's so slow it's useless.) Looking into it, it seems there is no support for DX10 drivers. The DXUT library used only has checks for DX11 and DX9 but ignores DX10. What exactly are the requirements for AMP to work? Is there any sort of app or anything to test whether a system should support it?

  • Anonymous
    March 27, 2012
    Steve, as you found out, Microsoft's implementation of C++ AMP runs on DirectX 11 targets. Examples of such hardware can be found on this blog post: www.danielmoth.com/.../What-DX-Level-Does-My-Graphics-Card-Support-Does-It-Go-To-11.aspx

  • Anonymous
    May 26, 2012
    May 26 2011 The NBody.zip file has a size of zero bytes. Perhaps it is about to be updated?

  • Anonymous
    May 28, 2012
    After downloading the zip file, I cannot unzip it. Please check it.

  • Anonymous
    May 29, 2012
    Hi Paul, the sample has just been uploaded. Thanks.

  • Anonymous
    September 23, 2012
    I've been testing this sample on both an HD 5870 and a 7970. Strangely, the tiled version runs 2 times slower than the simple version, whereas it should run much faster due to the use of shared memory. The original DX11 CS sample does not have this issue. Are there known performance issues that could explain this?

  • Anonymous
    September 24, 2012
    Hi Jan, I cannot repro your issue. Are you using the latest driver? Could you please tell us your environment (Win7 SP1 or Win8 RTM? driver version? Win32 or x64?) And could you tell us the GFlops you observed on your machine (the simple and tiled versions for each of the 2 cards you mentioned)? Thanks,

  • Anonymous
    September 24, 2012
    Hi Kevin, I've found the issue: originally I was running in debug mode. In release mode things run much better: 285 / 920 GFlops for simple / tiled with 20000 particles on the HD 5870. I'm surprised debug performs so differently, as everything runs on the GPU. Maybe debug does some emulation?

  • Anonymous
    September 25, 2012
    If you run the code in debug mode, VS still uses the GPU. If you debug the code with "GPU only", VS uses the REF device (a simulator). DBG and RET builds use different code paths, which could lead to the difference you saw.

  • Anonymous
    October 19, 2012
    I guess there will be no follow-up on this? I would like to modify the source to include mass-dependent size and charge-dependent color, but I'm new to C++ AMP and DX and a bit lost on what to modify (I see the force calculation, but the colors are hard-coded).

  • Anonymous
    October 22, 2012
    The comment has been removed

  • Anonymous
    November 28, 2012
    "In later posts, I will walk you through this code and explain the different implementations used in this demo. Stay tuned." Any forward linking? I'd love it if this were released as a library without the visual representation. I'd like to be able to call it as function(method, particles, time), where method is the S-CPU/M-CPU/AMP/AMP-T/M-AMP version and time is the length of time for the simulation, and have it return the average FPS of the test.
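
    A headless call of the kind requested above might look like the following sketch. It is not part of the sample: Body, step_single_cpu and average_steps_per_second are hypothetical names, only a single-threaded CPU method is shown, the gravitational constant is taken as 1, and the softening term is an assumed value.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
    float mass;
};

// One all-pairs step on a single CPU thread: accumulate the softened
// gravitational acceleration on every body, then integrate (G assumed 1).
void step_single_cpu(std::vector<Body>& bodies, float dt) {
    const float softening2 = 1e-4f;  // assumed; avoids division by zero
    const std::size_t n = bodies.size();
    std::vector<float> ax(n, 0.0f), ay(n, 0.0f), az(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            const float dx = bodies[j].x - bodies[i].x;
            const float dy = bodies[j].y - bodies[i].y;
            const float dz = bodies[j].z - bodies[i].z;
            const float r2 = dx * dx + dy * dy + dz * dz + softening2;
            const float inv_r = 1.0f / std::sqrt(r2);
            const float s = bodies[j].mass * inv_r * inv_r * inv_r;
            ax[i] += dx * s;
            ay[i] += dy * s;
            az[i] += dz * s;
        }
    }
    for (std::size_t i = 0; i < n; ++i) {
        Body& b = bodies[i];
        b.vx += ax[i] * dt;  b.vy += ay[i] * dt;  b.vz += az[i] * dt;
        b.x += b.vx * dt;    b.y += b.vy * dt;    b.z += b.vz * dt;
    }
}

// Runs `steps` updates with no rendering and returns the average number of
// steps per second (the headless analogue of FPS). Returns 0 if no time
// could be measured or steps is not positive.
double average_steps_per_second(std::vector<Body>& bodies, int steps, float dt) {
    if (steps <= 0) return 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int s = 0; s < steps; ++s) {
        step_single_cpu(bodies, dt);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() > 0.0 ? steps / elapsed.count() : 0.0;
}
```

    A method parameter, as in the request, could select between this function and other implementations; that dispatch is left out here to keep the sketch short.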

  • Anonymous
    November 29, 2012
    @Ian, Hi Ian, thanks for expressing the need for a library version of the N-Body simulation sample. Currently we do not have any committed plans to convert the sample to a library as you describe. However, we are interested in understanding more about your project and how this library function call would help in such a case. Would you be willing to comment on that? You can contact me directly at bobyg AT Microsoft dot com if needed.

  • Anonymous
    November 29, 2012
    Unfortunately, there are no follow-up posts to this blog post yet. The C++ AMP book (www.gregcons.com/cppamp) has a detailed discussion of this sample, though.

  • Anonymous
    January 15, 2014
    I'd encourage you to look at the CPU implementation on the http://ampbook.codeplex.com site. The advanced CPU implementation there is a cache aware one and is significantly faster than the one shown here. It is also SSE2/4 enabled. Ade

  • Anonymous
    March 21, 2014
    I have installed VS 2013 and am trying to run this sample code, but I get the following error: Error 1 error C4996: 'GetVersionExW': was declared deprecated. I have installed the DirectX SDK; what could be the problem?

  • Anonymous
    November 25, 2014
    BharathM, can you give me code for the n-body problem? I need it, please.

  • Anonymous
    January 01, 2015
    The comment has been removed

  • Anonymous
    February 03, 2016
    The comment has been removed