Shared memory support in C++ AMP – Introduction
Several modern CPU parts, such as 3rd Gen (and later) Intel Core Processors (e.g. Ivy Bridge and Haswell) and all AMD APUs (e.g. Llano and Trinity), have an on-die integrated GPU accelerator that shares the same physical RAM with the CPU. On such parts, copying data from CPU memory for GPU access (and vice versa) is mostly unnecessary. The same applies to WARP, the C++ AMP CPU fallback accelerator.
As described in an earlier post, the concurrency::array_view type in C++ AMP is designed to free programmers from the burden of managing data transfers to and from accelerators. Users program against the array_view abstraction and leave it to the runtime to determine whether a copy is required to make the data accessible on the target hardware. However, before Visual Studio 2013 the C++ AMP runtime always copied data between the CPU and an accelerator, even on WARP and integrated GPU accelerators, where a shared system memory allocation can be accessed efficiently from both the CPU and the accelerator. Starting with Visual Studio 2013 on Windows 8.1, C++ AMP supports CPU/GPU shared memory to eliminate data transfers on such accelerators or significantly reduce their cost. I will share the details in a series of blog posts, starting with this one.
Querying shared memory support on a C++ AMP accelerator
With Visual Studio 2013 on Windows 8.1 (and later OS versions), C++ AMP accelerators can optionally support shared memory that is directly accessible from both the CPU and the accelerator. The following property on the accelerator type can be used to query shared memory support on a specific accelerator.
bool accelerator::get_supports_cpu_shared_memory() const
__declspec(property(get=get_supports_cpu_shared_memory)) bool supports_cpu_shared_memory
Returns a boolean flag indicating whether this accelerator supports shared memory that is accessible both on the accelerator and the CPU.
Automatic use of shared memory by C++ AMP runtime
On select accelerators, where the CPU/GPU memory access performance characteristics (bandwidth and latency) for shared memory are known to be the same as those of dedicated CPU-only or GPU-only memory, the C++ AMP runtime uses shared memory by default. This eliminates or significantly reduces the cost of copying data to and from such accelerators without requiring any changes to existing code. For Visual Studio 2013, this list comprises all 3rd Gen and later Intel Core Processors (e.g. Ivy Bridge and Haswell) with integrated GPU accelerators, and the WARP (C++ AMP CPU fallback) accelerator. The list of accelerators on which shared memory is used by default will be revised in the future to include new accelerators, depending on their shared memory access performance characteristics.
The following method of the accelerator type can be used to query the default CPU access_type for memory allocations. The default_cpu_access_type setting on an accelerator determines whether and how shared memory allocations are used by default when arrays are allocated on that accelerator or when array_view objects are accessed on it. In other words, this setting controls the default CPU accessibility of memory allocations on the accelerator.
enum access_type
{
    access_type_none,
    access_type_read,
    access_type_write,
    access_type_read_write = access_type_read | access_type_write,
    access_type_auto,
};
access_type accelerator::get_default_cpu_access_type() const
__declspec(property(get=get_default_cpu_access_type)) access_type default_cpu_access_type
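Returns the default CPU access_type for array allocations on this accelerator and for implicit allocations when array_view objects are accessed on this accelerator. The possible return values have the following meanings:
access_type_none: Dedicated. Allocations are not accessible on the CPU.
access_type_read: Shared. Allocations are accessible on the CPU and are readable.
access_type_write: Shared. Allocations are accessible on the CPU and are writable.
access_type_read_write: Shared. Allocations are accessible on the CPU and are both readable and writable.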
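Because access_type_read_write is declared as the bitwise OR of access_type_read and access_type_write, CPU readability and writability can be tested with bit masks. The following is a minimal standalone sketch of that idea. It uses a local mirror of the enum so that it builds without <amp.h>; the explicit numeric values and the helper names cpu_can_read, cpu_can_write, and combine are assumptions made for illustration and are not part of C++ AMP.

```cpp
#include <cassert>

// Local mirror of concurrency::access_type so the sketch builds without <amp.h>.
// The explicit values are assumptions: read and write only need distinct bits.
enum access_type : unsigned {
    access_type_none       = 0u,
    access_type_read       = 1u << 0,
    access_type_write      = 1u << 1,
    access_type_read_write = access_type_read | access_type_write,
    access_type_auto       = 1u << 31,
};

// Hypothetical helper: true if an allocation with this access type is CPU-readable.
constexpr bool cpu_can_read(access_type t) {
    return (t & access_type_read) != 0;
}

// Hypothetical helper: true if an allocation with this access type is CPU-writable.
constexpr bool cpu_can_write(access_type t) {
    return (t & access_type_write) != 0;
}

// Hypothetical helper: merges two CPU access requirements into one value.
inline access_type combine(access_type a, access_type b) {
    // access_type_auto asks the runtime to choose; it is not a combinable flag.
    assert(a != access_type_auto && b != access_type_auto);
    return static_cast<access_type>(a | b);
}
```

With the real enum from <amp.h>, the same masks could be applied to the value returned by accelerator::get_default_cpu_access_type().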
For integrated GPU accelerators in 3rd Gen and later Intel Core Processors (e.g. Ivy Bridge and Haswell) and the WARP accelerator, by default the above method will return a value of access_type_read_write.
On all other accelerators, the runtime does not use shared memory by default and the above method returns access_type_none. This means that, by default, such accelerators keep the Visual Studio 2012 behavior of copying data between the CPU and the GPU. However, the following API on the accelerator type lets users override the default CPU access_type used for all memory allocations underlying array/array_view objects on that accelerator.
bool accelerator::set_default_cpu_access_type(access_type _Default_cpu_access_type)
Sets the default CPU access_type for all array allocations (and implicit array_view allocations) on this accelerator’s accelerator_view(s). Note that this method can be used to set the default CPU access_type only if no array allocations have been made on this accelerator with the access_type_auto value and no array_view objects have been accessed on this accelerator. Also, this method throws a runtime_exception with the E_INVALIDARG error code if a _Default_cpu_access_type argument other than access_type_none is passed to an accelerator that does not support zero-copy. The function returns a boolean flag indicating whether the default CPU access type for the accelerator was successfully overridden.
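The check-then-override sequence described above can be sketched as a small helper. This is only an illustration under stated assumptions: try_set_default_cpu_access is a hypothetical name, and it is written as a template so that it builds without <amp.h>. The fake_accelerator and fake_access_type types are stand-ins invented for this sketch; they only imitate the three accelerator members named in this post.

```cpp
#include <cassert>

// Hypothetical helper. Works with any type exposing the accelerator members
// named in the post; with C++ AMP that would be concurrency::accelerator and
// concurrency::access_type from <amp.h>.
template <typename Accelerator, typename AccessType>
bool try_set_default_cpu_access(Accelerator& acc, AccessType requested) {
    // Requesting shared access on an accelerator without shared memory
    // support is an error, so query support first.
    if (!acc.get_supports_cpu_shared_memory()) {
        return false;
    }
    // False means the default could not be overridden, for example because
    // allocations were already made on the accelerator.
    return acc.set_default_cpu_access_type(requested);
}

// Stand-ins used only to exercise the helper without a GPU or <amp.h>.
enum fake_access_type { fake_none, fake_read_write };

struct fake_accelerator {
    explicit fake_accelerator(bool shared) : shared_memory(shared), current(fake_none) {}

    bool get_supports_cpu_shared_memory() const { return shared_memory; }
    fake_access_type get_default_cpu_access_type() const { return current; }

    bool set_default_cpu_access_type(fake_access_type t) {
        // The real runtime throws runtime_exception here; the stand-in asserts.
        assert(shared_memory || t == fake_none);
        current = t;
        return true;
    }

    bool shared_memory;
    fake_access_type current;
};
```

With Visual Studio 2013 and <amp.h>, the intended call would be along the lines of try_set_default_cpu_access(acc, concurrency::access_type_read_write) for a concurrency::accelerator named acc, made at startup before any array or array_view is used on that accelerator.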
The recommended usage of this feature is for an application to set the appropriate default CPU access_type for memory allocations underlying array/array_view objects on the target accelerator during startup. Whether shared memory is beneficial on a specific accelerator, and if so which CPU access_type is appropriate for an allocation, depends on both the CPU/GPU memory access pattern of the application and the CPU/GPU memory access performance characteristics of shared memory allocations on that accelerator. These performance characteristics can be profiled using micro-benchmarks. Such profiling should be performed either offline (for example, during application installation) or the first time the application runs on a system. The results should then be cached and used in subsequent runs to set the accelerator’s default CPU access_type for array/array_view allocations, so that the time-consuming profiling step is not repeated on every run.
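The profile-once-and-cache recommendation can be sketched as follows. Everything here is an assumption made for illustration: the function names cached_access_choice and clear_cached_choice, the one-line text cache file, and the idea of storing the chosen access type as a string token such as "read_write". The micro-benchmark itself is not shown; it is passed in as a callback.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>

// Hypothetical helper: returns the cached choice stored in cache_path, or runs
// the (slow) profiling callback once and stores its answer for later runs.
std::string cached_access_choice(const std::string& cache_path,
                                 const std::function<std::string()>& profile) {
    {
        std::ifstream in(cache_path);
        std::string cached;
        if (in && std::getline(in, cached) && !cached.empty()) {
            return cached;  // A previous run already profiled this system.
        }
    }
    const std::string choice = profile();
    assert(!choice.empty());  // The callback must produce a usable token.
    std::ofstream out(cache_path, std::ios::trunc);
    out << choice << '\n';
    return choice;
}

// Hypothetical helper: discards the cached result so the next run profiles
// again, for example after a hardware or driver change.
bool clear_cached_choice(const std::string& cache_path) {
    return std::remove(cache_path.c_str()) == 0;
}
```

At startup, the application would map the returned token to an access_type value and pass it to set_default_cpu_access_type before making any allocations on the accelerator.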
In closing
In this post we looked at the introductory concepts pertaining to shared memory support in C++ AMP. Subsequent posts in this series will dive deeper into the functional and performance aspects of shared memory use in the C++ AMP runtime. Stay tuned!
I would love to hear your feedback, comments and questions below or in our MSDN forum.
Comments
Anonymous
July 09, 2013
Why is it that the AMD APUs are not listed as automatically making use of shared memory by default?
Anonymous
July 09, 2013
Do we have shared memory support for textures as well?
Anonymous
July 09, 2013
LKeene: I'd hazard a guess that it is due to the unequal performance that AMD APUs have when dealing with shared memory, based on the intended access pattern, due to the fact that they have separate buses as opposed to a single, homogeneous fabric (the L3 in Intel's case). Slide 35 from the following presentation by Boudier and Sellers is particularly telling: amddevcentral.com/.../1004_final.pdf. The same constraints pretty much apply to Trinity (the newer APUs) and, as seen in the above reference, makes it so that APU performance with shared memory is strictly inferior than without, thus violating one of the requirements Amit mentioned. One would hope that the upcoming Kaveri APUs will ease these constraints.