Integrating PDW into an Existing SQL Server Landscape

Parallel Data Warehouse (PDW) is Microsoft's true Data Warehouse appliance, capable of unparalleled performance with the support and backing of Microsoft and the hardware vendor.  PDW is a scalable appliance ranging from a 2 rack 'entry level' solution to a 10 rack scaled solution and the 'entry level' appliance is deemed a half rack solution due to the fact that although 2 racks are deployed (minimum PDW footprint is always 2 racks), only half of each rack is populated with hardware.

PDW is an appliance and therefore there are some 'constraints' when working with a 'black box' solution.  That being said, PDW is not 'just' a black box; it's optimized hardware which is designed to run a specific workload, that of a Data Warehouse.  It's not the answer to everyone's prayers, however it is the answer to the challenges and expectations of a modern business with modern data analysis requirements.  So if this is your prayer then I'm pleased to say that Microsoft has listened ... and continues to listen. 

PDW is an ongoing evolution of taking and proving customer workloads, introducing new feature development, tight SQL Server integration (v-Next) as well as re-engineering of the original product and of course, lab & customer testing; all in the name of driving more performance from the appliance without the overhead of tedious and cost-hungry development and testing investment from the customer.

So how can PDW integrate into 'my' existing SQL Server landscape?

Ultimately, SQL Server PDW is still a SQL Server product and therefore existing integration approaches, toolsets and API's are available for such integration requirements.

This sounds so easy, it must be difficult, right?  Actually, no, integration of PDW could not be simpler.  Whether PDW is required to integrate into a Microsoft landscape, a mixed technology landscape or a foreign technology landscape, SQL Server PDW will most certainly integrate.

The simplest form of integration is where PDW exports data to a landing zone for 'absorption' by the customers business.  Consider a file share with one or more CSV files.  The data is in a format which is easily absorbed by any application, and the data is readily available in it's RAW format.  But hey, this isn't integration, this is data export, right?  Actually it's both.  PDW although a SQL Server product, is still infantile in it's broad capability (it is a Data Warehousing powerhouse, it is not a landing area for 'anything SQL Server') and therefore we must accept that a MPP (Massively Parallel Processing) 'powerhouse' when integrating into a SMP (SyMetric Processing) environment, PDW going to be bottlenecked severely.  Why?  PDW will 'run' at 40-50 times faster than a regular SQL Server deployment.  So by exporting data to a CSV file for example, we allow PDW to integrate at it's most basic level - at the data level.

Not impressed by CSV files eh?  Me neither, so let's look at integration using a model which is familiar to SQL Server; the Linked Server.  In short, a linked server is a remote SQL Server instance which is configured as a queryable endpoint within the 'source' SQL Server\PDW instance.  In order to reference a linked server, the servername should be used in the object T-SQL syntax:  [Server Name].[Database Name].[Owner].[Table\View].[Column].  It is with this syntax that PDW can 'talk SQL' to a remote SQL Server instance and perform the required level of integration.  Whether PDW is 'pushing' data to a SQL Instance, or PDW is 'pulling' data from a SQL instance, the integration is performed using native SQL Server syntax and technology, and all without having to learn new capabilities or skills if the user\developer is "S3" (SQL Server Savvy).

Let me explain, or try to.... Imagine PDW being the water tank in your loft\attic which is filled with water.  Now add 'regular' SQL Server into the 'plumbing' and in this case, 'regular' SQL Server being the glass you fill up with water.  "What size is the glass?" I hear someone ask.  OK, the glass is capable of holding 300ml of water.
The pipework and taps are the PDW infrastructure and without it, there would be no water to put into your glass.  As you may have guessed, the water is 'the data'.

The 'glass' can only hold so much water before it spills out and gets everything wet.  Imagine 'PDW' being a 200 gallon tank filled to the brim (that's a lot of water!) and now try and visualise 200 gallons of water pouring into the 300ml glass.  300ml of water would fit; 199.9 gallons would either be spilt, or would require extra glasses to fill. 

The spillage is the lost capability and capacity of performing the required data analysis.  The speed at which the water fills the glass is the IO capability (in this case, Output). 
The extra glasses are the overhead of managing the flow of data from PDW to an SMP enviornment such as SQL Server 2008 R2 or SQL Server 2012.

How do you manage any over-spill? How do you clean up any spilt water? If the water is on the floor, is there a potential for the water to be contaminated? If our water is data - do we suffer with contaminated water, or rather damaged data? If we manage the flow of data from PDW to another platform, what overhead are we introducing into the existing infrastructure and destination platform? PDW is a powerhouse and as designed, PDW will do everything it can in the shortest parallel timeframe possible.

A small pipe will cause the water to flow very slowly, where as a large pipe will cause the water to flow at a much higher rate. A slow filling glass is much easier to manage than a glass with water pouring into it, but sometimes patience is not a virtue and we need our data to be available to other business areas. So we perform parallel export.

Parallel Export techniques are new to the SMP world, purely because SMP can sequence tasks to run in order, however only MPP can execute tasks in a pure parallel manner, allowing the workload of the operation to be performed by multiple compute nodes, rather than by a single shared proessor kernel. 

A superb article by Matthew Winter (SQL Server PDW Centre Of Excellence) on Parallel Data Export is available at the following URL:

http://social.technet.microsoft.com/wiki/contents/articles/12119.pdw-data-export-strategies.aspx

PDW is configured with multiple Infiniband networks allowing for ~20GB\Sec transfer rates within the appliance.  Regular SQL Server would be able to handle the data and the flow control, however the process would be bottlenecked by 'regular' SQL Server, purely because it is very likely that the 'regular' SQL Server instance will be connected to the 'regular' business infrastructure.

Driving performance and capability is key to a successful PDW deployment and we must remain mindful of this when architecting solutions with multiple flavours of 'SQL Server' and business 'BI and Analytics', especially when PDW is involved.  We can even include FastTrack DW solution in this statement too; ultimately FastTrack is a 'DW powerhouse' in it's own right, except that the workload scale is smaller and the hardware is commodity hardware which is tuned and tweaked to get every ounce of performance out of it as possible (tweaking performed by the clever guys who make the hardware).

For Integration to be successful thought must be applied to the (1) business requirement, (2) the technology with which we are integrating into and (3) the 'end to end' solution.