Azure SQL Data Warehouse loading patterns and strategies

Authors: John Hoang, Joe Sack and Martin Lee

Abstract

This article provides an overview of the Microsoft Azure SQL Data Warehouse architecture. This platform-as-a-service (PaaS) offering provides independent compute and storage scaling on demand. This document provides data loading guidelines for SQL Data Warehouse. Several common loading options are briefly described, but the main focus is the PolyBase technology, the preferred and fastest loading method for ingesting data into SQL Data Warehouse. See also What is Azure SQL Data Warehouse?

Introduction

Whether you are building a data mart or a data warehouse, the three fundamentals you must implement are an extraction process, a transformation process, and a loading process—also known as extract, transform, and load (ETL). When working with smaller workloads, the general rule from the perspective of performance and scalability is to perform transformations before loading the data. In the era of big data, however, as data sizes and volumes continue to increase, processes may encounter bottlenecks from difficult-to-scale integration and transformation layers.

As workloads grow, the design paradigm is shifting. Transformations are moving to the compute resource, and workloads are distributed across multiple compute resources. In the distributed world, we call this massively parallel processing (MPP), and the order of these processes differs. You may hear it described as ELT: you extract, load, and then transform, as opposed to the traditional ETL order. The reason for this change is today's highly scalable parallel computing power, which puts multiple compute resources at your disposal, such as CPU (cores), RAM, networking, and storage, and lets you distribute a workload across them.

With SQL Data Warehouse, you can scale out your compute resources as you need them on demand to maximize power and performance of your heavier workload processes.

However, we still need to load the data before we can transform. In this article, we'll explore several loading techniques that help you reach maximum data-loading throughput and identify the scenarios that best suit each of these techniques.

Architecture

SQL Data Warehouse uses the same logical component architecture for the MPP system as the Microsoft Analytics Platform System (APS). APS is the on-premises MPP appliance previously known as the Parallel Data Warehouse (PDW).

As you can see in the diagram below, SQL Data Warehouse has two types of components, a Control node and a Compute node:

Figure 1. Control node and Compute nodes in the SQL Data Warehouse logical architecture

image

The Control node is the brain and orchestrator of the MPP engine. We connect to this area when using SQL Data Warehouse to manage and query data. When you send a SQL query to SQL Data Warehouse, the Control node processes that query and converts it to what we call a DSQL (Distributed SQL) plan, using its cost-based optimization engine. After the DSQL plan has been generated, the Control node sends the command for each step to run on each of the compute resources.

The Compute nodes are the worker nodes. They run the commands given to them from the Control node. Compute usage is measured using SQL Data Warehouse Units (DWUs). A DWU, similar to the Azure SQL Database DTU, represents the power of the database engine as a blended measure of CPU, memory, and read and write rates. The smallest compute resource (DWU 100) consists of the Control node and one Compute node. As you scale out your compute resources (by adding DWUs), you increase the number of Compute nodes.

Within the Control node and in each of the Compute resources, the Data Movement Service (DMS) component handles the movement of data between nodes—whether between the Compute nodes themselves or from Compute nodes to the Control node.

DMS also includes the PolyBase technology. An HDFS bridge is implemented within the DMS to communicate with the HDFS file system. PolyBase for SQL Data Warehouse currently supports Microsoft Azure Storage Blob and Microsoft Azure Data Lake Store.

Network and data locality

The first considerations for loading data are source-data locality and network bandwidth, utilization, and predictability of the path to the SQL Data Warehouse destination. Depending on where the data originates, network bandwidth will play a major part in your loading performance. For source data residing on your premises, network throughput performance and predictability can be enhanced with a service such as Azure ExpressRoute. Otherwise, you must consider the current average bandwidth, utilization, predictability, and maximum capabilities of your current public Internet-facing, source-to-destination route.

Note ExpressRoute routes your data through a dedicated connection to Azure without passing through the public Internet. ExpressRoute connections offer more reliability, faster speeds, lower latencies, and higher security than typical Internet connections. For more information, see ExpressRoute.

Using PolyBase for SQL Data Warehouse loads

SQL Data Warehouse supports many loading methods, including non-PolyBase options (BCP and the SQLBulkCopy API) and PolyBase options (CTAS/INSERT, PolyBase with SSIS, Azure Data Factory (ADF), and third-party tools including Azure Databricks, Attunity CloudBeam, Striim, Informatica, and Talend).

PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store, and Hadoop ecosystems (Hadoop sources are supported in APS only).

Note As a general rule, we recommend making PolyBase your first choice for loading data into SQL Data Warehouse unless you can’t accommodate PolyBase-supported file formats. Currently, PolyBase can load data from UTF-8 and UTF-16 encoded delimited text files as well as the popular Hadoop file formats RCFile, ORC, and Parquet (non-nested format). PolyBase can also load data from gzip, zlib, and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-width file formats, WinZip, or semi-structured data such as Parquet (nested/hierarchical), JSON, and XML. A popular pattern for loading semi-structured data is to use Azure Databricks or, similarly, HDInsight/Spark to load the data, flatten/transform it to a supported format, and then load it into SQL Data Warehouse.
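As a rough illustration of the flatten-then-load pattern described above, the sketch below shows the core transformation in plain Python rather than Databricks/Spark (the record shape and column names are hypothetical): nested JSON records are flattened into the pipe-delimited text that PolyBase can ingest.

```python
def flatten(record, prefix=""):
    """Recursively flatten a nested dict into a single-level dict,
    joining nested keys with underscores (e.g. customer.name -> customer_name)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

def to_pipe_delimited(records, columns):
    """Render flattened records as pipe-delimited lines for a PolyBase load."""
    lines = []
    for record in records:
        flat = flatten(record)
        lines.append("|".join(str(flat.get(col, "")) for col in columns))
    return lines

# Example: a nested order record becomes one delimited row.
rows = to_pipe_delimited(
    [{"orderkey": 1, "customer": {"custkey": 42, "name": "ACME"}}],
    ["orderkey", "customer_custkey", "customer_name"],
)
```

In a real pipeline, Databricks or HDInsight/Spark would apply the same flattening at scale before writing the delimited files to Azure Blob Storage.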

As the following architecture diagrams show, each HDFS bridge of the DMS service from every Compute node can connect to an external resource such as Azure Blob Storage, and then bidirectionally transfer data between SQL Data Warehouse and the external resource.

Note As of this writing, SQL Data Warehouse supports Azure Blob Storage and Azure Data Lake Store as the external data sources.

Figure 2. Data transfers between SQL Data Warehouse and an external resource

image

PolyBase data loading is not limited by the Control node, and so as you scale out your DWU, your data transfer throughput also increases. By mapping the external files as external tables in SQL Data Warehouse, the data files can be accessed using standard Transact-SQL commands—that is, the external tables can be referenced as standard tables in your Transact-SQL queries.
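For example, an external table can be queried, filtered, and aggregated like any standard table (a sketch using the orders_ext external table defined later in this article; the predicate is illustrative):

```sql
-- External files behave like a table in standard Transact-SQL:
SELECT o_orderpriority, COUNT(*) AS order_count
FROM orders_ext                 -- external table over files in Azure Blob Storage
WHERE o_orderdate >= '1995-01-01'
GROUP BY o_orderpriority;
```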

Copying data into storage

The general load process begins with migrating your data into Azure Blob Storage. Depending on your network’s capabilities, reliability, and utilization, you can use AzCopy to upload your source data files to Azure Blob Storage with an upload rate from 80 MB/second to 120 MB/second.

Then, in SQL Data Warehouse, you configure your credentials that will be used to access Azure Blob Storage:

CREATE DATABASE SCOPED CREDENTIAL myid_credential WITH IDENTITY = 'myid', Secret='mysecretkey';

 

Next you define the external Azure Blob Storage data source with the previously created credential:

CREATE EXTERNAL DATA SOURCE data_1tb WITH (TYPE = HADOOP, LOCATION = 'wasbs://data_1tb@myid.blob.core.windows.net', CREDENTIAL= myid_credential);

 

And for the source data, define the file format and external table definition:

CREATE EXTERNAL FILE FORMAT pipedelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS(
          FIELD_TERMINATOR = '|',
          STRING_DELIMITER = '',
          DATE_FORMAT = '',
          USE_TYPE_DEFAULT = False)
);

CREATE EXTERNAL TABLE orders_ext (
    o_orderkey bigint NULL,
    o_custkey bigint NULL,
    o_orderstatus char(1),
    o_totalprice decimal(15, 2) NULL,
    o_orderdate date NULL,
    o_orderpriority char(15),
    o_clerk char(15),
    o_shippriority int NULL,
    o_comment varchar(79)
)
WITH (LOCATION='/orders',
      DATA_SOURCE = data_1tb,
      FILE_FORMAT = pipedelimited,
      REJECT_TYPE = VALUE,
      REJECT_VALUE = 0
);

 

For more information about PolyBase, see SQL Data Warehouse documentation.

Using CTAS to load initial data

Then you can use a CTAS (CREATE TABLE AS SELECT) operation within SQL Data Warehouse to load the data from Azure Blob Storage to SQL Data Warehouse:

CREATE TABLE orders_load
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(o_orderkey),
      PARTITION (o_orderdate RANGE RIGHT FOR VALUES ('1992-01-01','1993-01-01','1994-01-01','1995-01-01')))
AS SELECT * FROM orders_ext;

 

CTAS creates a new table. We recommend using CTAS for the initial data load. This is an all-or-nothing operation with minimal logging.

Using INSERT INTO to load incremental data

For an incremental load, use the INSERT INTO operation. Inserting into a populated partition is fully logged, which impacts load performance. Furthermore, rolling back a large transaction can be expensive. Consider breaking your transaction into smaller batches.

INSERT INTO orders_load
SELECT * FROM orders_current_ext;

Note The source uses a different external table, orders_current_ext. This is the external table that defines the path for the incremental data in Azure Blob Storage.

Another popular pattern is to load into a partition-aligned staging table via CTAS and then switch the partition into the final table.
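The stage-and-switch pattern can be sketched as follows (a hedged example; the table names, partition number, and boundary values are hypothetical and must match your final table exactly):

```sql
-- 1. CTAS the incremental data into a staging table with the SAME
--    distribution, index, and partition scheme as the final table.
CREATE TABLE orders_stage
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(o_orderkey),
      PARTITION (o_orderdate RANGE RIGHT FOR VALUES ('1992-01-01','1993-01-01','1994-01-01','1995-01-01')))
AS SELECT * FROM orders_current_ext;

-- 2. Switch the loaded partition into the final table. This is a
--    metadata-only operation; the target partition must be empty.
ALTER TABLE orders_stage SWITCH PARTITION 2 TO orders_load PARTITION 2;
```

Because the switch is metadata-only, the final table becomes visible with the new data almost instantly, and the heavy load work happens entirely against the staging table.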

Data reader and writer considerations

The number of external readers and writers varies depending on the following factors:

  • Gen1 or Gen2 instance
  • SLO or size of the instance (DWU)
  • Type of operation (DMS query or PolyBase load)
  • Type of file being loaded (Parquet, text, etc.)
  • Concurrency at the time the operation was submitted (readers/writers auto-adjust dynamically)

As illustrated in Table 1 below, each DWU has a specific number of readers and writers. As you scale out, each node gets an additional number of readers and writers. The static and dynamic resource classes also vary in their number of readers and writers. Note that Parquet files typically have half the number of readers compared to non-Parquet files. The number of readers and writers is an important factor in determining your load performance.

Table 1. Number of readers and writers for Gen 1 SQL DW xlargerc resource class

image

Table 2. Number of readers and writers for Gen 2 SQL DW xlargerc resource class

image

To check the number of readers/writers, use the following query (adjust the request_id and step_index as appropriate). For more information, see Monitoring your workload using DMVs.

SELECT type, count(*)
FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QIDXXXX' AND step_index = XX
GROUP BY type;

Best practices and considerations when using PolyBase

Here are a few more things to consider when using PolyBase for SQL Data Warehouse loads:

  • A single PolyBase load operation provides the best performance.
  • Load performance scales as you increase DWUs.
  • PolyBase automatically parallelizes the data load process, so you don’t need to explicitly break the input data into multiple files and issue concurrent loads, unlike some traditional loading practices. Each reader automatically reads 512 MB per file on Azure Blob Storage and 256 MB per file on Azure Data Lake Store.
  • Multiple readers will not work against gzip files. Only a single reader is used per gzip compressed file, since decompressing the file in the buffer is single threaded. As an alternative, generate multiple gzip files. The number of files should be greater than or equal to the total number of readers.
  • Multiple readers will work against compressed columnar/block format files (for example, ORC and RCFile), since individual blocks are compressed independently.
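To make the gzip guidance above concrete, here is a minimal sketch in plain Python (the file names are hypothetical, and the reader count is an assumption you would take from Table 1 for your DWU and resource class) that splits one large delimited file into N independently compressed gzip files, so PolyBase can assign one reader per file:

```python
import gzip
import os

def split_into_gzip_files(src_path, out_dir, num_files):
    """Split a delimited text file into num_files gzip files, round-robin
    by row, so that PolyBase can use one reader per compressed file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"part_{i:03d}.gz") for i in range(num_files)]
    outputs = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line_no, line in enumerate(src):
                # Distribute rows evenly across the output files.
                outputs[line_no % num_files].write(line)
    finally:
        for f in outputs:
            f.close()
    return paths
```

With, say, 8 readers available, you would split the source into 8 (or a multiple of 8) gzip files so every reader stays busy.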

Known issues when working with different file formats

In addition to the UTF-8/UTF-16 encoding considerations, other known file format issues can arise when using PolyBase.

Mixed intra-file date formats

In a CREATE EXTERNAL FILE FORMAT command, the DATE_FORMAT argument specifies a single format to use for all date and time data in a delimited text file. If the DATE_FORMAT argument isn’t designated, the following default formats are used:

  • DateTime: 'yyyy-MM-dd HH:mm:ss'
  • SmallDateTime: 'yyyy-MM-dd HH:mm'
  • Date: 'yyyy-MM-dd'
  • DateTime2: 'yyyy-MM-dd HH:mm:ss'
  • DateTimeOffset: 'yyyy-MM-dd HH:mm:ss'
  • Time: 'HH:mm:ss'

For source formats that don’t reflect the defaults, you must explicitly specify a custom date format. However, if multiple non-default formats are used within one file, there is currently no method for specifying multiple custom date formats within the PolyBase command.
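One workaround, since only a single DATE_FORMAT can be specified per file format, is to normalize all dates to one format during extraction, before the files reach Azure Blob Storage. A minimal sketch in plain Python (the candidate source formats listed are assumptions about your data, not anything PolyBase prescribes):

```python
from datetime import datetime

# Candidate source formats, tried in order (assumed for illustration).
SOURCE_FORMATS = ["%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d"]

def normalize_date(value):
    """Rewrite a date string to PolyBase's default 'yyyy-MM-dd' format."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def normalize_row(row, date_columns):
    """Normalize the given date columns of one pipe-delimited row."""
    fields = row.rstrip("\n").split("|")
    for i in date_columns:
        fields[i] = normalize_date(fields[i])
    return "|".join(fields)
```

After this preprocessing, the external file format needs no DATE_FORMAT argument at all, since every date matches the default.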

Fixed-length file format not supported

Fixed-length character file formats—for example, where each column has a fixed width of 10 characters—are not supported today.

If you encounter these PolyBase restrictions, consider changing the data extract process to address the limitations, for example by formatting dates in a PolyBase-supported format or transforming JSON files to text files. If that is not possible, your option is to use one of the methods in the next section.

Using Control-node and single-client gated load methods

In the Architecture section we mentioned that all incoming connections go through the Control node. Although you can increase and decrease the number of compute resources, there is only a single Control node. And as mentioned earlier, one reason why PolyBase provides a superior load rate is that PolyBase data transfer is not limited by the Control node. But if using PolyBase is not currently an option, the following technologies and methods can be used for loading into SQL Data Warehouse:

  • BCP 
  • SQLBulkCopy 

For these load methods, the bottleneck is on the client machine and the single Control node. Each load uses a single core on the client machine and only accesses the single Control node. Therefore, the load does not scale if you increase DWUs for an SQL Data Warehouse instance.

Note You can, however, increase load throughput if you add parallel loads into either the same table or different tables.

Using multiple client concurrent executions should improve your load throughput - to a point. The number of parallel loads no longer improves your throughput when the maximum capacity of the Control node is reached.

Best practices and considerations for single-client gated load methods

Consider the following when using BCP, SQLBulkCopy or other Control-node and client-gated loading methods:

  • Issue multiple threads into different tables to improve throughput. SQL Data Warehouse does not support loading multiple threads into the same table, since a load requires an exclusive lock.
  • Include retry logic.
  • Don’t specify the batch size with Control-node gated methods. The goal is to load all or nothing so that the retry logic can simply restart the load. If you designate a batch size and the load fails (for example, the network or database is not available), you may need to add more logic to restart from the last successful commit.
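The retry and all-or-nothing guidance above can be sketched as follows (plain Python; bcp_load is a hypothetical stand-in for whatever single-table load you invoke, such as a bcp subprocess, and is not a real API):

```python
import time

def load_with_retry(bcp_load, table, source_file, max_attempts=3, backoff_seconds=5):
    """All-or-nothing load with retry. Because no batch size is specified,
    a failed attempt leaves nothing committed, so the load can simply be
    rerun from the start instead of resuming from a partial commit."""
    for attempt in range(1, max_attempts + 1):
        try:
            bcp_load(table, source_file)  # loads the whole file in one operation
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
```

Each thread of a multi-table parallel load would call load_with_retry against its own target table.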

Comparing load method performance characteristics

The following table details the results for PolyBase options and non-PolyBase options BCP and SQLBulkCopy.

Table 3. SQL Data Warehouse performance testing results

image

As you can see, the PolyBase methods are significantly faster than the BCP and SQLBulkCopy Control-node client-gated load methods. If PolyBase is not an option, however, BCP provides the next best load rate.

Regarding increased throughput as you add more threads (the third row in the table), keep in mind that SQL Data Warehouse supports up to 128 concurrent queries/loads. It is unlikely that you can load 128 threads in parallel, as resource contention will develop. Take the time to balance your load throughput and concurrency for the best mix to meet your ingestion SLA and user query experience. For more information about concurrency, see Concurrency and workload management in SQL Data Warehouse.
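One way to balance throughput against the concurrency limit noted above is to cap the number of in-flight loads with a worker pool (a sketch in plain Python; run_load is a hypothetical stand-in for your per-table load routine, and the cap of 8 is an illustrative choice, not a recommendation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel_loads(run_load, jobs, max_concurrent=8):
    """Run (table, source_path) load jobs in parallel, capped well below
    the 128 concurrent query/load limit so that user queries and other
    loads are not starved of concurrency slots."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(run_load, table, path) for table, path in jobs]
        return [f.result() for f in futures]  # re-raises any load failure
```

Tuning max_concurrent up or down is the practical knob for trading load throughput against query responsiveness.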

Conclusion

SQL Data Warehouse provides many options to load data, as we discussed in this article. Each method has its own advantages and disadvantages. It’s easy to “lift and shift” your existing SSIS packages, BCP scripts, and other Control-node client-gated methods to minimize migration effort. However, if you require higher speeds for data ingestion, consider rewriting your processes to take advantage of PolyBase, with its high-throughput, highly scalable loading methodology.

Comments

  • Anonymous
    August 19, 2016
    Very nice document and different approaches are explained very well. Thanks.
    One question related to PolyBase. Is there a way to identify when a T-SQL statement is running whether it's using PolyBase or not? I have a statement: Insert into SQLDW_Internal_Table select * from SQLDW_External_Table; Looking at the above article, I feel it's using PolyBase and it should be fast to load the data. But is there a way for me to validate in any DMVs? The reason I ask is that one of my data loads, with 23GB of compressed data, is taking more than 11+ hours, so I am trying to understand what is going wrong. (I am on 300DWU, largerc resource class, and DWU usage in the Azure portal says <50 DWU.)
    • Anonymous
      August 30, 2016
      With PolyBase, throughput is improved by having your data split across multiple files, and within those files, benefitting from the file splitting that PolyBase will do on its own. In the case of your compressed file, it is likely that you're not benefitting from the file splitting, since this cannot be done with compressed files. However, you can mitigate this issue by splitting your compressed data into a number of files that is a multiple of the number of readers provided by the resource class you're using, i.e. if you're using DW100 with 8 readers, you'll probably want to split your 23GB into some multiple of 8. This very issue is discussed in the video at this link, starting at around 30 minutes: https://azure.microsoft.com/en-us/documentation/videos/build-2016-a-developers-guide-to-azure-sql-data-warehouse/ Hope this helps.
  • Anonymous
    September 11, 2016
    What if I used the following combination to load data into Azure SQL DW:
    1) Import data from text files into Azure Blob Storage using SSIS (instead of AzCopy)
    2) Use PolyBase to load data from Azure Blob Storage to Azure SQL DW
    Would I still be able to get a fast load and be able to scale if I increased the DWU? The advantage here being that everything could be stored in a package.
    • Anonymous
      October 05, 2016
      JSAD - Yes, your load rate will increase, on step 2 (Use Polybase to load data from Azure Blob Storage to Azure SQL DW), as you increase DWU.
      • Anonymous
        December 26, 2017
        Bloodlive, it is mentioned earlier in the post that PolyBase is compatible with T-SQL. Technically, I should be able to achieve the same performance using an SSIS SQL task and pass it my T-SQL/PolyBase statements, right?
  • Anonymous
    November 23, 2016
    This is by far and away the best post on SQL DW as it covers everything from overview to ingestion as well as providing a context and pros/cons. I hope you write more blog posts and articles in the future!!
  • Anonymous
    November 23, 2016
    Is there any update on when Polybase will support JSON from Blob Storage? Is this one of those things that is on the 'to-do' list or is there something fundamentally different in the architecture that makes this difficult or impossible?
    • Anonymous
      November 28, 2016
      It's on the list of things to do but not sure when it will be supported.
  • Anonymous
    December 12, 2016
    Thanks for this article, it is very clear and well written. I'm wondering if you could expand a bit more on loading incremental data. Your INSERT INTO expression loads the incremental data from a different external table than the one used to create it initially. If you were to load new incremental data every night, would you continuously create new external tables for each day or have some other mechanism?
    Thanks,
    Scott
    • Anonymous
      December 19, 2016
      Scott, Yes you would create external tables at a lower directory level. Let's say your data is laid out into /Customer/yyyy/mm/dd/ folder in Azure Storage. You can create external tables at the /dd/ level for loading. A simple stored procedure can work in this case.
      • Anonymous
        July 27, 2017
        Are you able to give us a link or demonstrate how that would be done? Thanks!
  • Anonymous
    December 14, 2016
    Do I HAVE to load the table into Azure SQL DW using CTAS? Can I just create the external table and issue my aggregation SQL queries directly on the external table? Will the performance suffer in that case?
    • Anonymous
      December 19, 2016
      Abhay: Yes, for faster performance you should make it an internal table in SQL DW.
  • Anonymous
    December 29, 2016
    Hi, it was nice reading the article. These particular words in the Conclusion section caught my eye: "It's easy to 'lift and shift' your existing SSIS packages." I would like to validate my understanding for the below scenario: if there are SSIS packages that load data from source "table 1" to target data warehouse "table 2" (on-premises), and I migrated my on-premises data warehouse to Azure SQL Data Warehouse, would just changing the target connection in the SSIS packages from on-premises to SQL Data Warehouse suffice? Or is there anything more to this?
    Regards,
    KK
  • Anonymous
    January 23, 2017
    It's a great article to start the data load process to SQL DW. Thanks for sharing useful information. I would like to know the best ways to validate data loaded using PolyBase. Suppose I have 45 files containing 500M records. I would like to check whether all records from each file loaded into the table successfully or not. Any thoughts on this?
    Thanks in advance
  • Anonymous
    July 17, 2018
    I understood the concept of loading data from Azure Blob to Azure SQL DW using PolyBase. But how can I incrementally load data from Azure Blob to Azure SQL DW? I have a requirement to incrementally load data in SQL DW from Blob. Can you please respond ASAP.
    • Anonymous
      July 18, 2018
      Hi Vamsi, there are several options for incremental load, depending on your scenario.
      For dimension tables:
      1. Flush and fill - Truncate final table -> INSERT INTO final table
      2. Append only - INSERT into final table
      3. ETL - CTAS into stage -> DEL/UPDATE/INSERT into final table
      For FACT tables:
      1. Append only - INSERT into final table
      2. ETL - CTAS into stage -> DEL/UPDATE/INSERT into final table
      3. CTAS into stage, partition switch into final table
      For real-time sync and built-in CDC, you may want to take a look at Attunity CloudBeam for Azure SQL Data Warehouse.
  • Anonymous
    December 12, 2018
    In your "Readers and Writers" table for Gen2 the numbers for Parquet and ORC/CSV are swapped. So when using a 1000c to read Parquet I see 40, not 64 readers. Other than that this is a fantastic reference.