SQL Server database
From: Developing big data solutions on Microsoft Azure HDInsight
Many organizations already use tools and services that are part of the comprehensive Microsoft data platform to perform data analysis and reporting, and to implement enterprise BI solutions. For many of these tools, services, and reporting applications it is easier to query and analyze data in a relational database than to consume the data directly from HDInsight. For example, multidimensional SSAS data models can only be based on relational data sources.
The wide-ranging support for working with data in a relational database means that, in many scenarios, the optimal way to perform big data analysis is to process the data in HDInsight but consume the results through a relational database system such as SQL Server or Azure SQL Database. You can accomplish this by enabling the relational database to act as a proxy or interface that transparently accesses the data in HDInsight on behalf of its client applications, or by transferring the results of HDInsight data processing to tables in the relational database.
Linked servers
Linked servers are server-level connection definitions in a SQL Server instance that enable queries in the local SQL Server engine to reference tables in remote servers. You can use the ODBC driver for Hive to create a linked server in a SQL Server instance that references an HDInsight cluster, enabling you to execute Transact-SQL queries that reference Hive tables.
To create a linked server you can either use the graphical tools in SQL Server Management Studio or the sp_addlinkedserver system stored procedure, as shown in the following code.
EXEC master.dbo.sp_addlinkedserver
    @server = N'HDINSIGHT', @srvproduct = N'Hive',
    @provider = N'MSDASQL', @datasrc = N'HiveDSN',
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=UserName;Password=P@ssw0rd;';
After you have defined the linked server you can use the Transact-SQL OpenQuery function to execute pass-through queries against the Hive tables in the HDInsight data source, as shown in the following code.
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');
Note
Using a four-part distributed query instead of the OpenQuery function is not always a good idea, because the syntax of HiveQL differs from Transact-SQL in several ways.
By using a linked server you can create views in a SQL Server database that act as pass-through queries against Hive tables, as shown in Figure 1. These views can then be queried by analytical tools that connect to the SQL Server database.
Figure 1 - Using HDInsight as a linked server over ODBC
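For example, the following Transact-SQL creates such a view over the pass-through query shown earlier. The view name used here is a hypothetical placeholder.

CREATE VIEW dbo.vObservations
AS
-- Pass-through query that is executed on the HDInsight cluster via the linked server.
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');

Analytical tools can then query dbo.vObservations like any other SQL Server view, without needing to know that the underlying data resides in a Hive table.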
Note
Be aware of some issues when using a linked server, such as data type compatibility between HiveQL and SQL, and some language syntax limitations. The issues and the supported data types are described in the blog post How to create a SQL Server Linked Server to HDInsight HIVE using Microsoft Hive ODBC Driver.
The following table describes specific considerations for using linked servers in the HDInsight use cases and models described in this guide.
| Use case | Considerations |
|---|---|
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or simply analyzing the data in Excel. |
| Data warehouse on demand | Depending on the volume of data in the data warehouse, and the frequency of queries against the Hive tables, using a linked server with a Hive-based data warehouse might make it easier to support a wide range of client applications. A linked server is a suitable solution for populating data models on a regular basis when they are processed during out-of-hours periods, or for periodically refreshing cached datasets for Reporting Services. However, the performance of pass-through queries over an ODBC connection may not be sufficient to meet your users' expectations for interactive querying and reporting directly in client applications such as Excel. |
| ETL automation | Generally, the target of the ETL process is a relational database, making a linked server that references Hive tables unnecessary. |
| BI integration | If the ratio of Hive tables to data warehouse tables is small, or the Hive tables are queried relatively rarely, a linked server might be a suitable way to integrate data at the data warehouse level. However, if there are many Hive tables, or if the data in the Hive tables must be tightly integrated into a dimensional data warehouse schema, it may be more effective to transfer the data from HDInsight to local tables in the data warehouse. |
PolyBase
PolyBase is a data integration technology in the Microsoft Analytics Platform System (APS) that enables data in an HDInsight cluster to be queried as native tables in a relational data warehouse that is implemented in SQL Server Parallel Data Warehouse (PDW). SQL Server PDW is an edition of SQL Server that is only available pre-installed in an APS appliance, and it uses a massively parallel processing (MPP) architecture to implement highly scalable data warehouse solutions.
PolyBase enables parallel data movement between SQL Server and HDInsight, and supports standard Transact-SQL semantics such as GROUP BY and JOIN clauses that reference large volumes of data in HDInsight. This enables APS to provide an enterprise-scale data warehouse solution that combines relational data in data warehouse tables with data in an HDInsight cluster.
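As an illustration only, the following Transact-SQL sketch shows the general shape of this approach in SQL Server PDW: an external table that references delimited files in the HDInsight cluster, followed by a query that joins it to a local dimension table. All object names, columns, and the HDFS location are hypothetical, and the exact CREATE EXTERNAL TABLE syntax differs between APS appliance updates, so consult the documentation for your version.

-- Hypothetical external table over delimited files in the HDInsight cluster.
CREATE EXTERNAL TABLE dbo.SensorReadings
(
    SensorID    int,
    ReadingTime datetime,
    Reading     decimal(10,2)
)
WITH
(
    LOCATION = 'hdfs://<headnode>:8020/data/sensors',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Standard Transact-SQL semantics (JOIN, GROUP BY) spanning HDInsight data and local tables.
SELECT d.Region, AVG(r.Reading) AS AvgReading
FROM dbo.DimSensor AS d
JOIN dbo.SensorReadings AS r ON r.SensorID = d.SensorID
GROUP BY d.Region;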
The following table describes specific considerations for using PolyBase in the HDInsight use cases and models described in this guide.
| Use case | Considerations |
|---|---|
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, the requirement to use an APS appliance may be cost-prohibitive, unless such an appliance is already present in the organization. |
| Data warehouse on demand | If the volume of data and the number of query requests are extremely high, using an APS appliance as a data warehouse platform that includes HDInsight data through PolyBase might be the most cost-effective way to achieve the levels of performance and scalability your data warehousing solution requires. |
| ETL automation | Generally, the target of the ETL process is a relational database, making PolyBase integration with HDInsight unnecessary. |
| BI integration | If your enterprise BI solution already uses an APS appliance, or the combined scalability and performance requirements for enterprise BI and big data analysis are extremely high, the combination of SQL Server PDW with PolyBase in a single APS appliance might be a suitable solution. However, note that PolyBase does not inherently integrate HDInsight data into a dimensional data warehouse schema. If you need to include big data in dimension members that use surrogate keys, or you need to support slowly changing dimensions, some additional integration effort may be required. |
Sqoop
Sqoop is a Hadoop technology included in HDInsight. It is designed to make it easy to transfer data between Hadoop clusters and relational databases. You can use Sqoop to export data from HDInsight data files to SQL Server database tables by specifying the location of the data files to be exported, and a JDBC connection string for the target SQL Server instance. For example, you could run the following command on an HDInsight server to copy the data in the /hive/warehouse/observations path to the observations table in an Azure SQL Database named mydb located in a server named jkty65.
sqoop export
    --connect "jdbc:sqlserver://jkty65.database.windows.net:1433;database=mydb;user=username@jkty65;password=Pa$$w0rd;logintimeout=30;"
    --table observations
    --export-dir /hive/warehouse/observations
Sqoop is generally a good solution for transferring data from HDInsight to Azure SQL Database servers, or to instances of SQL Server that are hosted in virtual machines running in Azure, but it can present connectivity challenges when used with on-premises database servers. A key requirement is that network connectivity can be successfully established between the HDInsight cluster where the Sqoop command is executed and the target SQL Server instance. When used with HDInsight this means that the SQL Server instance must be accessible from the Azure service where the cluster is running, which may not be permitted by security policies in organizations where the target SQL Server instance is hosted in an on-premises data center.
Note
In many cases you can enable secure connectivity between virtual machines in Azure and on-premises servers by creating a virtual network in Azure. However, at the time of writing it was not possible to add the virtual machines in an HDInsight cluster to an Azure virtual network, so this approach cannot be used to enable Sqoop to communicate with an on-premises server hosting SQL Server without traversing the corporate firewall.
You can use Sqoop interactively from the Hadoop command line, or you can use one of the following techniques to initiate a Sqoop job:
- Create a Sqoop action in an Oozie workflow.
- Implement a PowerShell script that uses the New-AzureHDInsightSqoopJobDefinition and Start-AzureHDInsightJob cmdlets to run a Sqoop command (see the sketch after this list).
- Implement a custom application that uses the .NET SDK for HDInsight to submit a Sqoop job.
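As a minimal sketch of the PowerShell approach, the following script submits the export command shown earlier as a Sqoop job. The cluster name is a hypothetical placeholder, and the script assumes the Azure PowerShell module that provides these cmdlets is installed and configured for your subscription.

# Hypothetical cluster name; replace with the name of your HDInsight cluster.
$clusterName = "mycluster"

# Define the Sqoop job using the same export command shown earlier.
$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command 'export --connect "jdbc:sqlserver://jkty65.database.windows.net:1433;database=mydb;user=username@jkty65;password=Pa$$w0rd;logintimeout=30;" --table observations --export-dir /hive/warehouse/observations'

# Submit the job to the cluster and wait for it to complete.
$sqoopJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $sqoopDef
Wait-AzureHDInsightJob -Job $sqoopJob -WaitTimeoutInSeconds 3600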
The following table describes specific considerations for using Sqoop in the HDInsight use cases and models described in this guide.
| Use case | Considerations |
|---|---|
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, using Sqoop is a simple way to transfer the results of data processing to SQL Database or a SQL Server instance for reporting or analysis. |
| Data warehouse on demand | When using HDInsight as a data warehouse for big data analysis, the data is generally accessed directly in Hive tables, making transfer to a database using Sqoop unnecessary. |
| ETL automation | Generally, the target of the ETL process is a relational database, and Sqoop may be the mechanism used to load the transformed data into the target database. |
| BI integration | When you want to integrate the results of HDInsight processing with an enterprise BI solution at the data warehouse level, you can use Sqoop to transfer data from HDInsight into the data warehouse tables, or (more commonly) into staging tables from where it will be loaded into the data warehouse. However, if network connectivity between HDInsight and the target database is not possible, you may need to consider an alternative technique to transfer the data, such as SQL Server Integration Services. |
SQL Server Integration Services
SQL Server Integration Services (SSIS) provides a flexible platform for building ETL solutions that transfer data between a wide range of data sources and destinations while applying transformation, validation, and data cleansing operations to the data as it passes through a data flow pipeline. SSIS is a good choice for transferring data from HDInsight to SQL Server when network security policies make it impossible to use Sqoop, or when you must perform complex transformations on the data as part of the import process.
To transfer data from HDInsight to SQL Server using SSIS you can create an SSIS package that includes a Data Flow task. At a minimum, the Data Flow task consists of a source, which extracts the data from the HDInsight cluster, and a destination, which loads the data into a SQL Server database. The task might also include one or more transformations, which apply specific changes to the data as it flows from the source to the destination.
The source used to extract data from HDInsight can be an ODBC source that uses a HiveQL query to retrieve data from Hive tables, or a custom source that programmatically downloads data from files in Azure blob storage.
Figure 2 shows an SSIS Data Flow task that uses an ODBC source to extract data from Hive tables, applies a transformation to convert the data types of the columns returned by the query, and then loads the transformed data into a table in a SQL Server database.
Figure 2 - Using SSIS to transfer data from HDInsight to SQL Server
The following table describes specific considerations for using SSIS in the HDInsight use cases and models described in this guide.
| Use case | Considerations |
|---|---|
| Iterative data exploration | For one-time analysis, or analysis by a small group of users, SSIS can provide a simple way to transfer the results of data processing to SQL Server for reporting or analysis. |
| Data warehouse on demand | When using HDInsight as a data warehouse for big data analysis, the data is generally accessed directly in Hive tables, making transfer to a database through SSIS unnecessary. |
| ETL automation | Generally, the target of the ETL process is a relational database. In some cases the ETL process in HDInsight might transform the data into a suitable structure, and then SSIS can be used to complete the process by transferring the transformed data to SQL Server. |
| BI integration | SSIS is often used to implement ETL processes in an enterprise BI solution, so it's a natural choice when you need to extend the BI solution to include big data processing results from HDInsight. SSIS is particularly useful when you want to integrate data from HDInsight into an enterprise data warehouse; you will typically use SSIS to extract the data from HDInsight into a staging table, and perhaps use a different SSIS package to load the staged data in synchronization with data from other corporate sources. |
Guidelines for integrating HDInsight with SQL Server
When planning to integrate HDInsight with SQL Server, consider the following guidelines:
- PolyBase is only available in Microsoft APS appliances.
- When using Sqoop to transfer data between HDInsight and SQL Server, consider the effect of firewalls between the HDInsight cluster in Azure and the SQL Server database server.
- When using SSIS to transfer data from Hive tables, use an explicit ODBC connection string instead of a DSN. This enables the SSIS package to run on a server where the DSN is not available.
- When using SSIS to transfer data from Hive tables, specify the DefaultStringColumnLength parameter in the ODBC connection string. The default value for this setting is 32767, which results in SSIS treating all strings as DT_TEXT or DT_NTEXT data type values. For optimal performance, limit strings to 4000 characters or less so that SSIS automatically treats them as DT_STR or DT_WSTR data type values (see the example connection string after this list).
- When using SSIS to work with Hive ODBC sources, set the ValidateExternalMetadata property of the ODBC data source component to False. This prevents Visual Studio from validating the metadata until you open the data source component, reducing the frequency with which the Visual Studio environment becomes unresponsive while waiting for data from the HDInsight cluster.
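As an illustration of the two connection string guidelines above, the following shows the general shape of an explicit (DSN-less) Hive ODBC connection string that limits the default string length. The host name and credentials are hypothetical placeholders, and you should confirm the exact parameter names (in particular the authentication settings) against the version of the Microsoft Hive ODBC Driver you are using.

Driver={Microsoft Hive ODBC Driver};Host=mycluster.azurehdinsight.net;Port=443;
Schema=default;HiveServerType=2;AuthMech=6;UID=UserName;PWD=P@ssw0rd;
DefaultStringColumnLength=4000;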