Coordinating solutions and tasks

From: Developing big data solutions on Microsoft Azure HDInsight

There are numerous options for coordinating the automation of the end-to-end process. Some common options on the Windows platform include scripts, custom applications, and tools such as SQL Server Integration Services (SSIS). This topic describes the following techniques:

  • Coordinating the process with Windows PowerShell scripts
  • Coordinating the process with a custom .NET application or service
  • Coordinating the process with SSIS

Coordinating the process with Windows PowerShell scripts

You can use PowerShell to automate almost all aspects of a typical HDInsight solution. You might use a single script or (more likely) a collection of related scripts that can be run interactively. These scripts may be called from a single control script, daisy-chained so that each one starts the next, or scheduled to run at predetermined times.

Examples of tasks that can be automated with PowerShell include:

  • Provisioning and deleting Azure storage accounts and HDInsight clusters.
  • Uploading data and script files to Azure storage.
  • Submitting map/reduce, Pig, and Hive jobs that process data.
  • Running a Sqoop job to transfer data between HDInsight and a relational database.
  • Starting an Oozie workflow.
  • Downloading output files generated by jobs.
  • Executing the DTExec.exe command line tool to run an SSIS package.
  • Running an XMLA command in SQL Server Analysis Services (SSAS) to process a data model.

PowerShell is often the easiest way to automate individual tasks or sub-processes, and it can be a good choice for relatively simple end-to-end processes that have few steps and little conditional branching. However, the dependency on one or more separate script files can make this approach fragile for complex solutions.
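
As a minimal sketch of the control script approach, the following example provisions a cluster, runs a Hive job, retrieves the output, and then deletes the cluster. It assumes the HDInsight cmdlets in the Azure PowerShell module described in this guide, that a subscription has already been selected, and that the storage account, container, cluster, and table names are placeholders you would replace with your own values.

    # Provision an HDInsight cluster, run a Hive job, and delete the cluster.
    # All names below are placeholders; replace them with your own values.
    $clusterName = "mycluster"
    $storageAccount = "mystorageaccount"
    $container = "mycontainer"

    # Provision the cluster, prompting for the admin credentials to assign to it.
    $creds = Get-Credential
    $storageKey = (Get-AzureStorageKey $storageAccount).Primary
    New-AzureHDInsightCluster -Name $clusterName -Location "North Europe" `
        -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageKey `
        -DefaultStorageContainerName $container `
        -ClusterSizeInNodes 4 -Credential $creds

    # Submit a Hive job, wait for it to complete, and display its output.
    $hiveJob = New-AzureHDInsightHiveJobDefinition -Query "SELECT country, COUNT(*) FROM visits GROUP BY country;"
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput

    # Delete the cluster now that processing is complete.
    Remove-AzureHDInsightCluster -Name $clusterName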

The following list describes how PowerShell can be used to automate an end-to-end solution for each of the big data use cases and models discussed in this guide:

  • Iterative data exploration: This is usually an interactive process that is performed by a small group of data analysts. PowerShell provides an easy-to-implement solution for automating on-demand provisioning and deletion of an HDInsight cluster, and for uploading commonly reused data and script files to Azure storage. The data analysts can then use PowerShell interactively to run jobs that analyze the data.

  • Data warehouse on demand: In this scenario you can use a PowerShell script to upload new data files to Azure storage, provision the cluster, recreate Hive tables, refresh the reports that are built on them, and then delete the cluster (a sketch of the table recreation step follows this list).

  • ETL automation: In a simple ETL solution you can encapsulate the jobs that filter and shape the data in an Oozie workflow, which can be initiated from a PowerShell script. If the source and/or destination of the ETL process is a relational database that is accessible from the HDInsight cluster, you can use Sqoop actions in the Oozie workflow. Otherwise you can use PowerShell to upload source files and download output files, or to run an SSIS package using the DTExec.exe command line tool.

  • BI integration: In an enterprise BI integration scenario there is generally an existing, established ETL coordination solution based on SSIS, and the processing of big data with HDInsight can be added to this solution. Some of the processing tasks may be automated using PowerShell scripts that are initiated by SSIS packages.
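
In the data warehouse on demand scenario, for example, the step that recreates the Hive tables can be implemented by submitting a HiveQL statement from the control script. The sketch below assumes the $clusterName variable from the earlier example and a hypothetical visits table stored as comma-delimited text in Azure storage; because the table is EXTERNAL, dropping it does not delete the underlying data files, so it can be recreated each time the cluster is provisioned.

    # Recreate a Hive table over data files that are already in Azure storage.
    # The table name, schema, and storage path are placeholders.
    $hiveQL = "DROP TABLE IF EXISTS visits; " +
              "CREATE EXTERNAL TABLE visits (visitdate STRING, country STRING, pageviews INT) " +
              "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
              "STORED AS TEXTFILE LOCATION '/data/visits';"
    $jobDef = New-AzureHDInsightHiveJobDefinition -Query $hiveQL
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
    Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600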

Coordinating the process with a custom .NET application or service

The .NET SDK for HDInsight provides a range of classes and interfaces that developers can use to interact with HDInsight, and additional .NET APIs enable integration with other software in the Microsoft data platform, such as SQL Server. This makes it possible to build custom applications for big data processing, or to enhance existing applications to integrate them with HDInsight.

Building a custom application or service to coordinate an HDInsight process is appropriate when you want to encapsulate your big data solution as a Windows service, or when you need to implement a business application that makes extensive use of big data processing.
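
As an illustration, the following sketch uses the job submission classes from the Microsoft.Hadoop.Client package in the .NET SDK for HDInsight to run a Hive job from a C# application. The cluster name, credentials, and query are placeholders, and error handling is omitted for brevity.

    // Submit a Hive job to an existing HDInsight cluster and wait for it to
    // complete. The cluster name, credentials, and query are placeholders.
    using System;
    using System.Threading;
    using Microsoft.Hadoop.Client;

    class HiveJobRunner
    {
        static void Main()
        {
            // Connect to the cluster's job submission endpoint with basic authentication.
            var credential = new BasicAuthCredential
            {
                Server = new Uri("https://mycluster.azurehdinsight.net"),
                UserName = "admin",
                Password = "cluster-password"
            };
            IJobSubmissionClient client = JobSubmissionClientFactory.Connect(credential);

            // Define and submit the Hive job.
            var hiveJob = new HiveJobCreateParameters
            {
                JobName = "Summarize visits",
                StatusFolder = "/statusfolder",
                Query = "SELECT country, COUNT(*) FROM visits GROUP BY country;"
            };
            JobCreationResults jobResults = client.CreateHiveJob(hiveJob);

            // Poll until the job completes or fails.
            JobDetails job = client.GetJob(jobResults.JobId);
            while (job.StatusCode != JobStatusCode.Completed && job.StatusCode != JobStatusCode.Failed)
            {
                Thread.Sleep(TimeSpan.FromSeconds(10));
                job = client.GetJob(jobResults.JobId);
            }
            Console.WriteLine("Job finished with status: {0}", job.StatusCode);
        }
    }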

The following list describes how custom .NET code can be used to automate an end-to-end solution for each of the big data use cases and models discussed in this guide:

  • Iterative data exploration: This is usually an interactive process that is performed by a small group of data analysts, so building a custom application to manage it is usually not required. However, if the iterative analysis evolves into a useful, repeatable business practice, you may want to implement a custom application that integrates the analytical processing into a business process.

  • Data warehouse on demand: If the data warehouse is specifically designed to help analysts examine data that is generated by a custom business application, you might integrate the process of uploading new data, provisioning a cluster, recreating Hive tables, refreshing reports, and deleting the cluster into the application itself, using classes and interfaces from the .NET SDK for HDInsight.

  • ETL automation: As in the data warehouse on demand scenario, if the ETL process is designed to take the output from a particular application and process it for analysis, you could manage the entire ETL process from the application itself.

  • BI integration: In this scenario there is generally an existing, established ETL coordination solution based on SSIS, and the processing of big data with HDInsight can be added to this solution. Some of the processing tasks may be automated using custom SSIS tasks, which are implemented in .NET code.

Coordinating the process with SSIS

SQL Server Integration Services (SSIS) provides a platform for implementing complex control flows and data pipelines that can be coordinated and managed centrally. Even if you are not planning to use SSIS data flows to transfer data into and out of Azure storage, you can still make use of SSIS control flows to coordinate the sub-processes in your HDInsight-based big data process.

SSIS includes a wide range of control flow tasks that you can use in your process, including:

  • Data Flow. Data flow tasks encapsulate the transfer of data from a source to a destination, with the ability to apply complex transformations and data cleansing logic as the data is transferred.
  • Execute SQL. You can use Execute SQL tasks to run SQL commands in relational databases. For example, after using Sqoop to transfer the output of a big data processing job to a staging table you could use an Execute SQL task to load the staged data into a production table.
  • File System. You can use a File System task to manipulate files on the local file system. For example, you could use a File System task to prepare files for upload to Azure storage.
  • Execute Process. You can use an Execute Process task to run a command, such as a custom command line utility or a PowerShell script.
  • Analysis Services Processing. You can use an Analysis Services Processing task to process (refresh) an SSAS data model. For example, after completing a job that creates Hive tables over new data you could process any SSAS data models that are based on those tables to refresh the data in the model.
  • Send Mail. You can use a Send Mail task to send a notification email to an operator when a workflow is complete, or when a task in the workflow fails.

Additionally, you can use a Script task or create a custom task using .NET code to perform custom actions.
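
For example, a Script task can perform a simple custom check, such as verifying that the output file downloaded from Azure storage exists before the rest of the workflow continues. The sketch below shows the body of the ScriptMain class that SSIS generates for a Script task, and assumes a hypothetical package variable named User::OutputPath.

    // Succeed only if the downloaded output file exists; otherwise fail the
    // task so that the rest of the workflow can respond accordingly.
    public void Main()
    {
        // User::OutputPath is a hypothetical package variable holding the file path.
        string outputPath = (string)Dts.Variables["User::OutputPath"].Value;

        Dts.TaskResult = System.IO.File.Exists(outputPath)
            ? (int)ScriptResults.Success
            : (int)ScriptResults.Failure;
    }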

SSIS control flows use precedence constraints to implement conditional branching, enabling you to create complex workflows that handle exceptions or perform actions based on variable conditions. SSIS also provides native logging support, making it easier to troubleshoot errors in the workflow.

The following list describes how SSIS can be used to automate an end-to-end solution for each of the big data use cases and models discussed in this guide:

  • Iterative data exploration: This is usually an interactive process that is performed by a small group of data analysts, so building an SSIS solution to automate it is unlikely to provide any significant benefit.

  • Data warehouse on demand: SSIS is designed to coordinate the transfer of data from one store to another, and can be used effectively for large volumes of data that require transformation and cleansing before being loaded into the target data warehouse. When the target is a Hive database in HDInsight, you can use SSIS Execute Process tasks to run command line applications or PowerShell scripts that provision HDInsight, load data into Azure storage, and create Hive tables. You can then use an Analysis Services Processing task to process any SSAS data models that are based on the data warehouse.

  • ETL automation: Although SSIS itself can be used to perform many ETL tasks, when the data must be shaped and filtered using big data processing techniques in HDInsight, you can use SSIS to coordinate the scripts and commands that provision the cluster, perform the data processing jobs, export the output to a target data store, and delete the cluster.

  • BI integration: In an enterprise BI integration scenario there is generally an existing, established ETL coordination solution based on SSIS, and the processing of big data with HDInsight can be added to this solution.

See SQL Server Integration Services in the MSDN Library for more information about using SSIS to automate and coordinate tasks.
