Task design and context

From: Developing big data solutions on Microsoft Azure HDInsight

When you design the individual tasks for an automated big data solution, and decide how they will be combined and scheduled, you should consider the following factors:

  • Task execution context
  • Task parameterization
  • Data consistency
  • Exception handling and logging

Task execution context

When you plan automated tasks, you must determine the user identity under which each task will execute, and ensure that this identity has sufficient permissions to access the files and services the task requires. The context under which components, tools, scripts, and custom client applications execute should have sufficient but not excessive permissions, granted only for the resources that are actually needed.

In particular, if the task uses the Azure PowerShell cmdlets or the .NET SDK for HDInsight classes to access Azure services, you must ensure that the execution context has access to the required Azure management certificates, credentials, and storage account keys. However, avoid storing credential information in scripts or code; instead, load these values from encrypted configuration files where possible.
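
For example, the following is a minimal sketch of one way to keep a credential out of the script itself by using the standard Export-Clixml and Import-Clixml cmdlets, which protect the serialized credential with DPAPI so that it can be decrypted only by the same account on the same computer. The file paths and setting names shown here are placeholders for your own configuration.

  # One-time setup: capture a credential and save it in encrypted form. Run this
  # while logged on as the account that will execute the automated task, because
  # DPAPI ties the encrypted file to that user and computer.
  Get-Credential | Export-Clixml -Path 'D:\Config\HDInsightCred.xml'

  # In the automated task: load the credential instead of embedding it in the script.
  $cred = Import-Clixml -Path 'D:\Config\HDInsightCred.xml'

  # Other settings, such as storage account names, can be read from a separate
  # configuration file that is protected with ACLs or encryption.
  $config = Get-Content -Path 'D:\Config\settings.json' -Raw | ConvertFrom-Json
  $storageAccountName = $config.StorageAccountName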

When you plan to schedule automated tasks, you must identify the account that the task will use when it runs in an unattended environment. The context under which on-premises components, tools, scripts, and custom client applications execute requires sufficient permissions to access certificates, publishing settings files, files on the local file system (or on a remote file share), SSIS packages that are run by using DTExec, and other resources, but not HDInsight itself, because the credentials for HDInsight are provided in the scripts or code (ideally loaded from encrypted configuration files, as described earlier).

Windows Task Scheduler enables you to specify Windows credentials for each scheduled task, and the SQL Server Agent enables you to define proxies that encapsulate credentials with access to specific subsystems for individual job steps. For more information about SQL Server Agent proxies and subsystems see Implementing SQL Server Agent Security.
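
As an illustration of the first option, the following sketch uses the ScheduledTasks cmdlets (available in Windows 8 and Windows Server 2012 or later) to register a daily task that runs a workflow script under a dedicated service account. The account name, schedule, and script path are placeholders.

  # Define what the task runs and when.
  $action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
               -Argument '-NoProfile -File D:\Jobs\RunBigDataWorkflow.ps1'
  $trigger = New-ScheduledTaskTrigger -Daily -At 2am

  # Register the task to run under a dedicated, least-privilege service account.
  Register-ScheduledTask -TaskName 'BigDataWorkflow' `
    -Action $action -Trigger $trigger `
    -User 'CONTOSO\svc-bigdata' -Password (Read-Host 'Service account password')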

Task parameterization

Avoid hard-coding variable elements in your big data tasks. These elements may include file locations, Azure service names, Azure storage access keys, and connection strings. Instead, design scripts, custom applications, and SSIS packages to use parameters or encrypted configuration settings files to assign these values dynamically. This approach can improve security, as well as maximize reuse, minimize development effort, and reduce the chance of errors caused by multiple versions that might have subtle differences. See “Securing credentials in scripts and applications” in the Security section of this guide for more information.
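
For example, a PowerShell script that forms part of the workflow might accept its variable elements as parameters and fall back to values read from a configuration file, as in the following sketch. The parameter and setting names are illustrative.

  # Variable elements are passed as parameters or read from a configuration file,
  # not hard-coded in the script.
  param(
      [Parameter(Mandatory = $true)] [string] $ConfigPath,
      [string] $ClusterName,
      [string] $InputFolder
  )

  # Load defaults from the configuration file; explicit parameters take precedence.
  $config = Get-Content -Path $ConfigPath -Raw | ConvertFrom-Json
  if (-not $ClusterName) { $ClusterName = $config.ClusterName }
  if (-not $InputFolder) { $InputFolder = $config.InputFolder }

  Write-Verbose "Using cluster '$ClusterName' and input folder '$InputFolder'"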

When you use SQL Server 2012 Integration Services or later, you can define project-level parameters and connection strings that can be set by using environment variables for a package deployed to the SSIS catalog. For example, you could create an SSIS package that encapsulates your big data process and deploy it to the SSIS catalog on a SQL Server instance. You can then define named environments (for example, “Test” or “Production”) and set the default parameter values to use when the package runs in the context of a particular environment. When you schedule an SSIS package to run by using a SQL Server Agent job, you can specify the environment to use.
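
As an indication of how this works, the following sketch starts a catalog-deployed package with a specific environment reference by calling the SSISDB catalog stored procedures through Invoke-Sqlcmd (from the SQL Server PowerShell module). The server, folder, project, and package names and the environment reference ID are placeholders that must match your own deployment.

  # Start a catalog-deployed package using a specific environment reference.
  $tsql = "
  DECLARE @execution_id BIGINT;
  EXEC SSISDB.catalog.create_execution
       @folder_name = N'BigData', @project_name = N'IngestAndProcess',
       @package_name = N'MainWorkflow.dtsx',
       @reference_id = 2,   -- ID of the reference to the 'Production' environment
       @use32bitruntime = 0,
       @execution_id = @execution_id OUTPUT;
  EXEC SSISDB.catalog.start_execution @execution_id;
  "
  Invoke-Sqlcmd -ServerInstance 'SQLSERVER01' -Query $tsql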

If you use project-level parameters in an SSIS project, ensure that you set the Sensitive option for any parameters that must be encrypted and stored securely. For more information see Integration Services (SSIS) Parameters.

Data consistency

Partial failures in a data processing workflow can lead to inconsistent results. In many cases, analysis based on inconsistent data can be more harmful to a business than no analysis at all.

When using SSIS to coordinate big data processes, use the control flow checkpoint feature to support restarting the package at the point of failure.
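
If the package has been configured to save checkpoints (by setting its SaveCheckpoints, CheckpointFileName, and CheckpointUsage properties), a failed run can be restarted from the last control flow task that completed successfully. The following is a hedged sketch, assuming the package is executed directly with DTExec; the package and checkpoint file paths are placeholders.

  # Run (or rerun after a failure) a package with checkpointing enabled so that
  # execution resumes from the last successfully completed control flow task.
  & dtexec /FILE 'D:\Packages\MainWorkflow.dtsx' `
           /CheckPointing on `
           /CheckFile 'D:\Checkpoints\MainWorkflow.chk'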

Consider adding custom fields to enable lineage tracking of all data that flows through the process. For example, add a field to all source data with a unique batch identifier that can be used to identify data that was ingested by a particular instance of the workflow process. You can then use this identifier to reverse all changes that were introduced by a failed instance of the workflow process.
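
For example, the following sketch generates a batch identifier and stamps it on every record in the staged source files before they are uploaded and processed. The folder layout is an assumption, and a real implementation would also allow for header rows and record the identifier in a lineage log.

  # Generate a unique identifier for this run of the workflow.
  $batchId = [guid]::NewGuid().ToString()

  # Append the batch ID as an extra column on every record in the staged files.
  Get-ChildItem -Path 'D:\Staging\*.csv' | ForEach-Object {
      $target = Join-Path 'D:\Tagged' $_.Name
      Get-Content -Path $_.FullName |
          ForEach-Object { "$_,$batchId" } |
          Set-Content -Path $target
  }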

Exception handling and logging

In any complex workflow, errors or unexpected events can cause exceptions that prevent the workflow from completing successfully, and when this happens it can be difficult to determine exactly what went wrong.

Most developers are familiar with common exception handling techniques, and you should ensure that you apply these to all custom code in your solution. This includes custom .NET applications, PowerShell scripts, map/reduce components, and Transact-SQL scripts. Implementing comprehensive logging functionality for both successful and unsuccessful operations in all custom scripts and applications helps to create a source of troubleshooting information in the event of a failure, as well as generating useful monitoring data.
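
For example, a PowerShell script in the workflow might wrap each step in a try/catch block and write both successes and failures to a central log file, as in the following sketch. The log path and the Invoke-ProcessingStep function are hypothetical.

  # Write timestamped entries for both successful and failed operations to a log file.
  $logPath = 'D:\Logs\workflow.log'

  function Write-Log ([string] $Message) {
      "$(Get-Date -Format o)  $Message" | Add-Content -Path $logPath
  }

  try {
      Write-Log 'INFO  Starting processing step'
      Invoke-ProcessingStep              # hypothetical step in the workflow
      Write-Log 'INFO  Processing step completed'
  }
  catch {
      Write-Log "ERROR Processing step failed: $($_.Exception.Message)"
      throw                              # rethrow so the scheduler records the failure
  }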

If you use PowerShell or custom .NET code to manage job submission and Oozie workflows, capture the job output returned to the client and include it in your logs. This helps centralize the logged information, making it easier to find issues that would otherwise require you to examine separate logs in the HDInsight cluster (which may have been deleted at the end of a partially successful workflow).
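
As an illustration, the following sketch assumes the HDInsight cmdlets in the Azure PowerShell module that was current when this guide was written. It submits a Hive job, waits for it to complete, and adds the output and error streams returned to the client to the workflow's own log. The cluster name, query, and log path are placeholders.

  # Submit a Hive job and capture the output returned to the client in the central log.
  $clusterName = 'mycluster'
  $hiveJob = New-AzureHDInsightHiveJobDefinition -Query 'SELECT COUNT(*) FROM sampletable;'

  $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJob
  Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600 | Out-Null

  # Capture both standard output and standard error so the details survive even if
  # the cluster (and its logs) is deleted at the end of the workflow.
  $stdOut = Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput
  $stdErr = Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError
  "$stdOut`n$stdErr" | Add-Content -Path 'D:\Logs\workflow.log'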

If you use SSIS packages to coordinate big data processing tasks, take advantage of the native logging capabilities in SSIS to record details of package execution, errors, and parameter values. You can also take advantage of the detailed log reports that are generated for packages deployed in an SSIS catalog.
