Scheduling solution and task execution
From: Developing big data solutions on Microsoft Azure HDInsight
Many data processing solutions are based on workflows containing recurring tasks that must execute at specific times. Data is often uploaded and processed as batches on regular schedules, such as overnight or at the end of each month. This may depend on a number of factors such as when all of the source data is available, when the processed results must be available for data consumers, and how long each step in the process takes to run.
Typically, initiation of an entire end-to-end solution or an individual task is one of the following types:
- Interactive: The operation is started on demand by a human operator. For example, a user might run a PowerShell script to provision a cluster.
- Scheduled: The operation is started automatically at a specified time. For example, the Windows Task Scheduler application could be used to run a PowerShell script or custom tool automatically at midnight to upload daily log files to Azure storage.
- Triggered: The operation is started automatically by an event, or by the completion (or failure) of another operation. For example, you could implement a custom Windows service that monitors a local folder. When a new file is created, the service automatically uploads it to Azure storage.
After the initial process or task has been started, it can start each sub-process automatically. Alternatively, you can start them on a scheduled basis that allows sufficient time for all dependent sub-processes to complete.
This topic discusses two different scheduling aspects for automated solutions:
- Scheduling automated tasks
- Scheduling data refresh in consumers
Scheduling automated tasks
Regardless of how you decide to implement the automated tasks in your big data process, you can choose from a variety of ways to schedule them for execution automatically at a specific time. Options for scheduling tasks include:
Windows Task Scheduler. You can use the Task Scheduler administrative tool (or the schtasks.exe command line program) to configure one-time or recurring commands, and specify a wide range of additional properties and behavior for each task. You can use this tool to trigger a command at specific times, when a specific event is written to the Windows event log, or in response to other system actions. Commands you can schedule with the Windows Task Scheduler include:
- Custom or third party application executables and batch files.
- PowerShell.exe (with a parameter referencing the PowerShell script to be run).
- DTExec.exe (with a parameter referencing the SSIS package to be run).
Note
See Task Scheduler Overview in the TechNet Library for more information about automating tasks with the Windows Task Scheduler.
SQL Server Agent. The SQL Server Agent is a commonly used automation tool for SQL Server related tasks. You can use it to create multistep jobs that can then be scheduled to run at specific times. The types of step you can use include the following:
- Operating System (CmdExec) steps that run a command line program.
- PowerShell steps that run specific PowerShell commands.
- SQL Server Analysis Services (SSAS) Command steps; for example, to process an SSAS data model.
- SQL Server Integration Services (SSIS) Package steps to run a SSIS packages.
- Transact-SQL steps to run Transact-SQL commands in a SQL Server database.
Note
See SQL Server Agent for more information about how you can automate tasks using SQL Server Agent.
SQL Server Agent offers greater flexibility and manageability than Windows Task Scheduler, but it requires a SQL Server instance. If you are already planning to use SQL Server, and particularly SSIS, in your solution then SQL Server Agent is generally the best way to automate scheduled execution of tasks. However, Windows Task Scheduler offers an effective alternative when SQL Server is not available.
You may also be able to use the Azure Scheduler service in your Azure cloud service applications to execute some types of tasks. Azure Scheduler can make HTTP requests to other services, and monitor the outcome of these requests. It is unlikely to be used for initiating on-premises applications and tasks. However, you might find it useful for accessing an HDInsight cluster directly to perform operations such as transferring data and performing management tasks within the cluster—many of these tasks expose a REST API that Azure Scheduler could access. For more information see Azure Scheduler on the Azure website.
Scheduling data refresh in consumers
You can use the Windows Task Scheduler and SQL Server Agent to schedule execution of an SSIS package, console application, or PowerShell script. However, reports and data models that consume the output of the processing workflow might need to be refreshed on their own schedule. You can process SSAS data models in an SSIS control flow by using a SQL Server Agent job or by using PowerShell to run an XMLA command in the SSAS server, but PowerPivot data models that are stored in Excel workbooks cannot be processed using this technique—and must be refreshed separately. Similarly, the refreshing of SQL Server Reporting Services (SSRS) reports that make use of caching or snapshots must be managed separately.
Scheduled data refresh for PowerPivot data models in SharePoint Server
PowerPivot workbooks that are shared in an on-premises SharePoint Server site can be refreshed interactively on-demand, or the workbook owner can define a schedule for automatic data refresh. In a regularly occurring big data process, the owners of shared workbooks are the data stewards for the data models they contain. As such, they must take responsibility for scheduling data refresh at the earliest possible time after updated data is available in order to keep the data models (and reports based on them) up to date.
For data refresh to be successful, the SharePoint Server administrator must have enabled an unattended service account for the PowerPivot service and this account must have access to all data sources in the workbook, as well as all required system rights for Kerberos delegation. The SharePoint administrator can also specify a range of business hours during which automatic scheduled refresh can occur.
For more information see PowerPivot Data Refresh with SharePoint 2013.
Scheduled data refresh for reports in Power BI
Workbooks that have been published as reports in a Power BI site can be configured for automatic data refresh on a specified schedule. This is useful in scenarios where you have used HDInsight to process data and load it into a database (using Sqoop or SSIS), and then built Excel data models based on the data in the database. In a regularly occurring big data process the owners of shared workbooks are the data stewards for the data models they contain, and as such must take responsibility for scheduling data refresh at the earliest possible time after updated data is available in order to keep the data models (and reports based on them) up to date.
Note
At the time this guide was written, automatic refresh was only supported for SQL Server and Oracle.
If the workbook makes use of on-premises data sources, the Power BI administrator must configure a data management gateway that allows access to the on-premises data source from the Power BI service in Office 365. This may be necessary if, for example, your big data process uses HDInsight to filter and shape data that is then transferred to an on-premises SQL Server data warehouse.
For information about configuring a data management gateway for Power BI see Create a Data Management Gateway. For information about scheduling automatic data refresh see Schedule data refresh for workbooks in Power BI for Office 365.
Scheduled data refresh for SSRS reports
SSRS supports two techniques for minimizing requests to report data sources: caching and snapshots. Both techniques are designed to improve the responsiveness of reports, and can be especially useful when data in the data source changes only periodically.
Caching involves fetching the data required to render the report the first time the report is requested, and maintaining a cached copy of the report until a scheduled time at which the cache expires. New data is fetched from the data source on the first request after the cache has expired. In SQL Server 2014 Reporting Services you can cache both datasets and reports.
Snapshots are pre-rendered reports that reflect the data at a specific point in time. You can schedule snapshot creation and then use the snapshots to satisfy report requests, without needing to query the data source again until the next scheduled snapshot.
In a big data process, these techniques can be useful not only for improving report rendering performance but also for including data from Hive tables in reports—even when the HDInsight cluster hosting the Hive tables has been deleted.
Caching and snapshot creation are controlled through the creation of schedules that are managed by Reporting Services. For more information about using these techniques in SSRS see Performance, Snapshots, Caching (Reporting Services).