Service Management Automation (SMA) Troubleshooting: queued runbook jobs

The Problem

Random SMA runbook jobs were intermittently getting stuck and remained in a “Queued” job status. The only thing that brought temporary “relief” was a restart of the Runbook Service on the affected server; after the restart, the queued jobs were processed successfully. Unfortunately, after some time, other random job instances would hang again and be left in the queued status.

https://blog.pohn.ch/wp-content/uploads/2016/10/2016-09-29_15-03-11.png

The Queued Status

The “Queued” status itself means that the runbook job instance has already been assigned to one of the workers participating in the deployment, but is waiting to be executed. If you want more details about how runbook jobs are executed, take a peek at the article series Michael Rueefli wrote on the topic:

Troubleshooting SMA (Service Management Automation) – Part 1
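
If you want to check for stuck jobs directly from PowerShell instead of the portal, the SMA cmdlets can list them. The sketch below is only an illustration: the endpoint URL is a placeholder, and the exact module and property names (e.g. JobStatus) may differ slightly between SMA versions.

# List all SMA jobs that are currently sitting in the "Queued" state.
# Endpoint URL is a placeholder; property names are assumptions - verify for your SMA version.
Import-Module Microsoft.SystemCenter.ServiceManagementAutomation

$endpoint = 'https://sma-server.contoso.com'

Get-SmaJob -WebServiceEndpoint $endpoint |
    Where-Object { $_.JobStatus -eq 'Queued' } |
    Sort-Object StartTime |
    Select-Object JobId, JobStatus, StartTime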

The Troubleshooting

Start troubleshooting by examining the SMA-related event logs. You can find the SMA Operational event log under “Applications and Services Logs” and then “Microsoft-ServiceManagementAutomation”. The first thing I noticed was that there were no “bad” events at all: neither Warnings nor Errors. So I took a closer look at the events logged within the time frame in which the runbook job got queued, and found only an event stating that the job had been started (Event ID 50008, Source: Microsoft-ServiceManagementAutomation, Description: The runbook job was started. Runbook: 'Name of your Runbook'. Requested by: Service Account Name).
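
Instead of clicking through Event Viewer, the same check can be scripted with Get-WinEvent. This is a minimal sketch; the channel name ('Microsoft-ServiceManagementAutomation/Operational') and the two-hour look-back window are assumptions you may need to adjust for your environment.

# Query the SMA operational log for the window in which a job got stuck.
# Channel name and time range are assumptions - adjust as needed.
$from = (Get-Date).AddHours(-2)

Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-ServiceManagementAutomation/Operational'
    StartTime = $from
} | Select-Object TimeCreated, Id, LevelDisplayName, Message | Format-Table -AutoSize

# Check specifically for the "job started" event mentioned above (Event ID 50008):
Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-ServiceManagementAutomation/Operational'
    Id        = 50008
    StartTime = $from
}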

The next troubleshooting step was to verify that the SMA configuration parameters were correct. By examining “Orchestrator.Settings.config” (…:\Program Files\Microsoft System Center 2012 R2\Service Management Automation), I could confirm that the relevant settings were still at their defaults:

<!--The values used to configure the PowerShell runtime-->
<add key="MaxRunningJobs" value="30"/>
<add key="TotalAllowedJobs" value="1000"/>
<add key="MaxRunningJobsPerWorker" value="120"/>
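
If you prefer to verify these values from PowerShell rather than opening the file, a minimal sketch could look like the following; the path matches the default install location mentioned above (drive letter assumed), and the XPath assumes the standard <add key="..." value="..."/> layout shown in the snippet.

# Read the current PowerShell runtime settings from Orchestrator.Settings.config.
# The install path is the default one quoted above - adjust the drive/folder if needed.
$configPath = 'C:\Program Files\Microsoft System Center 2012 R2\Service Management Automation\Orchestrator.Settings.config'

[xml]$config = Get-Content -Path $configPath

$config.SelectNodes('//add') |
    Where-Object { $_.key -in 'MaxRunningJobs','TotalAllowedJobs','MaxRunningJobsPerWorker','MaxConcurrentSandboxes' } |
    Select-Object key, value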

A short definition of the configuration parameters can be found here:

Monitoring and troubleshooting in Service Management Automation

"The values in the file are:

  • MaxRunningJobs – The number of jobs that can run concurrently in a Sandbox.

  • TotalAllowedJobs – The total number of jobs that a Sandbox can process during its lifetime. When this limit is hit, the Sandbox is no longer assigned new jobs and the existing jobs are allowed to complete. After that, the Sandbox is disposed.

  • MaxRunningJobsPerWorker – The number of concurrent jobs that can run in all the existing Sandboxes on a Runbook Worker at a time.

  • MaxConcurrentSandboxes – The number of Sandboxes that can run on a Runbook Worker at once. A new Sandbox is created to handle new module versions or to handle the case when the existing sandbox has reached the limit set on TotalAllowedJobs.

The suggested limit on the number of concurrent jobs that can be run on any particular worker (MaxRunningJobsPerWorker) defaults to 120. You can modify this number, although we don’t recommend increasing it unless you know that your workload consists mostly of non-resource-intensive runbooks such as monitoring jobs that don’t consume many resources but that run for long periods of time."

So where to next? A quick search on the Internet showed that a specific GPO setting (Turn On Script Execution -> Allow all scripts) could cause the SMA Runbook Service to “hang”. After reviewing the “gpresult” output on the SMA servers, I could confirm that this particular setting was not applied.
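
If you want to perform the same check yourself, the resultant set of policy can be exported with gpresult; in addition, the execution-policy GPO writes values into the policy hive when it is applied, so a quick registry check is another option. The registry path below is an assumption based on where that policy normally lands, so verify it for your environment.

# Export the resultant set of policy for the computer to an HTML report.
gpresult /h C:\Temp\gpresult.html /f

# The "Turn on Script Execution" GPO normally writes to this policy key;
# if the key is absent, the policy is most likely not applied (assumption - verify yourself).
Get-ItemProperty -Path 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\PowerShell' -ErrorAction SilentlyContinue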

At this stage I was left with two options:

  • Create a logman trace against the SMA provider (Provider Name="Microsoft-ServiceManagementAutomation", Guid="{2225E960-DE42-45EA-9940-DB3C9DC96AA}") and examine the files it generates (xml and summary.txt), as suggested in the blog mentioned above.

  • Dump the Orchestrator.Sandbox process (manual process dump) and provide the dumps to Microsoft for analysis.

A rough sketch of both options follows below.
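
Neither approach requires much beyond what is already on the box (logman) plus Sysinternals ProcDump. The sketch below is hedged: the session name, output paths and the ProcDump location are placeholders, and the provider GUID is simply the one quoted above.

# Option 1: create and start an ETW trace session for the SMA provider.
# Session name and output path are placeholders.
logman create trace SMATrace -p "{2225E960-DE42-45EA-9940-DB3C9DC96AA}" -o C:\Traces\SMATrace.etl
logman start SMATrace
# ... reproduce the issue, then stop the session:
logman stop SMATrace

# Option 2: take full memory dumps of the (hanging) sandbox processes with Sysinternals ProcDump.
# Assumes procdump.exe has been downloaded to C:\Tools; -ma writes a full dump.
Get-Process -Name 'Orchestrator.Sandbox' | ForEach-Object {
    & 'C:\Tools\procdump.exe' -ma $_.Id "C:\Dumps\Orchestrator.Sandbox_$($_.Id).dmp"
}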

Identifying the cause and solving the problem

To speed up the resolution process, manual process dumps were generated and provided to Microsoft Support for further analysis. The result: a known issue (a bug) with the Orchestrator.Sandbox process in conjunction with SMA 2012 R2 and Windows Management Framework (WMF) 5.0 was identified.

The behavior occurs only when two or more runbook jobs are running within the same Orchestrator.Sandbox process. In this case, the process gets deadlocked while converting the runbook code to XAML and stops responding to commands: it stops executing jobs and can no longer be shut down gracefully by the Service Control Manager as part of the Runbook Service restart.

I got confirmation that the Product Group is actively working on a fix, but until it ships, those affected by the issue have two possible workarounds:

Workaround 1 – Uninstall Windows Management Framework (WMF) 5.0 and revert back to v4.0

This particular workaround has been labeled “not recommended” by Microsoft Support: it will solve the issue with the queued SMA jobs, but it will also reintroduce the many known issues and bugs that were fixed in WMF 5.0. In my opinion as well, this is definitely not the way to go.

Workaround 2 – Force the runbook job execution within separate Orchestrator.Sandbox processes

Sounds fancy, right? Before I explain in more detail how to do this, I would like to note that it matters a great deal how you call your nested runbooks. If you nest your workflows (Child 1, Child 2, Child N) within the same parent workflow (Parent A) and call them directly, you will end up with a single job, executed within the same Orchestrator.Sandbox process. If you call your child runbooks with Start-SmaRunbook, you will have multiple jobs (equal to the number of called runbooks) and also one or more Orchestrator.Sandbox processes (depending on the settings in “Orchestrator.Settings.config”; please read on). Stefan Roth has already posted some nice details on this one:

SMA - About Jobs & Sandboxes
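
To make the difference concrete, here is a minimal, hypothetical parent runbook illustrating both calling patterns; the runbook names, the endpoint URL and the parameter are all placeholders, not something from the affected environment.

# Hypothetical parent runbook illustrating both calling patterns.
workflow Parent-A
{
    # Pattern 1: inline call - the child workflow runs inside the parent's job,
    # i.e. within the same Orchestrator.Sandbox process.
    Child-1

    # Pattern 2: Start-SmaRunbook - a separate SMA job is created for the child,
    # which can be picked up by its own sandbox process (depending on the
    # Orchestrator.Settings.config values discussed below).
    Start-SmaRunbook -WebServiceEndpoint 'https://sma-server.contoso.com' `
                     -Name 'Child-2' `
                     -Parameters @{ SomeParameter = 'SomeValue' }
}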

So what could separate the jobs by forcing them to run within different Sandbox processes? The settings in the .config file, of course.

By default, one Orchestrator.Sandbox process handles 30 concurrent jobs. So if we change this setting (MaxRunningJobs) to 1, we should be able to force the creation of an individual process for each job, right? That is only half of the truth. In order to execute multiple separate jobs within their own processes, and to do this in parallel, we have to edit one more setting in the config file – TotalAllowedJobs – and set it to 1 as well. This means that each Sandbox process immediately hits the TotalAllowedJobs limit, which allows a new Sandbox process to be created for the next job. So in the end, the needed modifications are:

<!--The values used to configure the PowerShell runtime-->
<add key="MaxRunningJobs" value="1"/>
<add key="TotalAllowedJobs" value="1"/>

Please be careful when you are making the change and make sure both parameters are set properly; otherwise you can end up executing your jobs sequentially (MaxRunningJobs set to 1 while TotalAllowedJobs is greater than 1) instead of in parallel.
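
The edited file is typically only picked up once the Runbook Service has been restarted, and the change has to be made on every Runbook Worker. A small, hedged sketch follows; the service is looked up by its display name, which may differ slightly in your installation.

# Restart the Runbook Service so the new sandbox settings take effect
# (repeat on every Runbook Worker in the deployment).
Get-Service -DisplayName 'Runbook Service*' | Restart-Service -Force

# With MaxRunningJobs=1 and TotalAllowedJobs=1 you should now see one
# Orchestrator.Sandbox process per running job:
Get-Process -Name 'Orchestrator.Sandbox' -ErrorAction SilentlyContinue |
    Select-Object Id, StartTime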

Conclusion

Hopefully this post helps make your SMA experience even better and saves you some precious time on research and troubleshooting. If you ever run into this particular behavior and need to implement one of the two workarounds, keep in mind that the recommended one is separating the jobs into different processes. Uninstalling WMF 5.0 (downgrading to WMF 4.0) is not preferred, as it will not only deprive your environment of the great features introduced with v5.0 (see the WMF 5.0 release notes), but also expose your servers to all the issues that were fixed in 5.0.