Azure Batch better handle Unusable nodes

Lucas Sabalka 0

We have an Azure Batch pool with this Scale script:

startingNumberOfVMs = 10;
minVms = 0;
maxNumberofVMs = 1200;
maxAdditionalPerInterval = 200;

pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 2 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));

desiredVms = min(maxNumberofVMs, pendingTaskSamples);
maxAllowed = $CurrentLowPriorityNodes + maxAdditionalPerInterval;
totalDesiredVms = min(desiredVms, maxAllowed);

$TargetDedicatedNodes = 0;
$TargetLowPriorityNodes = totalDesiredVms;
$NodeDeallocationOption = taskcompletion;

We queue batch jobs through the C# Batch library via CloudTasks, and we set MaxTaskRetryCount = 2 and MaxWallClockTime = TimeSpan.FromHours(12).

We have some tasks that failed on this pool where the task runs out of disk space (we're fixing the cause of that separately). The nodes that ran the failed jobs are marked as Unusable, while the jobs are marked as Active.

How do we set our scaling properly so failed nodes get deallocated and don't stay running -- is that $NodeDeallocationOption = requeue?

What do we need to do so that the jobs where nodes that run out of space show with an error instead of being marked as Active? It seems like maybe they are set to be retried, but there aren't any usable nodes, because the number of Unusable nodes is equal to the $TargetLowPriorityNodes count. Do we need to add the number of Unusable nodes to our $TargetedLowPriorityNodes count?

kobulloc-MSFT 25,651 Reputation points Microsoft Employee

2024-06-18T00:44:48.8966667+00:00

Hello, @Lucas Sabalka ! I am going to do a bit more research but I think the script below is what you are looking for.

$NodeDeallocationOption = requeue will terminate the task and put it back on the job queue so that it is rescheduled whereas terminate will terminate the task and remove it from the job queue.

When a node is unusable, tasks can't be scheduled on the node (but the node can be used to investigate the cause of the failure).

There is a PowerShell script that will detect all unusable nodes in a batch account/pool and delete them:

https://github.com/gjjtip0926/AzureBatchService/blob/main/DeleteUsuableNodes.ps
Lucas Sabalka 0 Reputation points

2024-06-18T16:13:45.48+00:00

Thanks for your answer, @kobulloc-MSFT !

You describe what requeue and terminate do to the task. Either way, this doesn't sound like what we're looking for. So I understand better, what does taskcompletion mean for the task? When is Node Deallocation triggered -- is it just with manual deallocation and autoscaling, or does it also trigger when a node becomes Unusable?

I think the script you provided runs outside of the pool or its scaling script. Is there a way to modify the autoscale script to account for the Unusable nodes automatically without requiring manual intervention or running that script?

Share via

Azure Batch better handle Unusable nodes