Azure Batch: how to better handle Unusable nodes

Lucas Sabalka
2024-06-17

We have an Azure Batch pool with this autoscale formula:

startingNumberOfVMs = 10;
minVms = 0;
maxNumberofVMs = 1200;
maxAdditionalPerInterval = 200;

pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 2 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));

desiredVms = min(maxNumberofVMs, pendingTaskSamples);
maxAllowed = $CurrentLowPriorityNodes + maxAdditionalPerInterval;
totalDesiredVms = min(desiredVms, maxAllowed);

$TargetDedicatedNodes = 0;
$TargetLowPriorityNodes = totalDesiredVms;
$NodeDeallocationOption = taskcompletion;
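
For what it's worth, here's roughly how we sanity-check the formula before applying it, using EvaluateAutoScale from the Microsoft.Azure.Batch C# client (a minimal sketch; the account URL, key, pool ID, and formula file name are placeholders):

using System;
using System.IO;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class AutoScaleCheck
{
    static void Main()
    {
        // Placeholder credentials: substitute real account details.
        var credentials = new BatchSharedKeyCredentials(
            "https://ouraccount.eastus.batch.azure.com", "ouraccount", "ourKey");

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // The formula above, saved to a file for convenience.
            string formula = File.ReadAllText("autoscale-formula.txt");

            // EvaluateAutoScale runs the formula against the pool's current
            // metrics without applying it, so we can see what value
            // $TargetLowPriorityNodes would get right now.
            AutoScaleRun run = batchClient.PoolOperations.EvaluateAutoScale("ourPoolId", formula);
            Console.WriteLine(run.Results);
        }
    }
}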

We queue Batch jobs through the C# Batch client library as CloudTask objects, and we set MaxTaskRetryCount = 2 and MaxWallClockTime = TimeSpan.FromHours(12) on each task.
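
For reference, the task setup looks roughly like this (a trimmed sketch; the IDs and command line are placeholders, and batchClient is an already-open BatchClient):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Batch;

static class TaskSetup
{
    // Adds one task with the retry and wall-clock constraints we use.
    public static async Task AddTaskAsync(BatchClient batchClient)
    {
        var task = new CloudTask("ourTaskId", "cmd /c ourjob.exe")
        {
            Constraints = new TaskConstraints(
                maxWallClockTime: TimeSpan.FromHours(12),
                retentionTime: null,
                maxTaskRetryCount: 2)
        };

        await batchClient.JobOperations.AddTaskAsync("ourJobId", task);
    }
}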

We have some tasks on this pool that failed because they ran out of disk space (we're fixing the cause of that separately). The nodes that ran the failed tasks are marked Unusable, while the jobs remain marked Active.

How do we set up our scaling so that failed nodes get deallocated and don't stay running? Is $NodeDeallocationOption = requeue the right setting for that?
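
As a stopgap, we've considered pruning Unusable nodes ourselves, something like the sketch below (the OData filter and pool ID are our assumptions, and we understand we'd have to disable autoscale on the pool first, since resize operations are blocked while autoscale is enabled):

using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;

static class PruneUnusable
{
    // Removes Unusable nodes from the pool; batchClient is an open
    // BatchClient and the pool ID is a placeholder.
    public static async Task RemoveUnusableNodesAsync(BatchClient batchClient)
    {
        // Server-side filter so we only pull back Unusable nodes.
        var detail = new ODATADetailLevel(filterClause: "state eq 'unusable'");
        var unusable = await batchClient.PoolOperations
            .ListComputeNodes("ourPoolId", detail)
            .ToListAsync();

        if (unusable.Any())
        {
            // TaskCompletion waits for any running tasks before the nodes
            // are removed, mirroring the autoscale formula's setting.
            await batchClient.PoolOperations.RemoveFromPoolAsync(
                "ourPoolId", unusable, ComputeNodeDeallocationOption.TaskCompletion);
        }
    }
}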

What do we need to do so that jobs whose nodes ran out of space show an error instead of staying Active? It seems like the tasks may be set to be retried, but there aren't any usable nodes left, because the number of Unusable nodes equals the $TargetLowPriorityNodes count. Do we need to add the number of Unusable nodes to our $TargetLowPriorityNodes count?
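
To surface the failure rather than leaving the job Active, we've been looking at checking each completed task's execution result and terminating the job ourselves, roughly like this sketch (the job ID is a placeholder, and terminating is just one option):

using System.Threading.Tasks;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;

static class FailureCheck
{
    // Scans completed tasks and terminates the job with a reason if any
    // task ended in failure; batchClient is an open BatchClient.
    public static async Task FailJobOnTaskErrorsAsync(BatchClient batchClient)
    {
        var detail = new ODATADetailLevel(filterClause: "state eq 'completed'");
        var tasks = await batchClient.JobOperations
            .ListTasks("ourJobId", detail)
            .ToListAsync();

        foreach (CloudTask task in tasks)
        {
            if (task.ExecutionInformation?.Result == TaskExecutionResult.Failure)
            {
                // Terminating moves the job out of Active so the failure
                // is actually visible, instead of the job waiting forever.
                await batchClient.JobOperations.TerminateJobAsync(
                    "ourJobId", terminateReason: "Task failed: " + task.Id);
                break;
            }
        }
    }
}

We also noticed CloudJob.OnTaskFailure and task ExitConditions with JobAction.Terminate, which look like a server-side way to do this; is that the right mechanism here?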
