Hi, thanks for reaching out. GpuUtilization shows what percentage of the GPU was utilized on a given node during a run or job. A node can have one or more GPUs, and the metric is published per GPU per node. You can filter by node to get a clearer picture of the computation. Let me know if that helps or if you need further assistance. Thanks.
How is AML's average GpuUtilization metric computed?
How is the "GpuUtilization" metric computed for an AML workspace? What are the inputs and what is the equation used to compute GpuUtilization?
The "metrics" tab in the AML web portal shows a chart of the GpuUtilization over a specified time period, along with the average GpuUtilization for that time period. However, I have found that the average GpuUtilization does not appear to accurately reflect the data shown in the chart for some of my organization's AML workspaces.
For example, the following screenshot shows the GpuUtilization for July 1-31, with the average GpuUtilization reported as 54.06. This is clearly much higher than what is shown in the chart. When I download the data from the chart (Share -> Download to Excel), I compute the average GpuUtilization to be ~11% in Excel. Why is there such a discrepancy?
I have found similar discrepancies for other AML workspaces as well. However, the average GpuUtilization appears to be more accurate for the August 1-25 time period than it is for July 1-31. I wish to better understand how AML computes the average GpuUtilization over a time period so we can accurately account for my organization's AML GPU usage on a per-workspace basis.
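To make the question concrete, here is a minimal Python sketch of one way two "averages" of the same per-GPU, per-node metric can diverge. It is an illustration only, not a description of how AML actually computes the figure. The sample values, the bucket count, and both function names are hypothetical, and it assumes (unconfirmed) that no sample is published while no job is running. One average is taken over published samples only; the other averages per time bucket and counts buckets with no data as zero, which is roughly what averaging chart rows in Excel would do if empty periods show as 0.

```python
from statistics import mean

# Hypothetical samples: (time_bucket, node, gpu_index, utilization_percent).
# Made-up numbers; assumes nothing is published while no job is running.
samples = [
    (0, "node-a", 0, 60.0),
    (0, "node-a", 1, 50.0),
    (1, "node-a", 0, 52.0),
    # buckets 2..9: idle, no samples published
]
TOTAL_BUCKETS = 10  # hypothetical number of time buckets in the chart range


def mean_over_published(samples):
    """Average over every published sample, ignoring idle time."""
    return mean(util for _, _, _, util in samples)


def mean_over_all_buckets(samples, total_buckets):
    """Average the per-bucket means, treating empty buckets as 0."""
    per_bucket = {}
    for bucket, _, _, util in samples:
        per_bucket.setdefault(bucket, []).append(util)
    return sum(mean(values) for values in per_bucket.values()) / total_buckets


print(mean_over_published(samples))                   # 54.0
print(mean_over_all_buckets(samples, TOTAL_BUCKETS))  # 10.7
```

If something like this is happening, the gap between the two numbers would grow with the share of idle time in the period, which might also explain why one month looks more accurate than another. Confirmation of the actual equation would still be needed.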