Monitoring Nodes
A key step in monitoring and maintaining cluster health is to identify any deviation from the normal operational state or performance. HPC Cluster Manager enables you to view cluster and node status at a glance, identify problem nodes, and drill down into node details for further investigation.
View cluster status at a glance
In Node Management, you can monitor your cluster at a glance using the node List view or the node Heat Map view. In Charts and Reports, the monitoring charts display current and recent data about node health and cluster utilization. For more information, see:
Drill down into individual node details
The List and Heat Map views provide a starting point for identifying problem areas. Double-click a compute node to see detailed information such as hardware, operating system properties, and current performance metrics. You can also select one or more nodes, then drill down into the node details to investigate performance.
Run Diagnostic Tests and Reports: Run diagnostic tests on one or more compute nodes.
View Performance Charts: View a chart of the performance metrics for a compute node over time.
View Node Events: View events generated by HPC services on a specific compute node.
Open a Remote Desktop Connection to your Nodes from HPC Cluster Manager: Open a remote desktop session to one or more compute nodes.
Monitor node operations
Tracking recent and ongoing cluster operations is another aspect of monitoring that is critical to administering a cluster. For more information, see:
Correlate the monitoring information between nodes, jobs, operations, and diagnostics
In HPC Cluster Manager, you can use the Pivot To actions to correlate the monitoring information between nodes, jobs, operations, and diagnostics. For example, you can select one or more nodes in the views pane, and then pivot to the Jobs for the Selected Nodes. This takes you to a job list view that is filtered by the nodes that you selected.
The supported pivot paths are:
Nodes: pivot to jobs, test results, and operations.
Jobs: pivot to nodes.
Test results: pivot to failed nodes and operations.
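As a rough illustration only, the pivot paths listed above can be modeled as a small lookup table, and the "Jobs for the Selected Nodes" pivot as a filter over a job list. This Python sketch is not an HPC Pack API: the names `PIVOT_PATHS`, `can_pivot`, and `jobs_for_selected_nodes`, the job record layout, and the sample node names are all hypothetical and exist only to show the idea.

```python
# Hypothetical model of the pivot paths listed above; not an HPC Pack API.
PIVOT_PATHS = {
    "nodes": {"jobs", "test results", "operations"},
    "jobs": {"nodes"},
    "test results": {"failed nodes", "operations"},
}


def can_pivot(source, target):
    """Return True if the list above names a pivot from source to target."""
    return target in PIVOT_PATHS.get(source, set())


def jobs_for_selected_nodes(jobs, selected_nodes):
    """Keep only the jobs that ran on at least one of the selected nodes."""
    selected = set(selected_nodes)
    return [job for job in jobs if selected & set(job["nodes"])]


# Made-up sample data, for illustration only.
sample_jobs = [
    {"id": 101, "nodes": ["NODE01", "NODE02"]},
    {"id": 102, "nodes": ["NODE03"]},
    {"id": 103, "nodes": ["NODE02", "NODE04"]},
]

filtered = jobs_for_selected_nodes(sample_jobs, ["NODE02"])
print([job["id"] for job in filtered])
```

With the sample data, selecting NODE02 keeps jobs 101 and 103, which mirrors how the pivoted job list view shows only jobs associated with the selected nodes.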
Monitor cluster usage and statistics over time
HPC Cluster Manager provides several built-in charts and reports to monitor and analyze cluster resource usage and job and node statistics over time. The HPCReporting database also supports custom reporting. For more information, see Charts and Reports: HPC Cluster Manager.