All our S2D Clusters suddenly freeze and fail when one node is offline

Question

Hello,

We are running 12 hyperconverged Hyper-V / S2D failover clusters on Server 2019. They were installed over the last 3 years, and everything was running fine and stable until a few months ago.

3 of them are 2-node clusters with nested resiliency (4-way mirror), using HDDs for capacity and NVMe drives for journal/cache.

9 of them are 3-node clusters with 3-way mirrored all-flash SSD storage.

Each cluster uses its own file share witness.

All of them have had the same issue for a few months now (I guess since a round of Windows updates):

When a node goes offline (whether planned or due to a failure), then after a few minutes to hours the S2D storage and all VMs on the remaining nodes become slower and slower until they stop responding entirely. They may even fail completely and then can't be started again.

A few minutes after the missing node is back online, everything becomes fast again and failed VMs can be started again (even before the storage repair jobs have finished).

This is catastrophic for us, because instead of high availability, we now have 1 failing node taking out 2 or 3 nodes in total.

The problem seems to get worse the more load is on the cluster, but even very little load is enough. We are talking about single-digit % CPU load and single-digit MB/s on the storage. It seems the remaining nodes are waiting for something they can't reach, or are stuck in something they can't handle, and in the meantime respond slowly or not at all to anything else. I can't find any cause in the logs, and there is no unusual load visible on the hosts.

The symptoms are:

VMs:

They become slower and slower.
Even simple things like opening the Start menu or Windows Explorer, opening another folder, or clicking a button inside an application first take seconds to respond, then minutes, then dozens of minutes.
Finally the VM crashes:
- It's not reachable over the network.
- The VM Connection / console window does not respond at all; I can't even click shutdown/reboot/start.
- I also can't click anything in the VM's dropdown menu in Failover Cluster Manager.
- I can't even open new VM console windows in Failover Cluster Manager.
- Hyper-V Manager doesn't show any VMs, it just displays "there are no vms on this host".
- Failover Cluster Manager shows some of them as failed, others as suspended, and some as running (while none is responding).
Once the failed node is back online, some VMs just resume, others are off, and others are in an error state but can now be started again.

C:\ClusterStorage....

It first opens slowly, again starting with seconds, then minutes, and finally shows just empty folders or errors, while PowerShell and Failover Cluster Manager show the virtual disks as online/degraded but NOT failed or detached.

PowerShell:

Everything related to Hyper-V and S2D takes very long, again seconds to minutes to dozens of minutes (VirtualDisk, PhysicalDisk, StorageJob, etc.).
Get-VirtualDisk shows the disks as degraded and never as failed or detached, even when all VMs have crashed and C:\ClusterStorage can't be accessed. It looks like some kind of timeout problem.

A queue seems to pile up on the S2D storage, because the remaining nodes don't respond while waiting for something they won't get, or are stuck in something they can't do until the failed node is back.
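
For anyone trying to reproduce this, here is a minimal sketch of the status checks described above, run from one of the remaining nodes. It only uses standard Storage module cmdlets; which properties to display is my own choice and not something from the original setup.

```powershell
# Virtual disk state as reported by the storage subsystem
# (in the situation described above this shows Degraded, not Failed/Detached)
Get-VirtualDisk |
    Select-Object FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus |
    Format-Table -AutoSize

# Physical disks that are not reported as healthy
# (typically the disks of the node that is offline)
Get-PhysicalDisk |
    Where-Object HealthStatus -ne 'Healthy' |
    Select-Object FriendlyName, SerialNumber, OperationalStatus, HealthStatus, Usage |
    Format-Table -AutoSize

# Repair / resync jobs and how far they have progressed
Get-StorageJob |
    Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal |
    Format-Table -AutoSize
```

Wrapping each call in Measure-Command { ... } would also record how long the cmdlet itself takes, which is one of the symptoms here.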

There are entries in the event log, but none of them point to the source of the problem. They are just the outcomes, like:

Hyper-V-StorageVSP

Event-ID 9:

I/O-Request for "C:\ClusterStorage....\VM-Name...vhdx" took 549629 milliseconds.

This is displayed for READ and WRITE for all VMs multiple times per second.

The time varies but is above 100000 milliseconds right from the first occurrence.

In the case I'm looking at right now, these events, as well as the slowing down of the VMs, started around 13 minutes after gracefully shutting down a node.
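
To pin down when these events start relative to the node going offline, something like the following sketch may help. The channel name is an assumption on my part and should be verified with Get-WinEvent -ListLog *StorageVSP* first; the two-hour window is arbitrary.

```powershell
# Assumption: Event ID 9 from Hyper-V-StorageVSP is logged in this channel.
# Verify the exact name with: Get-WinEvent -ListLog *StorageVSP*
$logName = 'Microsoft-Windows-Hyper-V-StorageVSP-Admin'

$events = Get-WinEvent -FilterHashtable @{
    LogName   = $logName
    Id        = 9
    StartTime = (Get-Date).AddHours(-2)   # arbitrary look-back window
} -ErrorAction SilentlyContinue

# Earliest slow-I/O events in the window (oldest first)
$events |
    Sort-Object TimeCreated |
    Select-Object -First 20 TimeCreated, MachineName, Message

# Number of slow-I/O events per minute, to see how quickly they ramp up
$events |
    Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH:mm') } |
    Sort-Object Name |
    Select-Object Name, Count
```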

Does anyone have a solution, or guidance on where to look? I don't see any hints as to what the remaining nodes are busy with while they are not answering storage and VM requests.

I've read articles about something like this happening in Server 2016 after updates, where the workaround was putting all disks into maintenance mode before taking a node offline. (I don't see the matching event ID in our logs, though.) Maintenance mode seems to improve things (this could also be coincidence, since I only did this outside working hours) but doesn't solve the problem completely. Also, if a node fails unplanned, we can't put its disks into maintenance mode afterwards. And even if we could do it manually, we would still have downtime in the meantime. The problem also sometimes happens just after we take the disks out of maintenance mode again.
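
For reference, the maintenance mode workaround for a planned shutdown looks roughly like the sketch below. The node name NODE01 is hypothetical, and this is only the generally documented storage maintenance mode procedure, not a confirmed fix for the issue described here.

```powershell
# Hypothetical node name, replace with the node that will be taken offline
$node = 'NODE01'

# Before starting: all virtual disks should be healthy
Get-VirtualDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus

# Pause the node and move its roles to the other nodes
Suspend-ClusterNode -Name $node -Drain -Wait

# Put all disks of that node into storage maintenance mode
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $node |
    Enable-StorageMaintenanceMode

# ... shut down / patch / reboot the node here ...

# Once the node is back: take its disks out of maintenance mode
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $node |
    Disable-StorageMaintenanceMode

# Resume the node and let the roles fail back
Resume-ClusterNode -Name $node -Failback Immediate

# Watch the resync jobs until they have finished
Get-StorageJob
```

As noted above, this only helps for planned maintenance; it cannot be applied after a node has already failed.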

Thanks

Answer

We had similar problems with the 4+ node clusters (including S2D-ready nodes). I think there is a problem with the storage resync/rebalance queue that does not switch to use available nodes/storage and keeps building up to the point where the entire cluster gets stuck in 3/4-way mirrored scenarios. We never had such a problem with 2-way mirrored storage pools.

Possible fixes/workarounds we used to stabilize our customer's environments:

Switch to 2-way mirroring where applicable https://video2.skills-academy.com/en-us/azure-stack/hci/concepts/nested-resiliency (I believe you need four nodes for this to work reliably).
Upgrade to Windows Server 2022 (never witnessed that behavior on the latest Windows Server).
Replace S2D with Virtual SAN https://www.starwindsoftware.com/vsan.

Answer

Hi Jemandanders,

Hope you're doing well.

As you suspect updates might be the cause, first verify whether any recent updates coincide with the start of your issues. Check whether any updates related to Hyper-V, Failover Clustering, or Storage Spaces Direct have been installed.
Check the health of your S2D cluster using "Get-ClusterPerf" and "Get-ClusterLog". The cluster log can provide detailed information about what's happening when nodes go offline. Then use the "Test-Cluster" cmdlet to run a full diagnostic test on your cluster.
Ensure that your network configuration is optimal for S2D and Failover Clustering. Any network issues can cause significant delays in I/O operations. Verify network performance and latency between the nodes using tools like "ping" and the "Test-Cluster" cmdlet.
Check event logs for any errors or warnings related to clustering, storage, and Hyper-V.
Ensure that all firmware and drivers, especially for storage and network components, are up to date.
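
As a rough sketch, the checks listed above could be run like this from one of the cluster nodes. The destination folder and the 60-minute window are assumptions chosen for illustration, and the Test-Cluster categories are just one possible selection.

```powershell
# Most recently installed updates on this node
Get-HotFix |
    Sort-Object InstalledOn -Descending |
    Select-Object -First 15 HotFixID, Description, InstalledOn

# Cluster-wide performance history (Get-ClusterPerf is the short name of
# Get-ClusterPerformanceHistory on Server 2019 S2D clusters)
Get-ClusterPerf

# Collect the cluster logs of all nodes for the last 60 minutes
# Assumption: C:\Temp\ClusterLogs is an acceptable place to store them
$dest = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $dest -Force | Out-Null
Get-ClusterLog -Destination $dest -TimeSpan 60 -UseLocalTime

# Cluster validation limited to the categories most relevant here
Test-Cluster -Include 'Storage Spaces Direct', 'Inventory', 'Network', 'System Configuration'
```

Generating the cluster log shortly after the slowdown begins keeps the relevant time window inside the collected logs.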

Best Regards,

Ian Xue


Answer

Dear colleagues, were you able to determine what the problem was? Did upgrading to Windows Server 2022 solve it?
