All our S2D Clusters suddenly freeze and fail when one node is offline

Jemandanders 0 Reputation points
2024-05-23T22:29:00.9366667+00:00

Hello,

We are running 12 hyperconverged Hyper-V / S2D failover clusters on Server 2019. They were installed over the last 3 years, and everything ran fine and stable until a few months ago.

3 of them are 2-node clusters with 4-way mirrored nested resiliency, using HDDs for capacity and NVMe drives for journal/cache.

9 of them are 3-node clusters with 3-way mirrored all-flash SSD storage.

All clusters use separate file share witnesses.

All of them have shown the same issue for a few months now (I suspect since Windows updates):

When a node goes offline (whether planned or due to a failure), the S2D storage and all VMs on the remaining nodes become slower and slower over the following minutes to hours, until they stop responding entirely. VMs may even fail completely and cannot be started again.

A few minutes after the missing node is back online, everything becomes fast again and failed VMs can be started again (even before the storage repair jobs have finished).

This is catastrophic for us: instead of high availability, one failing node now takes out all 2 or 3 nodes of the cluster.

The problem seems to get worse the more load is on the cluster, although even very little load is enough: single-digit % CPU load and single-digit MB/s on the storage. It seems the remaining nodes are waiting for something they can't reach, or are stuck in something they can't handle, and respond slowly or not at all to anything else in the meantime. I can't find any cause in the logs, and there is no unusual load visible on the hosts.

The symptoms are:

VMs:

  • become slower and slower.
  • Even simple things like opening the Start menu or Windows Explorer, opening another folder, or clicking a button inside an application first take seconds to respond, then minutes, then dozens of minutes.
  • and finally the VM crashes:
    • it's not reachable over the network
    • the VM Connection / console window does not respond at all; I can't even click shutdown/reboot/start.
    • I also can't click anything in the VM's dropdown menu in Failover Cluster Manager.
      • I can't even open new VM console windows in Failover Cluster Manager.
      • Hyper-V Manager doesn't show any VMs; it just displays "there are no vms on this host".
      • Failover Cluster Manager shows some of them as failed, others as suspended, and some as running (while none is responding).
  • once the failed node is back online, some VMs just resume, others are off, and others are in an error state but can now be started again.

C:\ClusterStorage....

  • first opens slowly, again taking seconds, then minutes, and finally shows just empty folders or errors, while PowerShell and Failover Cluster Manager show the virtual disks as online/degraded but NOT failed or detached.

Powershell:

  • everything related to Hyper-V and S2D takes very long, again seconds to minutes to dozens of minutes (Get-VirtualDisk, Get-PhysicalDisk, Get-StorageJob, etc.).
  • Get-VirtualDisk shows the disks as degraded and never as failed or detached, even when all VMs have crashed and C:\ClusterStorage can't be accessed. I suspect some kind of timeout problem.
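
For reference, the state described above can be captured with a short check like the one below. It is only a sketch built from standard Storage module cmdlets; the selected properties are one reasonable choice, and on an affected cluster each call may itself take a long time to return.

```powershell
# Run on one of the remaining nodes while the other node is offline.

# Virtual disk state as reported by the storage subsystem (Degraded vs. Detached).
Get-VirtualDisk |
    Select-Object FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus |
    Format-Table -AutoSize

# Repair/resync jobs and how much work is still outstanding.
Get-StorageJob |
    Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal |
    Format-Table -AutoSize

# Physical disks that are not reported as healthy.
Get-PhysicalDisk |
    Where-Object HealthStatus -ne 'Healthy' |
    Select-Object FriendlyName, SerialNumber, OperationalStatus, HealthStatus, Usage |
    Format-Table -AutoSize
```

Saving this output at intervals while the slowdown builds up makes it easier to see whether the storage jobs progress at all.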

A queue seems to pile up on the S2D storage, because the remaining nodes stop responding while they wait for something they won't get, or are stuck in something they can't complete, until the failed node is back.

There are entries in the event log, but none of them points to the source of the problem. They only show the consequences, for example:

Hyper-V-StorageVSP

Event-ID 9:

I/O-Request for "C:\ClusterStorage....\VM-Name...vhdx" took 549629 milliseconds.

This is logged for READ and WRITE, for all VMs, multiple times per second.

The time varies but is above 100000 milliseconds right from the first occurrence.

In the case I'm looking at right now, these events, as well as the slowdown of the VMs, started around 13 minutes after gracefully shutting down a node.
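
To pin down when these events start relative to the node going offline, they can be counted per minute. The following is a rough sketch: the provider name is an assumption based on the event source shown above (`Get-WinEvent -ListProvider '*StorageVSP*'` shows the exact name on a given host), and the 6-hour window is a hypothetical value.

```powershell
# Hypothetical window; set it to shortly before the node was shut down.
$since = (Get-Date).AddHours(-6)

# Assumption: the event source "Hyper-V-StorageVSP" is registered under this provider name.
Get-WinEvent -FilterHashtable @{
        ProviderName = 'Microsoft-Windows-Hyper-V-StorageVSP'
        Id           = 9
        StartTime    = $since
    } -ErrorAction SilentlyContinue |
    # Bucket the events by minute to see when the slow I/O begins and how fast it grows.
    Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH:mm') } |
    Sort-Object Name |
    Select-Object Name, Count
```

Running it on each remaining node shows whether the slow I/O begins on all hosts at the same time.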

Does anyone have a solution, or guidance on where to look? I don't see any hints as to what the remaining nodes are busy with while they are not answering storage and VM requests.

I've read articles about something similar happening on Server 2016 after updates, where the workaround was to put all disks into maintenance mode before taking a node offline. (However, I don't see the matching event ID in our logs.) Maintenance mode seems to improve the situation (this could also be coincidence, since I only did it outside working hours) but doesn't solve it completely. Also, if a node fails unplanned, we can't put its disks into maintenance mode afterwards. Even if we could do so manually, we would still have downtime in the meantime. In addition, the problem sometimes appears right after we take the disks out of maintenance mode again.
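
For a planned shutdown, the maintenance mode workaround looks roughly like the sketch below. It follows the usual order for taking an S2D node offline (check health, drain, storage maintenance mode, then reverse); `Node1` is a hypothetical node name, and this obviously does not help with unplanned failures.

```powershell
$node = 'Node1'   # hypothetical node name

# 1. Only proceed when all volumes are healthy and no repair jobs are running.
Get-VirtualDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus
Get-StorageJob

# 2. Drain the roles off the node and pause it.
Suspend-ClusterNode -Name $node -Drain -Wait

# 3. Put that node's disks into storage maintenance mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $node |
    Enable-StorageMaintenanceMode

# ... patch / reboot the node here ...

# 4. Take the disks out of maintenance mode and resume the node.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $node |
    Disable-StorageMaintenanceMode
Resume-ClusterNode -Name $node -Failback Immediate

# 5. Watch the resync; repeat until no jobs remain.
Get-StorageJob
```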

Thanks

2 answers

  1. Net Runner 600 Reputation points
    2024-05-29T14:24:28.14+00:00

    We had similar problems with the 4+ node clusters (including S2D-ready nodes). I think there is a problem with the storage resync/rebalance queue that does not switch to use available nodes/storage and keeps building up to the point where the entire cluster gets stuck in 3/4-way mirrored scenarios. We never had such a problem with 2-way mirrored storage pools.

    Possible fixes/workarounds we used to stabilize our customer's environments:

  2. Ian Xue (Shanghai Wicresoft Co., Ltd.) 33,301 Reputation points Microsoft Vendor
    2024-05-27T04:16:17.48+00:00

    Hi Jemandanders,

    Hope you're doing well.

    1. As you suspect updates might be the cause, first verify if there have been any recent updates that coincide with the start of your issues. Check if any updates related to Hyper-V, Failover Clustering, or Storage Spaces Direct have been installed.
    2. Check the health of your S2D cluster using "Get-ClusterPerf" and "Get-ClusterLog". The cluster log can provide detailed information about what happens when a node goes offline. Then use the "Test-Cluster" cmdlet to run a full diagnostic test on your cluster.
    3. Ensure that your network configuration is optimal for S2D and Failover Clustering. Network issues can cause significant delays in I/O operations. Verify network performance and latency between the nodes using tools like "ping" and the "Test-Cluster" cmdlet.
    4. Check the event logs for any errors or warnings related to clustering, storage, and Hyper-V.
    5. Ensure that all firmware and drivers, especially for storage and network components, are up to date.
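
    The first steps above could be run along these lines. This is a minimal sketch rather than a complete diagnostic procedure: the destination folder and the 60-minute time span are hypothetical values, and the validation categories are just the ones most relevant to S2D.

    ```powershell
    # Step 1: most recently installed updates, to compare with the date the problem started.
    Get-HotFix |
        Sort-Object InstalledOn -Descending |
        Select-Object -Property HotFixID, Description, InstalledOn -First 20

    # Step 2: cluster logs covering the last 60 minutes, from all nodes, in local time.
    # 'C:\Temp\ClusterLogs' is a hypothetical folder; create it before running this.
    Get-ClusterLog -Destination 'C:\Temp\ClusterLogs' -TimeSpan 60 -UseLocalTime

    # Steps 2 and 3: cluster validation limited to the categories relevant here.
    Test-Cluster -Include 'Storage Spaces Direct', 'Inventory', 'Network', 'System Configuration'
    ```

    Generating the cluster log shortly after a slowdown starts keeps the relevant entries inside the selected time span.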

    Best Regards,

    Ian Xue

