Hello,
we are running 12 hyperconverged Hyper-V / S2D failover clusters on Server 2019. They were installed over the last 3 years, and everything ran fine and stable until a few months ago.
3 of them are 2-node clusters with 4-way mirroring (nested resiliency), HDDs for capacity, and NVMe drives for journal/cache.
9 of them are 3-node clusters with 3-way mirrored all-flash SSD storage.
All clusters use separate file share witnesses.
All of them have had the same issue for a few months now (I guess since a round of Windows updates):
When a node goes offline (planned or due to a failure), then after a few minutes to hours the S2D storage and all VMs on the remaining nodes become slower and slower, until they stop responding altogether. They may even fail completely and then can't be started again.
A few minutes after the missing node is back online, everything becomes fast again and failed VMs can be started again (even before the storage repair jobs are finished).
This is catastrophic for us because, instead of high availability, we now have 1 failing node taking out 2 or 3 nodes in total.
The problem seems to get worse the more load is on the cluster, although even very little load is enough: single-digit % CPU load and single-digit MB/s on the storage.
It seems the remaining nodes are waiting for something they can't reach, or are stuck in something they can't handle, and in the meantime respond slowly or not at all to anything else.
I can't find any cause in the logs, and there is no unusual load visible on the hosts.
The symptoms are:
VMs:
- become slower and slower.
- Even simple things like opening the Start menu or Windows Explorer, opening another folder, or clicking a button inside an application first take seconds to respond, then minutes, then dozens of minutes.
- and finally the VM crashes:
- it's not reachable over the network
- the VM Connection / console window does not respond at all; I can't even click shutdown/reboot/start.
- I also can't click anything in the VM's dropdown menu in Failover Cluster Manager.
- I can't even open new VM console windows in Failover Cluster Manager.
- Hyper-V Manager doesn't show any VMs, it just displays "there are no vms on this host"
- Failover Cluster Manager shows some of them as failed, others as suspended, some as running (while none of them is responding)
- Once the failed node is back online, some VMs just resume, others are off, and others are in an error state but can now be started again.
C:\ClusterStorage....
- first opens slowly, again starting with seconds, then minutes, and finally shows just empty folders or errors, while PowerShell and Failover Cluster Manager show the virtual disks as online/degraded but NOT failed or detached.
PowerShell:
- everything related to Hyper-V and S2D takes very long, again seconds to minutes to dozens of minutes (VirtualDisk, PhysicalDisk, StorageJob, etc.).
- Get-VirtualDisk shows the disks as degraded and never as failed or detached, even when all VMs have crashed and C:\ClusterStorage can't be accessed. It must be some timeout problem.
A queue seems to pile up on the S2D storage because the remaining nodes don't respond while waiting for something they won't get, or are stuck in something they can't do until the failed node is back.
There are entries in the event log, but none of them point to the cause of the problem, only to its consequences, like:
Hyper-V-StorageVSP
Event-ID 9:
I/O-Request for "C:\ClusterStorage....\VM-Name...vhdx" took 549629 milliseconds.
This is logged for READ and WRITE for all VMs, multiple times per second.
The time varies but is above 100000 milliseconds right from the first occurrence.
In the case I'm looking at right now, these events, as well as the slowdown of the VMs, started around 13 minutes after gracefully shutting down a node.
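To put numbers on these delays, the Event-ID 9 messages could be exported as text and summarized per VHDX file. The following is only a minimal Python sketch. It assumes the message wording matches the example above (the real text may differ by OS language and version), and the function names and the 100000 ms threshold are illustrative, not part of any Hyper-V tooling.

```python
import re
from statistics import median

# Assumed message wording, taken from the example event above.
# The actual event text may differ by OS language and version.
PATTERN = re.compile(r'"(?P<path>[^"]+)"\s+took\s+(?P<ms>\d+)\s+milliseconds')


def parse_latency(message):
    """Return (path, latency_ms) for one event message, or None if it does not match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    return m.group("path"), int(m.group("ms"))


def summarize(messages, threshold_ms=100_000):
    """Group latencies by VHDX path and report count, median, max, and slow requests."""
    per_path = {}
    for msg in messages:
        parsed = parse_latency(msg)
        if parsed is None:
            continue  # skip lines that are not StorageVSP latency messages
        path, ms = parsed
        per_path.setdefault(path, []).append(ms)
    return {
        path: {
            "count": len(vals),
            "median_ms": median(vals),
            "max_ms": max(vals),
            "over_threshold": sum(v > threshold_ms for v in vals),
        }
        for path, vals in per_path.items()
    }
```

Feeding `summarize` the message text of all Event-ID 9 entries from one incident would show whether every VHDX is affected equally or whether single volumes stand out.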
Does anyone have a solution, or guidance on where to look? I don't see any hints as to what the remaining nodes are busy with while they are not answering the storage and VM requests.
I've read articles about something like this happening in Server 2016 after updates, where the workaround was putting all disks into maintenance mode before taking a node offline. (Although I don't see the matching event ID in our logs.)
Maintenance mode seems to improve the problem (this could also be coincidence, since I only did this outside working hours) but doesn't solve it completely.
Also, if a node fails unplanned, we can't put the disks into maintenance mode afterwards. And even if we could do it manually, we would still have downtime in the meantime.
Also the problem sometimes happens just after we take disks out of maintenance mode again.
Thanks