Hi, I'm trying to set up MS Failover Clustering in our environment using Server 2016, and I just can't seem to get it to work. I have nodes dropping out at least once a day, and every time I launch the Failover Cluster MMC, I get a cluster failure. I've gotten to the point where I'm running two test servers with no agents, no services, and no shared disks, just to work on getting their heartbeats working and the cluster stable. Note that these servers are virtual, and they live in a vSphere 6.7 environment. I have put them both on the same ESXi host to eliminate that variable as well. I took the following steps:
- Installed the Failover Clustering feature on the two servers.
- Set up a GPO with all the firewall settings, to ensure that the two servers could communicate via the failover ports.
- Added a second NIC on a non-routable network without a gateway for heartbeat purposes.
- Ran cluster validation, which passed.
- Set up the cluster using a pre-staged DNS object.
- Gave the cluster computer object permissions to the DNS record.
- Set up a file share witness on an SMB share that exists on the same subnet as the client traffic. Gave the cluster computer object modify permissions to the share.
- Set the following settings on the cluster via PowerShell:
(Get-Cluster).SameSubnetDelay = 2000
(Get-Cluster).SameSubnetThreshold = 10
(Get-Cluster).RouteHistoryLength = 20
(Get-Cluster).CrossSubnetDelay = 2000
(Get-Cluster).CrossSubnetThreshold = 10
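For what it's worth, here is a rough sketch of how long those values let a node miss heartbeats before it is treated as down. It assumes the tolerance is simply the delay multiplied by the threshold; the variable names are mine, for illustration only.

```powershell
# Values copied from the settings above
# (delay in milliseconds, threshold in missed heartbeats)
$sameSubnetDelayMs    = 2000
$sameSubnetThreshold  = 10
$crossSubnetDelayMs   = 2000
$crossSubnetThreshold = 10

# Assumption: tolerance = heartbeat interval x number of missed heartbeats allowed
$sameSubnetToleranceSec  = ($sameSubnetDelayMs * $sameSubnetThreshold) / 1000
$crossSubnetToleranceSec = ($crossSubnetDelayMs * $crossSubnetThreshold) / 1000

"Same-subnet tolerance:  $sameSubnetToleranceSec seconds"
"Cross-subnet tolerance: $crossSubnetToleranceSec seconds"
```

With the values above, both work out to 20 seconds, so under that assumption a node should only be dropped after roughly 20 seconds of missed heartbeats.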
In the course of troubleshooting, I made the following changes:
- Created the REG_DWORD value ArpRetryCount under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and set it to 3
- Set "Notify Switches" to "no" in the vSphere portgroup for the heartbeat network
- Set (get-cluster).SameSubnetThreshold = 20
- Turned off IPv6 on all interfaces
- Ensured both servers are at the same patch level and running the latest VMware Tools version
I have looked at the cluster log, and it appears to be a network issue.
I have found entries like the following:
15:31:33.082 ERR [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~]/recv: Failed to retrieve the results of overlapped I/O: 10060
00000378.0000133c::2020/09/16-15:31:33.082 WARN [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~] failure, status 10060
00000378.0000133c::2020/09/16-15:31:33.082 WARN [PULLER Server2] ReadObject failed with (10060)' because of 'channel to remote endpoint fe80::98b5:fd65:fe3c:2d42%5:~3343~ has failed with status 10060'
00000378.0000133c::2020/09/16-15:31:33.082 ERR [NODE] Node 1: Connection to Node 2 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::98b5:fd65:fe3c:2d42%5:~3343~ has failed with status 10060'
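To make entries like these easier to compare, here is a small sketch that pulls the remote endpoint, port, and status code out of each CHANNEL line. The regular expression is my own guess at the line layout, based only on the sample lines above and not on any documented log format. Status 10060 is the Winsock WSAETIMEDOUT (connection timed out) error, and 3343 is the cluster service port.

```powershell
# Two sample lines copied verbatim from the cluster log excerpt above
$lines = @(
    '15:31:33.082 ERR [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~]/recv: Failed to retrieve the results of overlapped I/O: 10060'
    '00000378.0000133c::2020/09/16-15:31:33.082 WARN [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~] failure, status 10060'
)

# Assumed layout: [CHANNEL <address>:~<port>~] ... <5-digit status> at end of line
$pattern = '\[CHANNEL (?<addr>[^~\]]+):~(?<port>\d+)~\].*?(?<status>\d{5})\s*$'

foreach ($line in $lines) {
    if ($line -match $pattern) {
        # Emit one object per matching line so results can be sorted or grouped
        [pscustomobject]@{
            Address = $Matches.addr
            Port    = [int]$Matches.port
            Status  = [int]$Matches.status
        }
    }
}
```

Both sample lines resolve to the same link-local IPv6 address (fe80::98b5:fd65:fe3c:2d42%5), port 3343, and status 10060, which at least suggests the timeouts in this excerpt are all on the node-to-node channel over IPv6.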
My research is showing that IPv6 should be enabled on all the interfaces in order for the heartbeat packets to appear. I enabled IPv6 on the two interfaces, and ran the following commands to make sure that the settings were reset on any hidden interfaces:
Set-Net6to4Configuration -State Default
Set-NetTeredoConfiguration -Type Default
Set-NetIsatapConfiguration -State Default
Unfortunately, I found that enabling IPv6 increased the number of failure events.
I also set the Network Interfaces so that the client NIC would be first in the binding order, followed by the cluster NIC on both VMs.
What traffic do I need to ensure is getting through between the two nodes and the File Share Witness? Does IPv6 need to be enabled to connect to the FSW? I just want to know what settings I'm missing here, because whatever the issue is, it's not in the instructions for setting up a cluster, and I've been looking at this for too long.
Thanks for the help!