Server 2016 Failover Cluster Blues

mdrudge 1 Reputation point
2020-09-22T21:36:33.987+00:00

Hi, I'm trying to set up MS Failover Clustering in our environment using Server 2016, and I just can't seem to get it to work. I have nodes dropping out at least once a day, and every time I launch the Failover Cluster MMC, I get a cluster failure. I've gotten to the point where I'm running two test servers with no agents, no services, and no shared disks, just to work on getting their heartbeats working and the cluster stable. Note that these servers are virtual, and they live in a vSphere 6.7 environment. I have put them both on the same ESXi host to eliminate that variable as well. I took the following steps:

  1. Installed the Failover Clustering feature on the two servers.
  2. Set up a GPO with all the firewall settings, to ensure that the two servers could communicate via the failover ports.
  3. Added a second NIC on a non-routable network without a gateway for heartbeat purposes.
  4. Ran the Cluster validation and passed.
  5. Set up the cluster using a pre-staged DNS object.
  6. Gave the cluster computer object permissions to the DNS record.
  7. Set up a file share witness on an SMB share that exists on the same subnet as the client traffic. Gave the cluster computer object modify permissions to the share.
  8. Set the following settings on the cluster via PowerShell:
    (Get-Cluster).SameSubnetDelay = 2000
    (Get-Cluster).SameSubnetThreshold = 10
    (Get-Cluster).RouteHistoryLength = 20
    (Get-Cluster).CrossSubnetDelay = 2000
    (Get-Cluster).CrossSubnetThreshold = 10
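
As a rough illustration of what the delay and threshold settings in step 8 mean together: the delay is the interval between heartbeats in milliseconds, and the threshold is the number of consecutive missed heartbeats tolerated, so their product is how long a node can stay silent before it is removed. The helper name below is hypothetical, not a built-in cmdlet.

```powershell
# Hypothetical helper (not part of the FailoverClusters module):
# seconds a node can go silent before the cluster removes it.
function Get-HeartbeatToleranceSeconds {
    param(
        [int]$DelayMs,    # interval between heartbeats, in milliseconds
        [int]$Threshold   # consecutive missed heartbeats tolerated
    )
    return ($DelayMs * $Threshold) / 1000
}

# Values from step 8: 2000 ms delay, threshold of 10
Get-HeartbeatToleranceSeconds -DelayMs 2000 -Threshold 10

# On a cluster node, the live values could be read instead:
# $c = Get-Cluster
# Get-HeartbeatToleranceSeconds -DelayMs $c.SameSubnetDelay -Threshold $c.SameSubnetThreshold
```

With the values from step 8 this works out to 20 seconds; a threshold of 20 with the same delay would give 40 seconds.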

In the course of troubleshooting, I made the following changes:

  • Created the REG_DWORD value ArpRetryCount under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and set it to 3
  • Set "Notify Switches" to "no" in the vSphere portgroup for the heartbeat network
  • Set (Get-Cluster).SameSubnetThreshold = 20
  • Turned off IPv6 on all interfaces
  • Ensured both servers are at the same patch level and run the latest VMware Tools version

I have looked at the cluster log, and it appears to be a network issue.

I have found entries like the following:

15:31:33.082 ERR [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~]/recv: Failed to retrieve the results of overlapped I/O: 10060
00000378.0000133c::2020/09/16-15:31:33.082 WARN [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~] failure, status 10060
00000378.0000133c::2020/09/16-15:31:33.082 WARN [PULLER Server2] ReadObject failed with (10060)' because of 'channel to remote endpoint fe80::98b5:fd65:fe3c:2d42%5:~3343~ has failed with status 10060'
00000378.0000133c::2020/09/16-15:31:33.082 ERR [NODE] Node 1: Connection to Node 2 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::98b5:fd65:fe3c:2d42%5:~3343~ has failed with status 10060'
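
To make entries like these easier to read, here is a minimal sketch that pulls the remote endpoint, port, and status code out of each line. The regular expressions are assumptions based only on the lines shown here, not on a documented log format. Status 10060 is the Winsock code WSAETIMEDOUT (connection timed out), and an fe80:: address is an IPv6 link-local address, with the %5 suffix identifying the interface.

```powershell
# Two lines copied from the log excerpt above.
$lines = @(
    "00000378.0000133c::2020/09/16-15:31:33.082 WARN [CHANNEL fe80::98b5:fd65:fe3c:2d42%5:~3343~] failure, status 10060",
    "00000378.0000133c::2020/09/16-15:31:33.082 ERR [NODE] Node 1: Connection to Node 2 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::98b5:fd65:fe3c:2d42%5:~3343~ has failed with status 10060'"
)

# Winsock code lookup; only the code seen in this log is listed.
$winsock = @{ 10060 = 'WSAETIMEDOUT (connection timed out)' }

$results = foreach ($line in $lines) {
    # In these entries the endpoint appears as <address>%<zone>:~<port>~
    if ($line -match '([0-9a-f:]+%\d+):~(\d+)~') {
        $endpoint = $Matches[1]
        $port = [int]$Matches[2]
        if ($line -match 'status (\d+)') {
            $status = [int]$Matches[1]
            [pscustomobject]@{
                Endpoint = $endpoint
                Port     = $port
                Status   = $status
                Meaning  = $winsock[$status]
            }
        }
    }
}

$results | Format-List
```

Both sample lines resolve to the same link-local endpoint on port 3343 with a timeout status, which suggests the failing channel is running over IPv6 link-local addressing rather than the IPv4 heartbeat network.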

My research is showing that IPv6 should be enabled on all the interfaces in order for the heartbeat packets to appear. I enabled IPv6 on the two interfaces, and ran the following commands to make sure that the settings were reset on any hidden interfaces:

Set-Net6to4Configuration -State Default
Set-NetTeredoConfiguration -Type Default
Set-NetIsatapConfiguration -State Default

Unfortunately, I found that enabling IPv6 increased the number of failure events.

I also set the Network Interfaces so that the client NIC would be first in the binding order, followed by the cluster NIC on both VMs.

What traffic do I need to ensure is getting through between the two nodes and the File Share Witness? Does IPv6 need to be enabled to connect to the FSW? I just want to know what settings I'm missing here, because whatever the issue is, it's not in the instructions for setting up a cluster and I've been looking at this for too long.

Thanks for the help!

Windows Server 2016
Windows Server Clustering

1 answer

  1. Xiaowei He 9,891 Reputation points
    2020-09-23T07:47:05.7+00:00

    every time I launch the Failover Cluster MMC, I get a cluster failure.

    Could you please provide a screenshot of the cluster failure that appears when you launch Failover Cluster Manager (FCM)?

    Please try adding an additional NIC on both nodes, then configure its cluster network as "Cluster only" and check whether that makes the heartbeat network between the nodes more stable.
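
    The "Cluster only" setting can also be applied from PowerShell. This is a minimal sketch, assuming it runs on a cluster node with the FailoverClusters module available; the network name "Heartbeat" is a placeholder, so list the real names first.

    ```powershell
    # List the cluster networks with their current roles.
    Get-ClusterNetwork | Format-Table Name, Role, Address

    # Role values: 0 = not used by the cluster,
    #              1 = cluster traffic only,
    #              3 = cluster and client traffic.
    # "Heartbeat" is a hypothetical network name; substitute one from the list above.
    (Get-ClusterNetwork -Name "Heartbeat").Role = 1
    ```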

    In addition, please check whether the virtual NIC driver can be updated from the vSphere side.

    Thanks for your time!
    Best Regards,
    Anne
