Nodes being removed from Failover Cluster membership on VMware ESX?

Welcome to the AskCore blog. Today, we are going to talk about nodes being removed from active Failover Cluster membership when the nodes are hosted on VMware ESX. I have documented node membership problems in a previous blog post:

Having a problem with nodes being removed from active Failover Cluster membership?
https://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

This is a sample of the event you will see in the System Event Log in Event Viewer:

[Image: sample event from the System Event Log]

One specific problem that I have seen a few times lately is VMXNET3 adapters dropping inbound network packets because the inbound buffer is set too low to handle large amounts of traffic. You can easily find out whether this is the problem by using Performance Monitor to look at the “Network Interface\Packets Received Discarded” counter.

[Image: the Packets Received Discarded counter in Performance Monitor]

Once you have added this counter, look at the Average, Minimum, and Maximum values. If any of them is higher than zero, the receive buffer for the adapter needs to be increased. This problem is documented in VMware’s Knowledge Base:

Large packet loss at the guest OS level on the VMXNET3 vNIC in ESXi 5.x / 4.x
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495
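
The counter check described above can also be scripted rather than read off the Performance Monitor graph. The following is a minimal sketch, not part of the original guidance: it assumes the counter has been exported to CSV (for example with the Windows `typeperf` tool) and reports which adapters show a value above zero. The host name, adapter names, and sample values are made up for illustration.

```python
import csv
import io

# Illustrative CSV in the layout typeperf writes, e.g. from:
#   typeperf "\Network Interface(*)\Packets Received Discarded" -sc 2 -si 5 -o discards.csv
# NODE1, the adapter names, and the numbers are hypothetical sample data.
SAMPLE = r'''"(PDH-CSV 4.0)","\\NODE1\Network Interface(vmxnet3 Ethernet Adapter)\Packets Received Discarded","\\NODE1\Network Interface(vmxnet3 Ethernet Adapter _2)\Packets Received Discarded"
"06/25/2013 10:00:00.000","0.000000","1523.000000"
"06/25/2013 10:00:05.000","0.000000","1788.000000"
'''


def discarded_by_adapter(csv_text):
    """Return the highest sampled value for each counter column."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, samples = rows[0], rows[1:]
    peaks = {}
    for col in range(1, len(header)):  # column 0 is the timestamp
        values = []
        for row in samples:
            try:
                values.append(float(row[col]))
            except (IndexError, ValueError):
                continue  # skip blank or malformed samples
        if values:
            peaks[header[col]] = max(values)
    return peaks


def adapters_dropping_packets(csv_text):
    """List the counter paths whose peak value is above zero."""
    peaks = discarded_by_adapter(csv_text)
    return sorted(name for name, peak in peaks.items() if peak > 0)


if __name__ == "__main__":
    for name in adapters_dropping_packets(SAMPLE):
        print("Discards seen on:", name)
```

Any adapter this lists is a candidate for a larger receive buffer, following the VMware article linked above.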

I hope that this post helps you!

Thanks,

James Burrage
Senior Support Escalation Engineer
Windows High Availability Group

Comments

  • Anonymous
    January 01, 2003
    What to increase it to is difficult to answer, as all networks and environments are different. Some increase it by a little, some double the value, some need to go higher. This is a setting in the VMware network card driver, and even their article referenced in this blog does not state what you should increase it to. You could start by doubling the value and monitoring it. If it needs raising, then increase it. If the dropped packets appear to go away, leave it or lower it and monitor. Unfortunately, it's not really a setting where we can say that "x" will resolve the issue, as each environment is different.
  • Anonymous
    January 01, 2003
    Nice tip, thanks for sharing James!
  • Anonymous
    January 01, 2003
    Hi there. I tried to figure out what the recommended values are, but couldn't find them in the VMware docs. Can you point me to these values? Thanks, Shimon
  • Anonymous
    January 01, 2003
    So this problem only occurs when the VM vNIC is VMXNET3, and not with any other type such as e1000?
  • Anonymous
    June 25, 2013
    Wow, nice tip.  I'm seeing huge numbers for packet drops on the replication network.  The default is "not present" so what number should we start with?  Thanks.
  • Anonymous
    July 17, 2014
    We don't have the suggested values, you have to contact VMware for that guidance. I don't know what the maximums are unfortunately.
  • Anonymous
    September 05, 2014
    In a cluster with failover issues, if you have a separate heartbeat NIC and public NIC, and the public NIC shows a value higher than zero, will increasing the buffer on that NIC fix the failover issues, or does it only help if the problem is on the heartbeat NIC?
  • Anonymous
    December 31, 2014
    I would like to echo Robbie Foust's request: what are good values to start with?
  • Anonymous
    August 06, 2015
    I have been looking at my servers that have VMXNET3 installed and noticed that all of them are set to "Not Present". So I began with the default value given in this VMware article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495

    Once I did this all my counters dropped to 0. Thanks for the info.
  • Anonymous
    August 10, 2015
    @MW the counters are cumulative and they reset to zero once the NICs are reset for any reason. Changing the value resets the NIC briefly, so it will always bring the counters down to zero. You still need to monitor them for discarded packets later.
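
To illustrate the point about cumulative counters, here is a minimal sketch of how readings taken over time could be compared so that a NIC reset is not mistaken for a fix. It assumes a reading lower than the previous one means the counter restarted from zero. The function name and sample readings are hypothetical.

```python
def new_discards(readings):
    """Count packets discarded across a series of cumulative readings.

    The first reading is only a baseline. A reading lower than the one
    before it is treated as a counter reset (for example after a NIC
    setting was changed), so counting restarts from that reading.
    """
    total = 0
    previous = None
    for value in readings:
        if previous is not None:
            if value >= previous:
                total += value - previous  # normal growth since last reading
            else:
                total += value  # counter reset; count what accrued since
        previous = value
    return total


if __name__ == "__main__":
    # Counter drops to 0 when the buffer setting is changed, then stays flat.
    print(new_discards([1500, 1800, 0, 0, 0]))
    # Counter resets, but discards start accumulating again afterwards.
    print(new_discards([1500, 0, 40, 90]))
```

In the second series the counter reads low right after the change, yet the later readings show that packets are still being discarded, which is why continued monitoring matters.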