Failover Cluster Node State is “Down” and Cluster Service Terminates or Adding a New Failover Cluster Node Fails with Time Out Error

Situation

Consider one of the following scenarios.

  1. You have a failover cluster. In the Nodes node of the Failover Cluster Manager MMC, the Status of one or more nodes is displayed as Down, although the servers are actually up and the Cluster service is running. Later, the Cluster service terminates with a timeout error and is restarted. No other relevant messages appear in the event logs.
  2. You are trying to add a new node to a failover cluster. The Add Node Wizard takes you to the Configure the Cluster page, where a progress bar is displayed. The progress bar hangs for an extended period with the status “Waiting for notification that the node <Node Name> is a fully functional member of the cluster”. Later, the status changes to “Unable to successfully cleanup”. Finally, the Add Node Wizard fails with the following error message.
    The server '<Node FQDN>' could not be added to the cluster.
    An error occurred while adding node '<Node FQDN>' to cluster '<Cluster Name>'.
    This operation returned because the timeout period expired

Symptoms

The following is the only relevant error message that appears in the node's System Event Log if the node experiencing this issue is already a cluster member (Scenario 1 listed above).

Log Name:      System
Source:        Service Control Manager
Date:          16.07.2011 14:06:26
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      <Node FQDN>
Description:
The Cluster Service service terminated with service-specific error The wait operation timed out..
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
      <Provider Name="Service Control Manager"
                Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}"
                EventSourceName="Service Control Manager" />
      <EventID Qualifiers="49152">7024</EventID>
      <Version>0</Version>
      <Level>2</Level>
      <Task>0</Task>
      <Opcode>0</Opcode>
      <Keywords>0x8080000000000000</Keywords>
      <TimeCreated SystemTime="2011-07-16T10:06:26.443710600Z" />
      <EventRecordID>15132</EventRecordID>
      <Correlation />
      <Execution ProcessID="788" ThreadID="3484" />
      <Channel>System</Channel>
      <Computer>Node FQDN</Computer>
      <Security />
    </System>
    <EventData>
      <Data Name="param1">Cluster Service</Data>
      <Data Name="param2">%%258</Data>
    </EventData>
</Event>

If the node is not yet a cluster member and you try to add it (Scenario 2 above), the following event might be logged in addition to the one above.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          09.07.2011 5:21:30
Event ID:      1572
Task Category: Cluster Virtual Adapter
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      <Node FQDN>
Description:
Node '<Node Name>' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. Please run the Validate a Configuration wizard to ensure network settings. Also verify the Windows Firewall 'Failover Clusters' rules.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" 
              Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
    <EventID>1572</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>39</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2011-07-09T01:21:30.183978000Z" />
    <EventRecordID>8752339</EventRecordID>
    <Correlation />
    <Execution ProcessID="6388" ThreadID="6840" />
    <Channel>System</Channel>
    <Computer>Node FQDN</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="NodeName">Node Name</Data>
  </EventData>
</Event>

Note

Event 1572 is not logged every time the issue is reproduced, so treat it only as a supporting symptom of the issue.
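
To look for both symptoms on several nodes at once, the System log can be queried from PowerShell. The following is a minimal sketch, assuming Get-WinEvent is available and the remote event logs are reachable. The function names Test-ClusterSymptomEvent and Get-ClusterSymptomEvent are hypothetical helpers, not part of any module. Because event 7024 is logged for any service, the sketch keeps only records whose first parameter is "Cluster Service", as in the event XML above.

```powershell
# Hypothetical predicate: is this event one of the two symptoms described above?
function Test-ClusterSymptomEvent {
    param([int]$Id, [string]$FirstParameter)
    # 1572 always comes from Failover Clustering; 7024 must name the Cluster Service.
    ($Id -eq 1572) -or ($Id -eq 7024 -and $FirstParameter -eq 'Cluster Service')
}

# Hypothetical helper: reads the System log of each computer and returns
# only the events that match the predicate above.
function Get-ClusterSymptomEvent {
    param(
        [string[]]$ComputerName = @('localhost'),
        [datetime]$After = (Get-Date).AddDays(-7)
    )
    $filter = @{ LogName = 'System'; Id = 7024, 1572; StartTime = $After }
    foreach ($computer in $ComputerName) {
        # SilentlyContinue hides the "no events found" error for clean nodes.
        Get-WinEvent -ComputerName $computer -FilterHashtable $filter -ErrorAction SilentlyContinue |
            Where-Object { Test-ClusterSymptomEvent -Id $_.Id -FirstParameter $_.Properties[0].Value }
    }
}
```

For example, Get-ClusterSymptomEvent -ComputerName "Node1", "Node2" (placeholder node names) would list the matching events from the last seven days on both nodes.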

More Information

If you run the Failover Cluster Validation Wizard, it finds no issues, because all the necessary firewall rules are in place and enabled.

(The wizard does help when the issue really is caused by firewall rules or network connectivity. See the links section at the end of this article for more details on such cases.)

Cause

If the Failover Cluster Validation Wizard doesn't detect the issue, the most likely cause is the state of Windows Firewall. The switch configuration can also be the cause. For example, the Auto DoS / Storm Protection feature on some HP switches blocks the UDP packet exchange during the initial handshake.

Resolution

Open the Server Manager MMC on each affected server. Navigate to Configuration > Windows Firewall with Advanced Security. In the Actions pane, click Properties. Make sure that for every profile (not only the Domain profile) the Inbound connections setting is not set to Block all connections. The acceptable options are Block (default) and Allow.
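
The same check can be made from the command line. The following is a minimal sketch, assuming English-language netsh output in which the "Block all connections" setting appears as the BlockInboundAlways policy. The function name Get-BlockAllInboundProfile is a hypothetical helper.

```powershell
# Hypothetical helper: returns the names of the firewall profiles whose
# inbound policy is BlockInboundAlways ("Block all connections" in the GUI).
function Get-BlockAllInboundProfile {
    param([string[]]$NetshOutput)
    $current = $null
    foreach ($line in $NetshOutput) {
        if ($line -match '^\s*(\w+) Profile Settings:') {
            # Remember which profile the following lines belong to.
            $current = $Matches[1]
        }
        elseif ($current -and $line -match 'Firewall Policy\s+BlockInboundAlways,') {
            $current
        }
    }
}

# Run on the node being checked; skipped when netsh.exe is not present.
$netsh = "$Env:SystemRoot\System32\netsh.exe"
if (Test-Path -Path $netsh) {
    $output = & $netsh advfirewall show allprofiles firewallpolicy
    Get-BlockAllInboundProfile -NetshOutput $output
}
```

Any profile name that the function returns needs its Inbound connections setting changed to Block (default) or Allow.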
If the switch is the cause, review its configuration (on an HP switch, the Auto DoS / Storm Protection settings).

Additional Troubleshooting Steps

If you are unsure whether the cluster problems are caused by Windows Firewall, you can use the following script to temporarily disable the firewall on all cluster nodes at once.

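# Load the Failover Clustering cmdlets.
Import-Module -Name "FailoverClusters"
# Collect the names of all nodes in the cluster.
$Node = Get-Cluster -Name "Cluster.Contoso.com" |
    Get-ClusterNode |
        Select-Object -ExpandProperty "Name"
# Script block that turns Windows Firewall off for all profiles on a node.
$Command = {
    & "$Env:SystemRoot\System32\netsh.exe" advfirewall set allprofiles state off
}
# Run the script block on every node through Windows PowerShell Remoting.
Invoke-Command -ScriptBlock $Command -ComputerName $Node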

Note

This will not work if the state of Windows Firewall is enforced with Group Policy, or if Windows PowerShell Remoting is disabled.

Note

Under no circumstances should you leave the cluster in this state after your troubleshooting is complete, whether it was successful or not. Windows Firewall is an important security measure that is highly recommended for all environments, even those that are well protected at the perimeter.
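
To turn the firewall back on after troubleshooting, the same approach can be used with the state set to on. The following is a minimal sketch under the same assumptions as the script above (the FailoverClusters module is installed, Windows PowerShell Remoting is enabled, and Group Policy does not enforce the firewall state). The function name Set-ClusterFirewallState is a hypothetical helper, not a cmdlet from the FailoverClusters module.

```powershell
# Hypothetical helper: sets the Windows Firewall state for all profiles
# on every node of the given cluster.
function Set-ClusterFirewallState {
    param(
        [Parameter(Mandatory = $true)]
        [string]$ClusterName,

        [Parameter(Mandatory = $true)]
        [ValidateSet('On', 'Off')]
        [string]$State
    )
    # Stop immediately if the clustering cmdlets are not available.
    Import-Module -Name 'FailoverClusters' -ErrorAction Stop
    $nodes = Get-Cluster -Name $ClusterName |
        Get-ClusterNode |
            Select-Object -ExpandProperty 'Name'
    # Pass the requested state into the remote script block as an argument.
    Invoke-Command -ComputerName $nodes -ArgumentList $State.ToLower() -ScriptBlock {
        param($RequestedState)
        & "$Env:SystemRoot\System32\netsh.exe" advfirewall set allprofiles state $RequestedState
    }
}

# Example call that restores the firewall once troubleshooting is finished:
# Set-ClusterFirewallState -ClusterName 'Cluster.Contoso.com' -State 'On'
```

Passing the state as a validated parameter keeps the disable and re-enable steps in one place, so the restore step is harder to forget or mistype.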

Below is a listing of the properties of the Windows Firewall exception that is created by default when the Windows Failover Clustering feature is installed. The exception is therefore in place even before the node is joined to the cluster.

netsh advfirewall firewall show rule name="Failover Clusters (UDP-In)" verbose

Rule Name:                            Failover Clusters (UDP-In)
----------------------------------------------------------------------
Description:                          Inbound rule for Failover Clusters to allow internal cluster communication by the cluster virtual network adapter. [UDP 3343]
Enabled:                              Yes
Direction:                            In
Profiles:                             Domain,Private,Public
Grouping:                             Failover Clusters
LocalIP:                              Any
RemoteIP:                             Any
Protocol:                             UDP
LocalPort:                            3343
RemotePort:                           3343
Edge traversal:                       No
Program:                              System
InterfaceTypes:                       Any
Security:                             NotRequired
Rule source:                          Local Setting
Action:                               Allow

Ok.

If, for whatever reason, the Windows Firewall settings in your environment block intra-cluster communication, make sure that your exceptions have the same or less restrictive settings than the rule listed above.
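
A captured rule listing can be compared with these settings in a script. The following is a minimal sketch, assuming English-language netsh output in the "Name: Value" layout shown above. The function names ConvertFrom-NetshRuleListing and Test-ClusterUdpRule are hypothetical helpers, and the sample lines are copied from the listing above.

```powershell
# Hypothetical helper: turns "Name:   Value" lines into a hashtable.
function ConvertFrom-NetshRuleListing {
    param([string[]]$NetshOutput)
    $rule = @{}
    foreach ($line in $NetshOutput) {
        if ($line -match '^([^:]+):\s+(.*)$') {
            $rule[$Matches[1].Trim()] = $Matches[2].Trim()
        }
    }
    $rule
}

# Hypothetical helper: returns $true when the rule is enabled and allows
# inbound UDP traffic on local port 3343 (or on any port).
function Test-ClusterUdpRule {
    param([hashtable]$Rule)
    ($Rule['Enabled'] -eq 'Yes') -and
    ($Rule['Direction'] -eq 'In') -and
    ($Rule['Action'] -eq 'Allow') -and
    ($Rule['Protocol'] -eq 'UDP') -and
    ('3343', 'Any' -contains $Rule['LocalPort'])
}

# Sample lines copied from the listing above.
$listing = @(
    'Rule Name:                            Failover Clusters (UDP-In)',
    'Enabled:                              Yes',
    'Direction:                            In',
    'Protocol:                             UDP',
    'LocalPort:                            3343',
    'Action:                               Allow'
)
Test-ClusterUdpRule -Rule (ConvertFrom-NetshRuleListing -NetshOutput $listing)
```

The check covers only the properties named in the code. Profiles, remote addresses, and the other exceptions in the Failover Clusters group would need similar checks.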

Note

This exception is enabled and applies to all network profiles. It is also not the only exception that the Failover Clustering feature creates and requires. A lack of the other exceptions can cause similar problems in other areas of clustering functionality.


See Also

The following articles describe similar yet different scenarios.