Network Failure Detection and Recovery in a Two-Node Server Cluster

ID: Q242600


The information in this article applies to:
  • Microsoft Windows 2000 Advanced Server


SUMMARY

The Windows 2000 Cluster service runs a sophisticated algorithm to detect the availability of network interfaces. Also, the Plug and Play functionality of Windows 2000 detects disconnected network cables and connectivity problems between the network adapter and the device it is connected to, such as a hub or a switch. This article describes the network failure detection and recovery process on a two-node Windows 2000 Server Cluster.


MORE INFORMATION

The Cluster service detects the health of the network interfaces on your server cluster by sending a heartbeat from one node in the cluster to another node, and by monitoring node operational status information. Heartbeats are single User Datagram Protocol (UDP) packets exchanged between server cluster Node Managers every 1.2 seconds to confirm that each network interface is still up.

If the heartbeat packet is not received within two heartbeat periods, and the Local Area Network (LAN) to which the server cluster is connected is configured for client-to-cluster communication, then the Cluster service tests the ability of each node to communicate with external hosts. External hosts, by this definition, are computers that have an open connection to the cluster node and exist on the same subnet. A commonly used external host is the local router (default gateway).
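The timing rule above can be illustrated with a minimal sketch. It is not the Cluster service implementation; the function and parameter names are hypothetical, and only the 1.2-second interval and the two-period threshold come from this article.

```python
HEARTBEAT_PERIOD_S = 1.2   # heartbeat interval stated in this article
MISSED_PERIODS = 2         # "two heartbeat periods" before further testing


def heartbeat_overdue(last_heartbeat_s: float, now_s: float) -> bool:
    """True when no heartbeat has arrived for more than two heartbeat periods."""
    return (now_s - last_heartbeat_s) > HEARTBEAT_PERIOD_S * MISSED_PERIODS


def should_probe_external_hosts(last_heartbeat_s: float, now_s: float,
                                lan_serves_clients: bool) -> bool:
    """External-host testing applies only to LANs configured for client access.

    lan_serves_clients is a hypothetical flag standing in for the network
    role configured in Cluster Administrator.
    """
    return lan_serves_clients and heartbeat_overdue(last_heartbeat_s, now_s)
```

For example, with the last heartbeat seen at 0.0 seconds, a check at 3.0 seconds on a client-facing LAN returns True, while a check at 2.0 seconds returns False.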

The Cluster service tests LAN connectivity by using Internet Control Message Protocol (ICMP) echo requests to determine the scope of the network interface failure. For example, if the nodes on your server cluster are unable to communicate with each other, but one of the nodes is able to communicate with an external host, then the network interface on that node remains up, and that node, if designated a possible owner, takes ownership of the cluster resources that are dependent on client LAN connectivity. Because ICMP echo requests consume LAN resources, they are used only as a secondary method of determining a failure.

Server cluster network interfaces that are configured only for private communication between nodes behave differently when a LAN failure is detected. Because of this, the private LAN should be isolated, so that the cluster nodes are the only computers connected to the segment and only one LAN resides on the segment. Other private LANs for the same cluster must be isolated on a different segment. To create the isolated segment, you may use a hub, or in the case of a two-node server cluster, you may use a crossover cable.

Based on these requirements, there are no external hosts for use in determining the extent of the failure. If there is no alternate LAN for private cluster communication, the Cluster service must use the quorum device to arbitrate which node should remain up and running. Otherwise, an alternate available LAN is used for private cluster communications. Note that this process does not take into account the status of LANs designated for client use only.
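The choice between an alternate LAN and quorum arbitration can be sketched as follows. This is an illustrative model under stated assumptions, not the Cluster service's code: the `Lan` type, the role labels ("private", "mixed", "client"), and the return values are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Lan:
    name: str
    role: str      # hypothetical labels: "private", "mixed", or "client"
    is_up: bool


def recover_private_communication(failed_lan: str,
                                  lans: List[Lan]) -> Tuple[str, Optional[str]]:
    """Pick an alternate LAN for private traffic, or fall back to the quorum.

    LANs designated for client use only are ignored, as the article notes.
    """
    for lan in lans:
        if lan.name == failed_lan or lan.role == "client":
            continue
        if lan.is_up:
            return ("alternate_lan", lan.name)
    # No usable alternate: the quorum device arbitrates which node stays up.
    return ("quorum_arbitration", None)
```

With a failed private LAN and only a client-only LAN remaining, the sketch returns `("quorum_arbitration", None)`; if a second private or mixed LAN is up, it returns that LAN's name instead.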

Network Interface States

Unavailable

The owning node is down.

Failed

Reports that other interfaces on the LAN can communicate with each other or with external hosts, while the local interface cannot. The possible causes for this state are:
  • Network adapter failure.


  • Network adapter driver failure.


  • Local cable failure.


  • Port failure on the device that the network adapter is connected to.


Unreachable

Cannot communicate with at least one other interface whose state is neither Failed nor Unavailable.

Up

Can communicate with all other interfaces on the LAN whose states are neither Failed nor Unavailable. This is the normal operational state.
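The four interface states above can be summarized in a small classification sketch. The inputs and the order in which the checks are applied are assumptions made for illustration; the article defines the states but does not specify how the Cluster service evaluates them internally.

```python
from enum import Enum


class InterfaceState(Enum):
    UNAVAILABLE = "Unavailable"
    FAILED = "Failed"
    UNREACHABLE = "Unreachable"
    UP = "Up"


def classify_interface(node_is_up: bool,
                       local_can_communicate: bool,
                       others_can_communicate: bool,
                       reaches_all_healthy_peers: bool) -> InterfaceState:
    """Map observations about one interface to a state (hypothetical inputs).

    local_can_communicate: this interface reaches some peer or external host.
    others_can_communicate: other interfaces reach each other or external hosts.
    reaches_all_healthy_peers: this interface reaches every peer whose state
        is neither Failed nor Unavailable.
    """
    if not node_is_up:
        return InterfaceState.UNAVAILABLE
    if others_can_communicate and not local_can_communicate:
        return InterfaceState.FAILED
    if not reaches_all_healthy_peers:
        return InterfaceState.UNREACHABLE
    return InterfaceState.UP
```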

Network States

Unavailable

All interfaces defined on this cluster network are Unavailable.

Down

All network interfaces defined on this cluster network have lost communication with each other and with all known external hosts. All connected network interfaces on up nodes are in either the Failed or the Unreachable state. Therefore, all Transmission Control Protocol/Internet Protocol (TCP/IP) address resources that are defined on the same subnet, and all resources that depend on these resources, do not work and are unavailable on the LAN.

Partitioned

One or more network interfaces are in the Unreachable state, but at least two interfaces can still communicate with each other or with an external host.

NOTE: This only applies to server clusters that have two or more nodes.

Up

All network interfaces defined on this cluster network that are neither Failed nor Unavailable can communicate. This is the normal operational state.

In the following examples, only one LAN in the server cluster is configured for client-to-cluster (public) communication, and this LAN is lost.
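As a reference for reading the examples, the network state definitions above can be sketched as a function of the interface states. This is a simplified model, not the Cluster service's algorithm: the `connectivity_remains` flag (at least two interfaces can still communicate with each other or with an external host) and the order of the checks are assumptions, and the list of states is assumed to be non-empty.

```python
from enum import Enum
from typing import List


class State(Enum):
    UNAVAILABLE = "Unavailable"
    FAILED = "Failed"
    UNREACHABLE = "Unreachable"
    UP = "Up"
    DOWN = "Down"
    PARTITIONED = "Partitioned"


def network_state(interface_states: List[State],
                  connectivity_remains: bool) -> State:
    """Derive a network state from its interfaces' states (illustrative only)."""
    if all(s is State.UNAVAILABLE for s in interface_states):
        return State.UNAVAILABLE
    # Interfaces whose owning nodes are still up.
    on_up_nodes = [s for s in interface_states if s is not State.UNAVAILABLE]
    if State.UNREACHABLE in on_up_nodes:
        return State.PARTITIONED if connectivity_remains else State.DOWN
    if all(s is State.FAILED for s in on_up_nodes) and not connectivity_remains:
        return State.DOWN
    return State.UP
```

For a two-node cluster in which both interfaces are Unreachable and nothing can communicate, the sketch returns Down; with one interface Failed and the other Up, it returns Up.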

NOTE: Disabling media sense on each node in the cluster changes the cluster's behavior in these situations; the differences are noted in the examples below. For additional information about disabling media sense, see the following article in the Microsoft Knowledge Base:
Q239924 How to Disable Media Sense for TCP/IP in Windows 2000

Node A and Node B

Scenario

  • Node A and node B lose communication.


  • Node B can communicate with an external host.


  • Node A cannot communicate with any external hosts.


Results

  • The node A network interface state changes to Unreachable, then to Failed, and then the interface disappears from Cluster Administrator.


  • The node B network interface state is Unreachable, and then Up.


  • The Network state is Up.


  • Any resource groups that contain TCP/IP address resources dependent on the failed network interface fail over to node B.


Node A and Node B

Scenario

  • Node A and node B lose communication.


  • Node A and node B cannot communicate with any external hosts.


Results

  • The state of both node A and node B network interfaces is Unreachable, and they disappear from Cluster Administrator.


  • The Network state is Down, and the network disappears from Cluster Administrator. When the LAN connection is restored, this LAN inherits the default network role, which is to be used for both client and private communication. If a different role is needed, it must be configured manually.


  • No resource groups fail over. TCP/IP address resources dependent on that network fail, and all resources that are dependent on that TCP/IP address are taken offline.


Results with Media Sense Disabled

  • Both network interfaces are Unreachable until network connectivity can be re-established.


  • Network state remains Down until the LAN connection is restored. This retains the network role configuration.


  • The resources remain online.
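The documented results of the two scenarios can be collected in a small lookup table. The sketch below records only the combinations this article describes (a two-node cluster that has lost its single client LAN); it returns `None` for anything else rather than guessing. The `Outcome` type and the function name are hypothetical.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    node_a_interface: str
    node_b_interface: str
    network: str
    resources: str


# Key: (node A reaches an external host, node B reaches one, media sense enabled)
DOCUMENTED_OUTCOMES = {
    (False, True, True): Outcome(
        "Unreachable, then Failed, then removed", "Unreachable, then Up", "Up",
        "groups dependent on the failed interface fail over to node B"),
    (False, False, True): Outcome(
        "Unreachable, then removed", "Unreachable, then removed",
        "Down, then removed",
        "IP address resources fail; dependents go offline; no failover"),
    (False, False, False): Outcome(
        "Unreachable", "Unreachable", "Down (network role retained)",
        "resources remain online"),
}


def documented_outcome(a_reaches_host: bool, b_reaches_host: bool,
                       media_sense_enabled: bool = True) -> Optional[Outcome]:
    """Return the outcome described in this article, or None if not covered."""
    return DOCUMENTED_OUTCOMES.get(
        (a_reaches_host, b_reaches_host, media_sense_enabled))
```

Calling `documented_outcome(False, False, media_sense_enabled=False)` returns the entry in which both interfaces stay Unreachable and the resources remain online.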


NOTE: During a "rolling" upgrade from a Microsoft Windows NT Server 4.0, Enterprise Edition Cluster Server to a Windows 2000 Server Cluster, there is a point at which the cluster contains one Windows 2000 node and one Windows NT 4.0 node. In this case, the Windows 2000 node uses the Windows NT 4.0 interface state algorithm. When all nodes are running Windows 2000, they use the Windows 2000 interface state algorithm. For additional information about the Windows NT 4.0 interface state algorithm, see the following article in the Microsoft Knowledge Base:
Q176320 Impact of Network Adapter Failure in a Cluster

Keywords : kbnetwork
Version : WINDOWS:2000
Platform : WINDOWS
Issue type : kbinfo


Last Reviewed: January 27, 2000
© 2000 Microsoft Corporation. All rights reserved.