Interpreting the Cluster Log
This section includes cluster log output from two nodes at the time the intracluster network connection is broken. In this cluster, that connection is a single point of failure.
Node 2 owns the quorum resource when the connection is broken.
The following entry indicates that communication between the nodes has been lost. When the Cluster service detects such a loss, it holds I/O; later in this excerpt, after this node confirms ownership of the quorum resource, I/O is resumed. (A short parsing sketch in Python follows the excerpt.)
00000534.00000500::1999/10/21-23:09:01.999 [NM] Holding I/O.
.
.
.
00000534.0000057c::1999/10/21-23:09:02.108 [NM] Checking if we own the
quorum resource.
.
.
.
00000534.0000057c::1999/10/21-23:09:02.124 [FM] Successfully arbitrated
quorum resource a83b4084-3391-4618-890e-8794d4df923b.
.
.
.
00000534.00000500::1999/10/21-23:09:04.905 [ClMsg] Received interface
unreachable event for node 1 network 1
00000534.00000500::1999/10/21-23:09:04.905 [ClMsg] Received interface
unreachable event for node 1 network 2
00000534.0000052c::1999/10/21-23:09:04.905 [NM] Communication was lost
with interface 0bd641f7-7d8c-4d94-9279-d461846b299b (node: NODE1,
network: clients(1))
.
.
.
00000534.0000052c::1999/10/21-23:09:04.905 [NM] Communication was lost
with interface ddda464e-7c6d-4439-b27b-cd0da7957162 (node: NODE1,
network: interconnect)
.
.
.
00000534.0000057c::1999/10/21-23:09:09.123 [NM] Resuming I/O.
.
.
.
00000534.0000057c::1999/10/21-23:09:09.123 [EP] Nodes down event
received
.
.
.
00000534.00000464::1999/10/21-23:09:09.139 [DM] DmpEventHandler - Node
is down, turn quorum logging on...
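Each entry in these excerpts has the same layout: a process ID and thread ID in hexadecimal, a timestamp, an optional component tag in brackets (NM, FM, ClMsg, EP, DM, and so on), and the message text. The following is a minimal parsing sketch in Python that assumes only that observed layout; the field names and the io_hold_duration helper are illustrative, not part of any cluster tool.

import re
from datetime import datetime

# Layout observed in the excerpt: pid.tid::timestamp [component] message.
# Real cluster log lines are not wrapped; the wrapping above is only for print.
LINE_RE = re.compile(
    r'^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::'
    r'(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+'
    r'(?:\[(?P<comp>\w+)\]\s*)?(?P<msg>.*)$'
)

def parse_line(line):
    """Split one cluster log line into its fields, or return None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    entry = m.groupdict()
    entry['ts'] = datetime.strptime(entry['ts'], '%Y/%m/%d-%H:%M:%S.%f')
    return entry

def io_hold_duration(lines):
    """Seconds between the [NM] 'Holding I/O.' and 'Resuming I/O.' entries."""
    held_at = None
    for raw in lines:
        entry = parse_line(raw)
        if entry is None or entry['comp'] != 'NM':
            continue
        if entry['msg'].startswith('Holding I/O'):
            held_at = entry['ts']
        elif entry['msg'].startswith('Resuming I/O') and held_at is not None:
            return (entry['ts'] - held_at).total_seconds()
    return None

Applied to the node 2 entries above, the hold lasts roughly seven seconds (23:09:01.999 to 23:09:09.123).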
The following log entries are from node 1 and were generated by the same occurrence: the loss of the "interconnect" network used for cluster communications. The first indications are the entries that establish that the network interface is unreachable. Because the interface is unavailable, node 1 declares the other node down, which triggers the regroup events tagged RGP later in the log.
00000404.000004e4::1999/10/21-23:10:38.039 [ClMsg] Received interface
unreachable event for node 2 network 2
00000404.00000590::1999/10/21-23:10:38.039 [NM] Communication was lost
with interface 198ffe74-b7b9-41e5-b95a-25f618eb0c43 (node: NODE2,
network: interconnect)
00000404.000004e4::1999/10/21-23:10:42.914 [ClMsg] Received node down
event for node 2, epoch 0
.
.
.
00000404.00000374::1999/10/21-23:10:46.711 [NM] Checking if we own the
quorum resource.
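The interface-unreachable, communication-lost, and node-down messages are the entries to pull out when scanning a log for the start of a regroup. A small filter along those lines, assuming only the [ClMsg] and [NM] message texts shown in these excerpts (the function name is illustrative):

import re

# Message texts taken from the [ClMsg] and [NM] entries in the excerpts above.
EVENT_RE = re.compile(
    r'\[(?:ClMsg|NM)\]\s+('
    r'Received interface unreachable event.*'
    r'|Received node down event.*'
    r'|Communication was lost with interface.*)'
)

def connectivity_events(log_lines):
    """Yield connectivity-related messages in the order they appear."""
    for line in log_lines:
        m = EVENT_RE.search(line)
        if m:
            yield m.group(1)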
In the following entry, "error 1" is the Win32 code ERROR_INVALID_FUNCTION ("Incorrect function"). The Cluster service on this node could not write to the quorum disk (sector 12) while arbitrating for it, because the other node had already reserved the quorum disk after the bus reset that occurs earlier in arbitration (the reset entry is omitted from this excerpt).
00000388.0000059c::1999/10/21-23:10:50.351 Physical Disk <Disk E:>:
[DiskArb]Failed to write (sector 12), error 1.
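On a Windows system, Python's ctypes module can translate a Win32 error number such as this one into its message text, which is a quick way to check an unfamiliar code from the log; this is a general Win32 lookup, not a Cluster service facility.

import ctypes  # ctypes.FormatError is available on Windows only

# FormatError wraps the Win32 FormatMessage call; error 1 maps to
# ERROR_INVALID_FUNCTION, whose message text is "Incorrect function."
print(ctypes.FormatError(1))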
In the following entry, "status 1" means that arbitration for drive E:, the quorum disk, failed; that is, the other node successfully defended its reservation of the disk. The second and third entries that follow report the same failure:
00000388.0000059c::1999/10/21-23:10:50.351 Physical Disk <Disk E:>:
[DiskArb]Arbitrate returned status 1.
.
.
.
00000404.00000374::1999/10/21-23:10:50.351 [FM] Failed to arbitrate
quorum resource a83b4084-3391-4618-890e-8794d4df923b, error 1.
.
.
.
00000404.00000374::1999/10/21-23:10:50.351 [RGP] Node 1: REGROUP ERROR:
arbitration failed.
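A minimal sketch, again assuming only the [FM] message texts shown in these excerpts, for reporting whether a node won or lost the quorum arbitration (the function name and the 'won'/'lost' labels are illustrative):

import re

# Matches the [FM] arbitration entries seen above; real log lines are unwrapped.
ARB_RE = re.compile(
    r'\[FM\]\s+(Successfully arbitrated|Failed to arbitrate)\s+'
    r'quorum resource\s+(?P<guid>[0-9a-fA-F-]+)'
)

def arbitration_outcome(log_lines):
    """Return ('won'|'lost', quorum resource GUID) or None if no entry found."""
    for line in log_lines:
        m = ARB_RE.search(line)
        if m:
            outcome = 'won' if m.group(1).startswith('Successfully') else 'lost'
            return outcome, m.group('guid')
    return None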
Because arbitration has failed and the nodes are partitioned, the Cluster service on this node halts so that it no longer participates in the cluster.
00000404.00000374::1999/10/21-23:10:50.351 [NM] Halting this node due to
membership or communications error. Halt code = 1000
.
.
.
00000388.000003cc::1999/10/21-23:10:51.117 [RM] Going away, Status = 1,
Shutdown = 0.
.
.
.
00000388.0000052c::1999/10/21-23:10:51.148 [RM] NotifyChanges shutting
down.
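When a node removes itself from the cluster this way, the halt code in the [NM] entry is the detail worth capturing. A last sketch, assuming the message text shown above and a log read one unwrapped entry per line:

import re

# Matches the [NM] halt entry shown above; the excerpt wraps it, but in the
# log file the whole message, including "Halt code = 1000", is on one line.
HALT_RE = re.compile(
    r'\[NM\]\s+Halting this node due to (?P<reason>.+?)\.\s+'
    r'Halt code\s*=\s*(?P<code>\d+)'
)

def halt_event(log_lines):
    """Return (reason, halt code) from the first halt entry, or None."""
    for line in log_lines:
        m = HALT_RE.search(line)
        if m:
            return m.group('reason'), int(m.group('code'))
    return None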