Interpreting the Cluster Log


Techniques for Tracking the Source of a Problem

It's often helpful to go to the end of the log and work your way backward to find the oldest failure that caused your problem. However, the true cause of the failure might not be the first apparent cause you find when working backward. The initial failure can cause side effects that trigger other failures.

You can also use the following for tracking operations in the cluster log:

Timestamps

Scan the timestamps for breaks of one second or more to find periods of inactivity in the Cluster service. A break might indicate that all threads of activity have completed.

For example, consider the following entries from the logging of a cluster form operation and cluster join operation.

378.328::1999/06/09-18:00:52.006 [FM] FmpPropagateResourceState:

  resource 76d0bf88-5e2c-44b9-9bdb-3eb677487afb online event.

378.328::1999/06/09-18:00:52.006 [FM] FmpRmWorkItemHandler: Exit

378.3a8::1999/06/09-18:00:52.006 [GUM] GumSendUpdate:  Locker waiting

  type 0 context 11

378.3a8::1999/06/09-18:00:52.006 [GUM] Thread 0x3a8 UpdateLock wait on

  Type 0

378.3a8::1999/06/09-18:00:52.006 [GUM] DoLockingUpdate successful, lock

  granted to 1

378.3a8::1999/06/09-18:00:52.006 [GUM] GumSendUpdate: Locker dispatching

  seq 1673 type 0 context 11

378.3a8::1999/06/09-18:00:52.006 [GUM] GumpDoUnlockingUpdate releasing

  lock ownership

378.3a8::1999/06/09-18:00:52.006 [GUM] GumSendUpdate: completed update

  seq 1673 type 0 context 11

378.3c8::1999/06/09-18:00:53.053 [GUM] GumSendUpdate:  Locker waiting

  type 1 context 4098


The first two entries report the Failover Manager's initiation of a global update, and the second-to-last entry reports the completion of that update. Note the break of roughly one second between this entry and the last entry.
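This kind of scan is easy to automate. The following Python sketch (the function name and regular expression are illustrative, and it assumes the Windows 2000-style "pid.tid::YYYY/MM/DD-HH:MM:SS.mmm" entry prefix shown in the examples) flags every break of a second or more:

```python
import re
from datetime import datetime

# Entry prefix assumed from the examples in this chapter:
# "pid.tid::YYYY/MM/DD-HH:MM:SS.mmm ..."
TS_RE = re.compile(
    r"^[0-9a-fA-F]+\.[0-9a-fA-F]+::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})")

def find_gaps(lines, min_gap=1.0):
    """Yield (previous_entry, next_entry, gap_in_seconds) for each break >= min_gap."""
    prev_ts = prev_line = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip continuation lines, which carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        if prev_ts is not None and (ts - prev_ts).total_seconds() >= min_gap:
            yield prev_line, line, (ts - prev_ts).total_seconds()
        prev_ts, prev_line = ts, line
```

Run against the excerpt above, this isolates the break between the last two entries without your having to eyeball every timestamp.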

Occasionally, timestamps violate chronological order. For example, note that in the following example from a Microsoft® Windows NT® 4.0 cluster, the entry with the timestamp 01-20:43:09.257 ("[GUM] Thread 0x7f UpdateLock wait on Type 1") appears out of order.

07b::01-20:43:09.304 [GUM] GumSendUpdate: completed update seq 1191

  type 1 context 0

07f::01-20:43:09.257 [GUM] Thread 0x7f UpdateLock wait on Type 1

07f::01-20:43:09.304 [GUM] DoLockingUpdate successful, lock granted to 2


This happens because a logging thread obtains the system time first and then acquires access to the log file (also known as acquiring the print lock). In this case, thread 07b acquired the print lock before thread 07f, which had already obtained its timestamp but not yet the print lock.

To find out why 07f didn't acquire the print lock as soon as it obtained the time, scan for the 07f thread ID. The following two entries show that 07f was interrupted by 0ae in the middle of a GUM update:

07f::01-20:43:09.257 [GUM] GumSendUpdate:  Locker waiting type 1

  context 4098

0ae::01-20:43:09.265 [GUM] GumSendUpdate:  Locker waiting type 0

  context 8
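Scanning for a thread ID is also easy to automate. The following Python sketch (the function name is illustrative) pulls out every entry stamped with a given thread ID, accepting either the NT 4.0-style "tid::" prefix or the Windows 2000-style "pid.tid::" prefix shown in the examples:

```python
def thread_entries(lines, thread_id):
    """Collect log lines stamped with the given thread ID.

    Accepts either the NT 4.0-style "tid::" prefix or the Windows
    2000-style "pid.tid::" prefix (an assumption based on the
    examples in this chapter). Leading zeros are ignored so that
    "07f" and "7f" match the same thread.
    """
    wanted = thread_id.lstrip("0")
    picked = []
    for line in lines:
        prefix = line.split("::", 1)[0]
        tid = prefix.split(".")[-1]  # drop the process ID, if present
        if "::" in line and tid.lstrip("0") == wanted:
            picked.append(line)
    return picked
```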


GUIDs, Resources, and Groups

Because the Cluster service often doesn't include a resource's display name in entries written to the cluster log, you need to track resource GUIDs as well as display names. The entry reporting an offline call for a given resource reliably includes both the display name and the GUID, so it is a good reference point for correlating a resource's GUID with its display name.

For example, the following entry tells you that resource "254ef0e5-1937-11d3-b3fe-00a0c986aa14" is the Cluster IP Address resource:

378.130::1999/06/09-18:02:24.189 [FM] FmpOfflineResource: Offline

  resource <Cluster IP Address> <254ef0e5-1937-11d3-b3fe-00a0c986aa14>


Another dependable correlation of a resource's display name and GUID is the entry the Failover Manager [FM] writes during resource creation, which names the resource. The following is an example:

378.380::1999/06/09-18:00:51.209 [FM] Name for Resource 

  191c151b-a8e5-45a6-8330-d38669845e66 is 'testwinfile'.
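Because these correlation entries are scarce, it's worth harvesting them all in one pass. The following Python sketch builds a GUID-to-display-name map from the two entry shapes shown above (the regular expressions are inferred from those examples; a real log may contain other shapes as well):

```python
import re

# Entry shapes inferred from the examples in this chapter:
#   [FM] FmpOfflineResource: Offline resource <Name> <GUID>
#   [FM] Name for Resource|Group GUID is 'Name'.
# \s+ spans the line breaks that the log wraps entries across.
OFFLINE_RE = re.compile(r"Offline\s+resource\s+<([^>]+)>\s+<([0-9a-f-]+)>")
NAME_FOR_RE = re.compile(r"Name for (?:Resource|Group)\s+([0-9a-f-]+)\s+is\s+'([^']+)'")

def guid_names(text):
    """Build a GUID -> display-name map from the raw text of a cluster log."""
    names = {}
    for name, guid in OFFLINE_RE.findall(text):
        names[guid] = name
    for guid, name in NAME_FOR_RE.findall(text):
        names[guid] = name
    return names
```

With the map in hand, you can substitute display names for GUIDs while reading the rest of the log.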


The Cluster service's model for bringing resources online is to bring everything online as quickly as possible. If there are hundreds of resources, the Cluster service tries to bring them all online simultaneously.

Creating Groups and Resources

The following set of entries logs the successful creation of a group and a resource at startup of the Cluster service. In both cases, you must track the object by its GUID. Note, however, the "Name for" entry, which supplies the display name of the group or resource. Note also that the Failover Manager is the component responsible for creating and initializing groups and resources at startup.

In the first two entries, the Failover Manager creates the group in its memory and initializes it with the values for the group in the cluster registry key.

378.380::1999/06/09-18:00:51.209 [FM] Creating group 

  d103ed17-f7bc-4ff9-8a64-e692f4ecb663

378.380::1999/06/09-18:00:51.209 [FM] Initializing group 

  d103ed17-f7bc-4ff9-8a64-e692f4ecb663 from the registry.


The following is the "Name for" entry for the group.

378.380::1999/06/09-18:00:51.209 [FM] Name for Group 

  d103ed17-f7bc-4ff9-8a64-e692f4ecb663 is 'test group'.


The Failover Manager creates the list of the nodes that are the group's preferred owners.

378.380::1999/06/09-18:00:51.209 [FM] Group 

  d103ed17-f7bc-4ff9-8a64-e692f4ecb663 preferred owner 1.

378.380::1999/06/09-18:00:51.209 [FM] Group 

  d103ed17-f7bc-4ff9-8a64-e692f4ecb663 preferred owner 2.


The Failover Manager creates the list of the resources that the group owns.

378.380::1999/06/09-18:00:51.209 [FM] Group 

  d103ed17-f7bc-4ff9-8a64-e692f4ecb663 contains Resource 

  191c151b-a8e5-45a6-8330-d38669845e66.


Having created the group, the Failover Manager creates resource 191c151b-a8e5-45a6-8330-d38669845e66 and then initializes it with its values from the cluster database (the cluster key in the registry).

378.380::1999/06/09-18:00:51.209 [FM] Creating resource 

  191c151b-a8e5-45a6-8330-d38669845e66

378.380::1999/06/09-18:00:51.209 [FM] Initializing resource 

  191c151b-a8e5-45a6-8330-d38669845e66 from the registry.


The Failover Manager creates the list of possible owners for the resource.

378.380::1999/06/09-18:00:51.209 [FM] FmpAddPossibleEntry: adding node 1

  as possible host for resource 191c151b-a8e5-45a6-8330-d38669845e66.

378.380::1999/06/09-18:00:51.209 [FM] FmpAddPossibleEntry: adding node 2

  as possible host for resource 191c151b-a8e5-45a6-8330-d38669845e66.


The Failover Manager notes that the resource's dependencies have already been created.

378.380::1999/06/09-18:00:51.209 [FM] All dependencies for resource

  191c151b-a8e5-45a6-8330-d38669845e66 created.


Process and Thread IDs

Process and thread IDs help you trace activities and interactions. For example, in the section "Joining a Cluster (Unsuccessful Attempt)" later in the chapter, each thread spawned to connect to the sponsoring node has a distinct thread ID.

When an instance of an executable file (such as the Cluster service or the Resource Monitor) starts, it starts a new process and therefore gets a new process ID. Knowing this, you can examine the activity associated with a process ID to deduce which executable file's process it represents.

Tracing process and thread IDs is especially helpful for finding out why a particular process is taking too long.

One caveat, however: Related events can have different thread IDs in the cluster log, even though they report activity that occurs as a single logical operation.
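To see at a glance which activity belongs to which process, you can bucket entries by process ID. A minimal Python sketch (assuming the Windows 2000-style "pid.tid::" prefix; NT 4.0-style entries carry only a thread ID and are skipped):

```python
from collections import defaultdict

def entries_by_process(lines):
    """Group log lines by the process ID in a "pid.tid::" prefix."""
    groups = defaultdict(list)
    for line in lines:
        prefix = line.split("::", 1)[0]
        if "::" in line and "." in prefix:
            groups[prefix.split(".", 1)[0]].append(line)
    return groups
```

Each bucket then shows one process's entries in isolation; for example, Cluster service entries and Resource Monitor entries fall into different buckets.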

GUM Updates and Sequence Numbers

One of the most frequently repeated patterns you'll find in cluster logs is that of a global update, performed by the Global Update Manager (GUM) when there is a change to the cluster database. The following group of entries is a typical global update.

The GUM uses a unique sequence number for each transaction, incrementing the sequence number by one for each new update. In this group of entries, "1657" is the sequence number. Note that this update propagates two changes: a membership change (type 2 context 17) and a resource-state change (type 0 context 8). For context numbers of GUM updates, see "Context Numbers" later in this chapter.

378.380::1999/06/09-18:00:50.897 [GUM] GumSendUpdate:  Locker waiting

  type 2 context 17

378.380::1999/06/09-18:00:50.897 [GUM] Thread 0x380 UpdateLock wait on

  Type 2

378.380::1999/06/09-18:00:50.897 [GUM] DoLockingUpdate successful, lock

  granted to 1

378.380::1999/06/09-18:00:50.897 [GUM] GumSendUpdate: Locker dispatching

  seq 1657    type 2 context 17

378.380::1999/06/09-18:00:50.897 [dm] DmBeginLocalUpdate Entry

378.380::1999/06/09-18:00:50.897 [dm] DmBeginLocalUpdate Exit,

  pLocalXsaction=0x000a0dd8 dwError=0x00000000

378.380::1999/06/09-18:00:47.928 [GUM] GumSendUpdate: Locker dispatching

  seq 1657    type 0 context 8

378.380::1999/06/09-18:00:47.928 [GUM] GumpDoUnlockingUpdate releasing

  lock ownership

378.380::1999/06/09-18:00:47.928 [GUM] GumSendUpdate: completed update

  seq 1657    type 0 context 8


GUM sequence numbers are useful for correlating the activities of two or more nodes in a cluster. For instance, a "Locker dispatching seq xxx" message on the sending node has a corresponding "s_GumQueueLockingUpdate: dispatching seq xxx" message in the receiving node's log, where "xxx" is the same sequence number. This can help you synchronize time periods between two logs.

In the following pair of examples, note the continuity of the sequence number ("seq 1732"). Note also that the timestamps for the two nodes are not exactly synchronized: the timestamp for the "GumSendUpdate: Dispatching" entry on the sending node is later than the timestamp for the "s_GumQueueLockingUpdate: dispatching" entry on the receiving node. Thus, you need the sequence number, not the timestamps, to correlate the update between the two nodes.

The following entries were logged on the node issuing a GUM update:

378.390::1999/06/09-18:16:29.671 [GUM] GumSendUpdate: Dispatching 

  seq 1732    type 2 context 17 to node 2

.

.

.

378.390::1999/06/09-18:16:29.718 [GUM] GumSendUpdate: completed update

  seq 1732    type 2 context 17



The following entries were logged on the node receiving the same GUM update:

388.55c::1999/06/09-18:16:29.600 s_GumQueueLockingUpdate: dispatching

  seq 1732    type 2 context 17

.

.

.

388.55c::1999/06/09-18:16:29.662 s_GumQueueLockingUpdate: completed

  update seq 1732    type 2 context 17 result 0
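Finding the sequence numbers that two nodes' logs have in common can also be scripted. The following Python sketch is deliberately crude: it simply intersects every "seq N" token it can find in the raw text of the two logs (an assumption based on the entry shapes shown above):

```python
import re

# "seq N" appears in GumSendUpdate and s_GumQueueLockingUpdate entries alike.
SEQ_RE = re.compile(r"\bseq\s+(\d+)")

def shared_sequences(log_a, log_b):
    """Sequence numbers present in both nodes' logs -- anchor points for
    lining up the two time lines despite unsynchronized clocks."""
    seqs_a = set(int(s) for s in SEQ_RE.findall(log_a))
    seqs_b = set(int(s) for s in SEQ_RE.findall(log_b))
    return sorted(seqs_a & seqs_b)
```

Each shared sequence number gives you one fixed point at which the two logs describe the same update, regardless of clock skew.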


Shared Locks and gdwQuoBlockingResources

When something in your cluster is preventing the Quorum resource from being taken offline (as in a failover or move of the Quorum resource from one node to the other), you can use the gdwQuoBlockingResources count of shared locks to identify the root cause of the problem.

A lock (called the quorum lock) prevents access to the Quorum resource by components other than the holder of the lock. There are two modes for the lock on the Quorum resource: an exclusive lock, which the Cluster service acquires in order to take the Quorum resource offline, and shared locks, which components hold while performing operations that require the Quorum resource to be online.

Clearly, acquisition of an exclusive lock must not supersede existing shared locks. That is, the Cluster service cannot acquire an exclusive lock and take the Quorum resource offline while components are in the middle of operations that require the Quorum resource to be online.

On the other hand, after existing operations that require access to the quorum device have completed their tasks, the Cluster service must have an opportunity to take the Quorum resource offline.

The Cluster service solves this problem with the gdwQuoBlockingResources global variable and a rule. The gdwQuoBlockingResources global variable is a count of the existing shared locks: the count is incremented when a component acquires a shared lock and decremented when the component releases it. The rule is that a request for an exclusive lock is queued behind existing shared locks; the exclusive lock is granted only after the count reaches zero.

If there is a request to take the Quorum resource offline and the count is not zero, the InterlockedIncrement and InterlockedDecrement entries include the count of currently held shared locks.

The following "InterlockedIncrement/InterlockedDecrement" pair of entries bracket the acquisition of shared locks by the Network Name resource as the resource is taken offline. If the resource could not be taken offline, preventing the gdwQuoBlockingResources count from reaching zero, you would find the problem with the resource here, between the "InterlockedIncrement/InterlockedDecrement" pair of entries.

388.4e8::1999/06/09-20:20:56.640 [FM] FmpRmOfflineResource:

  InterlockedIncrement on gdwQuoBlockingResources for resource

  254ef0e6-1937-11d3-b3fe-00a0c986aa14

4b4.450::1999/06/09-20:20:56.906 Network Name <Cluster Name>: Taking

  resource offline...

4b4.450::1999/06/09-20:20:56.906 Network Name <Cluster Name>: Offline of

  resource continuing...

4b4.450::1999/06/09-20:20:57.078 Network Name <Cluster Name>: Failed to

  delete server name CLUSTER1, status 2114.

4b4.450::1999/06/09-20:20:57.078 Network Name <Cluster Name>: Failed to

  delete server name CLUSTER1, status 2114.

4b4.450::1999/06/09-20:20:57.078 Network Name <Cluster Name>: Deleted

  workstation name CLUSTER1 from transport 0.

4b4.450::1999/06/09-20:20:57.187 Network Name <Cluster Name>: Resource

  is now offline

388.4e8::1999/06/09-20:20:57.187 [FM] FmpRmOfflineResource:

  InterlockedDecrement on gdwQuoBlockingResources for resource

  254ef0e6-1937-11d3-b3fe-00a0c986aa14
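One way to spot the culprit mechanically is to count the increment and decrement entries per resource GUID: a resource whose increments outnumber its decrements is still holding a shared lock. A Python sketch (the regular expressions are inferred from the entries above; a real log may vary):

```python
import re

# Entry shapes inferred from the examples in this chapter; \s+ spans the
# line breaks that the log wraps entries across.
INC_RE = re.compile(
    r"InterlockedIncrement on gdwQuoBlockingResources\s+for resource\s+([0-9a-f-]+)")
DEC_RE = re.compile(
    r"InterlockedDecrement on gdwQuoBlockingResources\s+for resource\s+([0-9a-f-]+)")

def unreleased_locks(text):
    """Resource GUIDs whose shared-lock increments outnumber their
    decrements -- candidates for whatever is keeping the Quorum
    resource from going offline."""
    counts = {}
    for guid in INC_RE.findall(text):
        counts[guid] = counts.get(guid, 0) + 1
    for guid in DEC_RE.findall(text):
        counts[guid] = counts.get(guid, 0) - 1
    return [guid for guid, n in counts.items() if n > 0]
```

A healthy log, like the excerpt above, yields an empty list; a GUID left in the list marks the span of the log where the resource failed to release its shared lock.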


The next section discusses what the cluster log reveals about cluster activities.

© 1985-2000 Microsoft Corporation. All rights reserved.