Interpreting the Cluster Log


Anatomy of a Cluster Log Entry

Though cluster log entries might seem impenetrable at first glance, they are actually fairly easy to parse. Consider the following entry, the first line in the body of a typical cluster log:

378.32c::1999/06/09-18:00:18.874 Cluster service started - Cluster Node
  Version 3.2051


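Its main elements, common to every line of the log, are the following, starting from the beginning of the entry: the process ID and thread ID (378.32c in this example, followed by two colons), the timestamp (1999/06/09-18:00:18.874), and the event description (the remainder of the entry).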
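The layout of the sample entry can be sketched in a few lines of Python. This is a minimal illustration, not part of the Cluster service or its tools: the regular expression is inferred from the example above, the names ENTRY_RE and parse_entry are hypothetical, and treating both IDs as hexadecimal is an assumption based on the letter in "32c".

```python
import re
from datetime import datetime

# Assumed layout, inferred from the sample entry:
#   <process ID>.<thread ID>::<timestamp> <event description>
ENTRY_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<desc>.*)$"
)

def parse_entry(line):
    """Split one cluster log line into its common elements, or return None."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return {
        # Assumption: both IDs are written in hexadecimal.
        "process_id": int(m.group("pid"), 16),
        "thread_id": int(m.group("tid"), 16),
        "timestamp": datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"),
        "description": m.group("desc"),
    }

# The sample entry, written on a single line.
entry = parse_entry(
    "378.32c::1999/06/09-18:00:18.874 "
    "Cluster service started - Cluster Node Version 3.2051"
)
```

Lines that do not match the assumed layout return None, so wrapped continuation lines can be detected and joined to the previous entry before parsing.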

There are two types of cluster log entries: component event log entries and resource DLL log entries.

Component Event Log Entries

The Cluster service comprises a number of components, such as the Database Manager and the Global Update Manager. The logging of their interactions is what makes the cluster log such a powerful diagnostic tool.

Here is a typical example of a cluster log entry for a component event:

378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.


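Entries describing component events follow the process/thread ID and timestamp with the abbreviated component name in square brackets ([NM] in this example, for Node Manager) and then the event description.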
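As a rough sketch of how the component abbreviation can be separated from the event description, the following Python fragment works on the sample entry above. The pattern and the names COMPONENT_RE and split_component are illustrative assumptions, not anything defined by the Cluster service.

```python
import re

# Assumed: the text after the timestamp starts with a bracketed abbreviation.
COMPONENT_RE = re.compile(r"^\[(?P<component>[A-Za-z]+)\]\s*(?P<event>.*)$")

def split_component(event_text):
    """Return (abbreviation, description); abbreviation is None if absent."""
    m = COMPONENT_RE.match(event_text)
    if m is None:
        return None, event_text
    return m.group("component"), m.group("event")

line = "378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership."

# Everything after the first space follows the process/thread ID and timestamp.
_, event_text = line.split(" ", 1)
component, event = split_component(event_text)
```

Here component holds the abbreviation without its brackets, ready to be looked up in Table 20.1.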


Meanings of Abbreviations

Cluster log abbreviations for components and node states are shown in Table 20.1.

Table 20.1 Cluster Log Abbreviations for Components and Node States

Abbreviation Node state or component
[API] API support. These entries come from the Cluster service component that provides support for the Server Cluster API.
[ClMsg] Cluster messaging. The component that Regroup (also known as Membership Manager — see later in this table) uses to send and receive its messages.
[ClNet] Cluster network engine. Generic code to determine a node's network configuration.
[CP] Checkpoint Manager. If a resource has its registry key registered for checkpointing, the Checkpoint Manager monitors any changes to the key while the resource is online and writes a checkpoint to the quorum disk whenever there is a change to the registered key. On the node to which the resource is being failed over, the resource key in the registry is updated with the resource key's checkpoint before the resource is brought online.
[CS] Cluster service. This abbreviation is assigned to messages that come out of the Cluster service rather than one of its components.
[DM] Database Manager. The agent through which other components read or make changes to the cluster configuration database.
[EP] Event Processor. Components of the Cluster service register with the Event Processor to receive internal cluster events, such as a node's going up or down.
[FM] Failover Manager. Coordinates the moving of a group from one node to another based on failure criteria specified by the group's properties.
[GUM] Global Update Manager. A cluster-wide, broadcast-like remote procedure call (RPC) mechanism used to distribute information to all nodes in the cluster.
[INIT] The initial state of a node prior to joining or forming a cluster.
[JOIN] The node state that follows [INIT] when the node attempts to join a cluster. If the join operation succeeds, the state of the node then moves to cluster member.
[LM] Log Manager. Maintains the quorum log.
[MM] Membership Manager, also known and written to the cluster log as Regroup ([RGP]). See [RGP] in this table.
[NM] Node Manager. Keeps track of the state of other nodes in the cluster as well as maintaining the cluster-wide network configuration.
[OM] Object Manager. Maintains an in-memory database of entities, or objects (nodes, networks, groups, and so on). Each object has an associated type and a set of methods with which other components can manipulate it. Each cluster object is represented in the Object Manager space. The Object Manager does not differentiate between types of objects.
[RGP] Regroup, also known and written to the cluster log as Membership Manager ([MM]). Tracks which nodes are members of the cluster. Regroup writes entries to the log during initialization, form operations, and join operations, and when cluster membership changes.
[RM] Resource Monitor. Any of the processes (instances of Resrcmon.exe) of the Cluster service that actually monitor individual resources.
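When scanning a long log, it can help to see which components are writing the most entries. The sketch below copies the abbreviations from Table 20.1 into a Python dictionary and tallies sample lines by component. The names COMPONENTS, TAG_RE, and tally_components are hypothetical, and the assumption that the first bracketed tag on a line identifies the component is taken from the examples in this section.

```python
import re
from collections import Counter

# Abbreviations copied from Table 20.1 ([INIT] and [JOIN] are node states).
COMPONENTS = {
    "API": "API support", "ClMsg": "Cluster messaging",
    "ClNet": "Cluster network engine", "CP": "Checkpoint Manager",
    "CS": "Cluster service", "DM": "Database Manager",
    "EP": "Event Processor", "FM": "Failover Manager",
    "GUM": "Global Update Manager", "INIT": "Initial node state",
    "JOIN": "Joining node state", "LM": "Log Manager",
    "MM": "Membership Manager", "NM": "Node Manager",
    "OM": "Object Manager", "RGP": "Regroup", "RM": "Resource Monitor",
}

# Assumed: the first bracketed tag after a space is the component abbreviation.
TAG_RE = re.compile(r"\s\[([A-Za-z]+)\]")

def tally_components(lines):
    """Count log lines per component, skipping tags not in Table 20.1."""
    counts = Counter()
    for line in lines:
        m = TAG_RE.search(line)
        if m and m.group(1) in COMPONENTS:
            counts[COMPONENTS[m.group(1)]] += 1
    return counts

sample = [
    "378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.",
    "388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that node 2 has failed with error 5.",
    "388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning.",
    "15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999).",
]
counts = tally_components(sample)
```

The resource DLL entry is skipped because its bracketed tag is not one of the abbreviations in the table.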

Resource DLL Log Entries

Because resource groups are the basic unit of failover, resource DLL entries are key to understanding cluster activity. The following is a cluster log entry for a resource DLL event, in this case one of the entries from the disk arbitration process.

15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB]
  Arbitration Parameters (1 9999).


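Instead of listing an abbreviated component name between the timestamp and event description as component log entries do, entries describing resource DLL events list the resource type (Physical Disk in this example) and the resource name in angle brackets (<Disk D:>), followed by a colon and the event description.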

The event description in this example is "[DISKARB] Arbitration Parameters (1 9999)."
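The same entry can be taken apart programmatically. The following Python sketch assumes that the text before the angle brackets is the resource type and the text inside them is the resource name, which is a reading of this one example rather than a documented grammar. RESOURCE_RE and parse_resource_entry are hypothetical names.

```python
import re

# Assumed layout after the timestamp:
#   <resource type> <<resource name>>: <event description>
RESOURCE_RE = re.compile(
    r"^(?P<rtype>.+?) <(?P<rname>[^>]*)>:\s*(?P<event>.*)$"
)

def parse_resource_entry(line):
    """Return resource type, resource name, and event description, or None."""
    # Drop the process/thread ID and timestamp, which end at the first space.
    _, rest = line.split(" ", 1)
    m = RESOURCE_RE.match(rest)
    if m is None:
        return None
    return m.group("rtype"), m.group("rname"), m.group("event")

# The sample entry, written on a single line.
parsed = parse_resource_entry(
    "15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: "
    "[DISKARB] Arbitration Parameters (1 9999)."
)
```

A component event entry such as the [NM] example has no angle-bracketed name, so the function returns None for it, which gives a simple way to tell the two entry types apart.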

Meanings of State Codes and Status Codes

Interpreting state and status codes is crucial to deciphering the cluster log. Doing so is not difficult. The following two procedures tell you how.

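To find the meaning of status codes

At the command prompt, type the following, replacing [error_number] with the number from the log entry:

net helpmsg [error_number]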


For example, in the following entry, "error 5" gives the error number. Running net helpmsg 5 returns "Access is denied." The error indicates that the problem has to do with permissions.

388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that node 2 has
  failed with error 5.


Remember that you still need to study the context of the error to discover its cause.
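The lookup can also be scripted. The Python sketch below pulls the number out of an entry and builds the corresponding net helpmsg command. It assumes the entry uses the wording "error N", as in the example above, and it runs the command only on Windows, where net is available. The function names are hypothetical.

```python
import re
import subprocess
import sys

# Assumed wording, taken from the sample entry: "... with error 5."
ERROR_RE = re.compile(r"\berror (\d+)\b")

def find_error_number(text):
    """Return the first 'error N' number in the text, or None."""
    m = ERROR_RE.search(text)
    return int(m.group(1)) if m else None

def helpmsg_command(error_number):
    """Build the argument list for 'net helpmsg <error_number>'."""
    return ["net", "helpmsg", str(error_number)]

entry = ("388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that "
         "node 2 has failed with error 5.")
error_number = find_error_number(entry)
command = helpmsg_command(error_number)

# The net command exists only on Windows, so skip the call elsewhere.
if sys.platform == "win32":
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout.strip())
```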

To find the meaning of state codes

  1. From the event description in the cluster log, note the type of object — group, resource, node state, network, or net interface — associated with the entry.

    Note

    In Windows 2000, the resource name is logged with its GUID when the Cluster service is started. You can expect to find the GUID and the resource name associated in a single entry during resource creation if these log entries have not been overwritten.

    You can also find the resource name in the registry, in the resource's subkey, which is identified by the resource's GUID. For more information about finding the resource name in the registry, see "Identifying GUIDs in the Registry" later in this chapter.

  2. In "State Codes" later in this chapter, find the meaning in the appropriate table.

For example, in the following entry, "state 129" means ClusterResourceOnlinePending.

388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning.
  Resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14, state 129, status 997.
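The two steps above can be sketched in Python for resource entries. Only the meaning of state 129 is confirmed by this section. The other values in RESOURCE_STATES are recalled from the CLUSTER_RESOURCE_STATE enumeration in the Server Cluster API and should be checked against the tables in "State Codes" before being relied on. The pattern and names are hypothetical, and the status code is left as a number for lookup with net helpmsg.

```python
import re

# Partial table. Only 129 is confirmed by the example in this section; the
# rest are assumptions to verify against "State Codes" later in the chapter.
RESOURCE_STATES = {
    2: "ClusterResourceOnline",
    3: "ClusterResourceOffline",
    4: "ClusterResourceFailed",
    128: "ClusterResourcePending",
    129: "ClusterResourceOnlinePending",
    130: "ClusterResourceOfflinePending",
}

# Assumed wording, taken from the sample entry.
STATE_RE = re.compile(
    r"Resource (?P<guid>[0-9a-fA-F-]{36}), "
    r"state (?P<state>\d+), status (?P<status>\d+)"
)

def decode_resource_state(text):
    """Return (GUID, state name, status code) for a resource entry, or None."""
    m = STATE_RE.search(text)
    if m is None:
        return None
    state = int(m.group("state"))
    name = RESOURCE_STATES.get(state, "unknown state %d" % state)
    return m.group("guid"), name, int(m.group("status"))

decoded = decode_resource_state(
    "388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning. "
    "Resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14, state 129, status 997."
)
```

Groups, nodes, networks, and network interfaces use their own state tables, so a fuller version would select the table by object type, as step 1 of the procedure describes.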


© 1985-2000 Microsoft Corporation. All rights reserved.