Interpreting the Cluster Log

Anatomy of a Cluster Log Entry

Though cluster log entries might seem impenetrable at first glance, they are actually fairly easy to parse. Consider the following entry, the first line in the body of a typical cluster log:

378.32c::1999/06/09-18:00:18.874 Cluster service started - Cluster Node Version 3.2051

Its main elements, which are common to every entry in the log, are the following, in order from the beginning of the entry:

The IDs of the process and thread that issued the log entry. The two IDs are concatenated, separated by a period. In the preceding example, the process ID is 378 and the thread ID is 32c.
The timestamp in the following format, in Greenwich Mean Time (GMT):
yyyy/mm/dd-hh:mm:ss.sss

where:
- yyyy/mm/dd represents the year, month, and day.
- hh:mm:ss.sss represents the time of day on a 24-hour clock, carried out to the thousandths of a second.
In the preceding example, the time is 18.874 seconds past 6 o'clock in the evening, GMT.
The event description, such as "Cluster service started."
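
The three elements listed above can be pulled apart with a single regular expression. The following Python sketch is illustrative only: the names LogEntry and parse_entry are hypothetical, the IDs are kept as strings because the text does not state their number base, and the function assumes that an entry wrapped across lines has already been rejoined into one line.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    """Hypothetical container for the common elements of a cluster log entry."""
    process_id: str
    thread_id: str
    timestamp: datetime
    description: str


# process.thread::yyyy/mm/dd-hh:mm:ss.sss description
ENTRY_RE = re.compile(
    r"^(?P<pid>[0-9A-Fa-f]+)\.(?P<tid>[0-9A-Fa-f]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<desc>.*)$"
)


def parse_entry(line: str) -> Optional[LogEntry]:
    """Split one log line into its elements; return None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    # The log records times in GMT, so attach the UTC time zone.
    timestamp = datetime.strptime(
        match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)
    return LogEntry(
        match.group("pid"), match.group("tid"), timestamp, match.group("desc")
    )


entry = parse_entry(
    "378.32c::1999/06/09-18:00:18.874 "
    "Cluster service started - Cluster Node Version 3.2051"
)
```

Returning None for lines that do not match lets a caller skip headers or continuation lines without raising an exception.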

There are two types of cluster log entries: component event log entries and resource DLL log entries.

Component Event Log Entries

The Cluster service comprises a number of components, such as the Database Manager and the Global Update Manager. The logging of their interactions is what makes the cluster log such a powerful diagnostic tool.

Here is a typical example of a cluster log entry for a component event:

378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.

Entries describing component events follow the process/thread ID and timestamp with the following:

One of two types of abbreviations, enclosed in square brackets, such as [NM] or [JOIN]. The two types of abbreviations are:
- The component that wrote the event to the cluster log (in this entry, it is [NM], the Node Manager).
- The state of the node at the time the entry was written to the cluster log.
A node-state abbreviation reflects the operation that is in progress, such as [INIT] or [JOIN].

A cluster log abbreviation can combine a component and state, as in [NMJOIN], which combines the Node Manager abbreviation and the join operation:

388.55c::1999/06/09-18:08:25.621 [NMJOIN] Processing request by node 2 to begin joining.

The event description (for example, "Found the quorum resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14."). In the example, "254ef0e8-1937-11d3-b3fe-00a0c986aa14" is the globally unique identifier (GUID) for the resource that the component found. When the Cluster service creates a resource, one of the entries typically names the resource, which makes it easy to subsequently identify the resource in event descriptions by its GUID. The following is one such entry:

378.380::1999/06/09-18:00:51.193 [FM] Name for Resource 254ef0e5-1937-11d3-b3fe-00a0c986aa14 is 'Cluster IP Address'.
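
Because later entries refer to a resource only by its GUID, it can help to collect the naming entries into a lookup table first. The following Python sketch shows one way to do that. The function name resource_names is hypothetical, the pattern is modeled only on the single naming entry shown above, and each entry is assumed to occupy one line.

```python
import re
from typing import Dict, Iterable

# Modeled on: [FM] Name for Resource <GUID> is '<name>'.
NAME_RE = re.compile(
    r"\[FM\] Name for Resource\s+"
    r"(?P<guid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
    r"\s+is\s+'(?P<name>[^']*)'"
)


def resource_names(lines: Iterable[str]) -> Dict[str, str]:
    """Map each resource GUID (lowercased) to the name logged for it."""
    names: Dict[str, str] = {}
    for line in lines:
        match = NAME_RE.search(line)
        if match:
            names[match.group("guid").lower()] = match.group("name")
    return names


sample = [
    "378.380::1999/06/09-18:00:51.193 [FM] Name for Resource "
    "254ef0e5-1937-11d3-b3fe-00a0c986aa14 is 'Cluster IP Address'.",
]
names = resource_names(sample)
```

A GUID that appears in an event description can then be translated with names.get(guid), which returns None when the naming entry is no longer in the log.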

Meanings of Abbreviations

Cluster log abbreviations for components and node states are shown in Table 20.1.

Table 20.1 Cluster Log Abbreviations for Components and Node States

Abbreviation	Node state or component
[API]	API support. These entries come from the Cluster service component that provides support for the Server Cluster API.
[ClMsg]	Cluster messaging. The component that Regroup (also known as Membership Manager — see later in this table) uses to send and receive its messages.
[ClNet]	Cluster network engine. Generic code to determine a node's network configuration.
[CP]	Checkpoint Manager. If a resource has its registry key registered for checkpointing, the Checkpoint Manager monitors any changes to the key while the resource is online and writes a checkpoint to the quorum disk whenever there is a change to the registered key. On the node to which the resource is being failed over, the resource key in the registry is updated with the resource key's checkpoint before the resource is brought online.
[CS]	Cluster service. This abbreviation is assigned to messages that come out of the Cluster service rather than one of its components.
[DM]	Database Manager. The agent through which other components read or make changes to the cluster configuration database.
[EP]	Event Processor. Components of the Cluster service register with the Event Processor to receive internal cluster events, such as a node's going up or down.
[FM]	Failover Manager. Coordinates the moving of a group from one node to another based on failure criteria specified by the group's properties.
[GUM]	Global Update Manager. A cluster-wide, broadcast-like remote procedure call (RPC) mechanism used to distribute information to all nodes in the cluster.
[INIT]	The initial state of a node prior to joining or forming a cluster.
[JOIN]	The node state that follows [INIT] when the node attempts to join a cluster. If the join operation succeeds, the state of the node then moves to cluster member.
[LM]	Log Manager. Maintains the quorum log.
[MM]	Membership Manager, also known and written to the cluster log as Regroup ([RGP]). See [RGP] in this table.
[NM]	Node Manager. Keeps track of the state of other nodes in the cluster as well as maintaining the cluster-wide network configuration.
[OM]	Object Manager. Maintains an in-memory database of entities, or objects (nodes, networks, groups, and so on). Each object has an associated type and a set of methods with which other components can manipulate it. Each cluster object is represented in the Object Manager space. The Object Manager does not differentiate between types of objects.
[RGP]	Regroup, also known and written to the cluster log as Membership Manager ([MM]). Tracks which nodes are members of the cluster. Regroup writes entries to the log during initialization, form operations, and join operations, and when cluster membership changes.
[RM]	Resource Monitor. Any of the processes (instances of Resrcmon.exe) of the Cluster service that actually monitor individual resources.
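
Table 20.1 can be turned into a small lookup for annotating entries. The following Python sketch is illustrative: the dictionaries copy only the abbreviations and names from the table, the helper names first_tag and describe_tag are hypothetical, and the rule for splitting a combined abbreviation (a component followed by a node state, as in [NMJOIN]) is an assumption based on that one example.

```python
import re
from typing import List, Optional

# Components from Table 20.1.
COMPONENTS = {
    "API": "API support",
    "ClMsg": "Cluster messaging",
    "ClNet": "Cluster network engine",
    "CP": "Checkpoint Manager",
    "CS": "Cluster service",
    "DM": "Database Manager",
    "EP": "Event Processor",
    "FM": "Failover Manager",
    "GUM": "Global Update Manager",
    "LM": "Log Manager",
    "MM": "Membership Manager",
    "NM": "Node Manager",
    "OM": "Object Manager",
    "RGP": "Regroup",
    "RM": "Resource Monitor",
}

# Node states from Table 20.1.
NODE_STATES = {
    "INIT": "initial state",
    "JOIN": "joining a cluster",
}

TAG_RE = re.compile(r"\[([A-Za-z]+)\]")


def first_tag(line: str) -> Optional[str]:
    """Return the first bracketed abbreviation in a log line, if any."""
    match = TAG_RE.search(line)
    return match.group(1) if match else None


def describe_tag(tag: str) -> List[str]:
    """Expand an abbreviation; combined tags yield [component, state]."""
    if tag in COMPONENTS:
        return [COMPONENTS[tag]]
    if tag in NODE_STATES:
        return [NODE_STATES[tag]]
    for state, state_text in NODE_STATES.items():
        prefix = tag[: -len(state)]
        if tag.endswith(state) and prefix in COMPONENTS:
            return [COMPONENTS[prefix], state_text]
    return []  # not listed in Table 20.1
```

An empty list signals an abbreviation that the table does not cover, such as the bracketed tags that resource DLLs write inside their own event descriptions.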

Resource DLL Log Entries

Because resource groups are the basic unit of failover, resource DLL entries are key to understanding cluster activity. The following entry is a cluster log entry for a resource DLL event, in this case one of the entries from the disk arbitration process.

15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999).

Instead of listing an abbreviated component name between the timestamp and event description as component log entries do, entries describing resource DLL events list the following information:

Resource type ("Physical Disk")
Resource name ("<Disk D:>")

The event description in this example is "[DISKARB] Arbitration Parameters (1 9999)."
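
The layout of a resource DLL entry can be captured in a pattern of its own. The following Python sketch is based only on the single Physical Disk entry shown above, so the exact punctuation it expects (angle brackets around the resource name, followed by a colon) is an assumption about other resource types. The function name parse_resource_entry is hypothetical, and the entry is assumed to be on one line.

```python
import re
from typing import Dict, Optional

# process.thread::timestamp ResourceType <ResourceName>: description
RESOURCE_RE = re.compile(
    r"^(?P<pid>[0-9A-Fa-f]+)\.(?P<tid>[0-9A-Fa-f]+)::(?P<ts>\S+)\s+"
    r"(?P<rtype>[^<\[]+?)\s*<(?P<rname>[^>]*)>:\s*(?P<desc>.*)$"
)


def parse_resource_entry(line: str) -> Optional[Dict[str, str]]:
    """Return the resource type, name, and description, or None if the
    line does not look like a resource DLL entry."""
    match = RESOURCE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "resource_type": match.group("rtype"),
        "resource_name": match.group("rname"),
        "description": match.group("desc"),
    }


parsed = parse_resource_entry(
    "15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: "
    "[DISKARB] Arbitration Parameters (1 9999)."
)
```

Because a component entry begins with a bracketed abbreviation immediately after the timestamp, it fails this pattern, which gives a simple way to tell the two entry types apart.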

Meanings of State Codes and Status Codes

Interpreting state and status codes is crucial to deciphering the cluster log, and it is not difficult. The following two procedures explain how.

To find the meaning of the status codes

At the command prompt, type:

net helpmsg [error_number]

For example, in the following entry, the error number is 5. Typing net helpmsg 5 returns "Access is denied," which indicates that the problem has to do with permissions.

388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that node 2 has failed with error 5.

Remember that you still need to study the context of the error to discover its cause.
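
When many entries have to be checked, the lookup can be scripted. The following Python sketch extracts the numbers that follow the words "error" or "status" (the two wordings that appear in the examples in this section; other wordings are not covered) and builds the corresponding net helpmsg command. The function names are hypothetical, and lookup_message works only on a Windows computer where the net command is available.

```python
import re
import subprocess
from typing import List

# Matches "error 5" or "status 997" as seen in the sample entries.
CODE_RE = re.compile(r"\b(?:error|status)\s+(\d+)\b")


def status_codes(line: str) -> List[int]:
    """Return every status code mentioned in a log line."""
    return [int(number) for number in CODE_RE.findall(line)]


def helpmsg_command(code: int) -> List[str]:
    """Build the argument list for: net helpmsg <code>"""
    return ["net", "helpmsg", str(code)]


def lookup_message(code: int) -> str:
    """Run net helpmsg and return its text (Windows only)."""
    result = subprocess.run(
        helpmsg_command(code), capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


codes = status_codes(
    "388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that "
    "node 2 has failed with error 5."
)
```

Keeping the command construction separate from the call that runs it allows the parsing to be done on any computer, with the lookups run later on a cluster node.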

To find the meaning of state codes

From the event description in the cluster log, note the type of object — group, resource, node state, network, or net interface — associated with the entry.

Note

In Windows 2000, the resource name is logged with its GUID when the Cluster service is started. You can expect to find the GUID and the resource name associated in a single entry during resource creation if these log entries have not been overwritten.

You can also find the resource name in the registry, in the resource's subkey, which is identified by the resource's GUID. For more information about finding the resource name in the registry, see "Identifying GUIDs in the Registry" later in this chapter.

In "State Codes" later in this chapter, find the meaning in the appropriate table.

For example, in the following entry, "state 129" means ClusterResourceOnlinePending.

388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning. Resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14, state 129, status 997.
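
The second procedure can be sketched the same way. In the following Python example, the table RESOURCE_STATES deliberately holds only the one value confirmed in this section; the remaining values would need to be filled in from the tables in "State Codes." The function name resource_state is hypothetical, and the entry is assumed to be on one line.

```python
import re
from typing import Dict, Optional, Tuple

# Only the value given in this section; extend from the "State Codes" tables.
RESOURCE_STATES: Dict[int, str] = {
    129: "ClusterResourceOnlinePending",
}

STATE_RE = re.compile(r"\bstate\s+(\d+)\b")


def resource_state(
    line: str, table: Dict[int, str] = RESOURCE_STATES
) -> Optional[Tuple[int, str]]:
    """Return (code, meaning) for the state in a log line, or None."""
    match = STATE_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, table.get(code, "unknown")


state = resource_state(
    "388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning. "
    "Resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14, state 129, status 997."
)
```

The table is passed as a parameter because the meaning of a state code depends on the type of object, so a caller would supply the group, node, network, or net interface table as appropriate.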