Interpreting the Cluster Log
Though cluster log entries might seem impenetrable at first glance, they are actually fairly easy to parse. Consider the following entry, the first line in the body of a typical cluster log:
378.32c::1999/06/09-18:00:18.874 Cluster service started - Cluster Node
Version 3.2051
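Its main elements, common to every line of the log, are the following, starting from the beginning of the entry:
The process ID and thread ID of the thread that issued the entry, separated by a period (378.32c in the example) and followed by two colons.
The timestamp, in the format:
yyyy/mm/dd-hh:mm:ss.sss
where yyyy is the year, mm the month, dd the day, hh the hour on a 24-hour clock, mm the minute, and ss.sss the second, to the millisecond.
The event description, which makes up the rest of the entry.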
In the preceding example, the time is 18.874 seconds past 6 o'clock in the evening, GMT.
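The layout just described can be checked mechanically. The following is a minimal Python sketch, not part of any Cluster service tooling, that splits an entry into its process ID, thread ID, timestamp, and description. The function name `parse_entry` and the regular expression are illustrative, and treating the two IDs as hexadecimal is an assumption based on values such as 32c.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <process ID>.<thread ID>::yyyy/mm/dd-hh:mm:ss.sss <description>
ENTRY_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<text>.*)$",
    re.DOTALL,
)

def parse_entry(line):
    """Return the parts of one cluster log entry, or None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    # The log records time in GMT, so attach the UTC time zone explicitly.
    stamp = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return {
        "process_id": int(match["pid"], 16),  # assumption: IDs are hexadecimal
        "thread_id": int(match["tid"], 16),
        "timestamp": stamp.replace(tzinfo=timezone.utc),
        "description": match["text"],
    }

entry = parse_entry(
    "378.32c::1999/06/09-18:00:18.874 "
    "Cluster service started - Cluster Node Version 3.2051"
)
print(entry["timestamp"].isoformat(), entry["description"])
```

Returning None for lines that do not match keeps the sketch usable on logs where an entry has wrapped onto a second line.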
There are two types of cluster log entries: component event log entries and resource DLL log entries.
The Cluster service comprises a number of components, such as the Database Manager and the Global Update Manager. The logging of their interactions is what makes the cluster log such a powerful diagnostic tool.
Here is a typical example of a cluster log entry for a component event:
378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.
Entries describing component events follow the process/thread ID and timestamp with an abbreviation in square brackets that identifies the component or node state, and then the event description.
A node state abbreviation, such as [INIT] or [JOIN], reflects the operation that is in progress.
A cluster log abbreviation can combine a component and state, as in [NMJOIN], which combines the Node Manager abbreviation and the join operation:
388.55c::1999/06/09-18:08:25.621 [NMJOIN] Processing request by node
2 to begin joining.
378.380::1999/06/09-18:00:51.193 [FM] Name for Resource
254ef0e5-1937-11d3-b3fe-00a0c986aa14 is 'Cluster IP Address'.
Cluster log abbreviations for components and node states are shown in Table 20.1.
Table 20.1 Cluster Log Abbreviations for Components and Node States
Abbreviation | Node state or component |
---|---|
[API] | API support. These entries come from the Cluster service component that provides support for the Server Cluster API. |
[ClMsg] | Cluster messaging. The component that Regroup (also known as Membership Manager — see later in this table) uses to send and receive its messages. |
[ClNet] | Cluster network engine. Generic code to determine a node's network configuration. |
[CP] | Checkpoint Manager. If a resource has its registry key registered for checkpointing, the Checkpoint Manager monitors any changes to the key while the resource is online and writes a checkpoint to the quorum disk whenever there is a change to the registered key. On the node to which the resource is being failed over, the resource key in the registry is updated with the resource key's checkpoint before the resource is brought online. |
[CS] | Cluster service. This abbreviation is assigned to messages that come out of the Cluster service rather than one of its components. |
[DM] | Database Manager. The agent through which other components read or make changes to the cluster configuration database. |
[EP] | Event Processor. Components of the Cluster service register with the Event Processor to receive internal cluster events, such as a node's going up or down. |
[FM] | Failover Manager. Coordinates the moving of a group from one node to another based on failure criteria specified by the group's properties. |
[GUM] | Global Update Manager. A cluster-wide, broadcast-like remote procedure call (RPC) mechanism used to distribute information to all nodes in the cluster. |
[INIT] | The initial state of a node prior to joining or forming a cluster. |
[JOIN] | The node state that follows [INIT] when the node attempts to join a cluster. If the join operation succeeds, the state of the node then moves to cluster member. |
[LM] | Log Manager. Maintains the quorum log. |
[MM] | Membership Manager, also known and written to the cluster log as Regroup ([RGP]). See [RGP] in this table. |
[NM] | Node Manager. Keeps track of the state of other nodes in the cluster as well as maintaining the cluster-wide network configuration. |
[OM] | Object Manager. Maintains an in-memory database of entities, or objects (nodes, networks, groups, and so on). Each object has an associated type and a set of methods with which other components can manipulate it. Each cluster object is represented in the Object Manager space. The Object Manager does not differentiate between types of objects. |
[RGP] | Regroup, also known and written to the cluster log as Membership Manager ([MM]). Tracks which nodes are members of the cluster. Regroup writes entries to the log during initialization, form operations, and join operations, and when cluster membership changes. |
[RM] | Resource Monitor. Any of the processes (instances of Resrcmon.exe) of the Cluster service that actually monitor individual resources. |
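
When scanning a long log, it can help to expand these abbreviations automatically. The sketch below is one hypothetical way to do so in Python: the dictionaries simply restate Table 20.1, and the names `COMPONENTS`, `NODE_STATES`, and `describe_tag` are illustrative. Splitting a combined abbreviation such as [NMJOIN] into a component prefix and a state suffix is an assumption generalized from that single example.

```python
import re

# Restated from Table 20.1.
COMPONENTS = {
    "API": "API support",
    "ClMsg": "Cluster messaging",
    "ClNet": "Cluster network engine",
    "CP": "Checkpoint Manager",
    "CS": "Cluster service",
    "DM": "Database Manager",
    "EP": "Event Processor",
    "FM": "Failover Manager",
    "GUM": "Global Update Manager",
    "LM": "Log Manager",
    "MM": "Membership Manager (Regroup)",
    "NM": "Node Manager",
    "OM": "Object Manager",
    "RGP": "Regroup (Membership Manager)",
    "RM": "Resource Monitor",
}
NODE_STATES = {"INIT": "initial state", "JOIN": "joining a cluster"}

# The bracketed abbreviation that directly follows the timestamp.
TAG_RE = re.compile(r"::\S+\s+\[(?P<tag>[A-Za-z]+)\]")

def describe_tag(line):
    """Return (component, node state) for an entry, or None if it has no tag."""
    match = TAG_RE.search(line)
    if match is None:
        return None
    tag = match["tag"]
    if tag in COMPONENTS:
        return (COMPONENTS[tag], None)
    if tag in NODE_STATES:
        return (None, NODE_STATES[tag])
    # Assumption: a combined tag is a component followed by a node state.
    for state, meaning in NODE_STATES.items():
        prefix = tag[: -len(state)]
        if tag.endswith(state) and prefix in COMPONENTS:
            return (COMPONENTS[prefix], meaning)
    return (None, None)  # bracketed tag not listed in the table

print(describe_tag("388.55c::1999/06/09-18:08:25.621 [NMJOIN] Processing request"))
```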
Because resource groups are the basic unit of failover, resource DLL entries are key to understanding cluster activity. The following entry is a cluster log entry for a resource DLL event, in this case one of the entries from the disk arbitration process.
15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB]
Arbitration Parameters (1 9999).
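Instead of an abbreviated component name between the timestamp and the event description, as in component event entries, entries describing resource DLL events list the resource type (Physical Disk in this example), followed by the resource name in angle brackets (<Disk D:>) and a colon, and then the event description.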
The event description in this example is "[DISKARB] Arbitration Parameters (1 9999)."
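As a rough illustration, a resource DLL entry can be told apart from a component event entry by the text that follows the timestamp. The Python sketch below assumes the layout seen in the example (a resource type, a name in angle brackets, a colon, then the description). The pattern and the name `parse_resource_entry` are hypothetical, and other resource DLLs may format their entries differently.

```python
import re

# Assumed layout: pid.tid::timestamp <resource type> <<resource name>>: <event>
RESOURCE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::(?P<ts>\S+)\s+"
    r"(?P<rtype>[^<\[]+?)\s*<(?P<rname>[^>]*)>:\s*(?P<event>.*)$",
    re.DOTALL,
)

def parse_resource_entry(line):
    """Return the resource type, resource name, and event description, or None."""
    match = RESOURCE_RE.match(line.strip())
    if match is None:
        return None  # for example, a component event entry such as "[NM] ..."
    return {
        "resource_type": match["rtype"],
        "resource_name": match["rname"],
        "event": match["event"],
    }

parsed = parse_resource_entry(
    "15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: "
    "[DISKARB] Arbitration Parameters (1 9999)."
)
print(parsed)
```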
Interpreting state and status codes is crucial to deciphering the cluster log. Doing so is not difficult. The following two procedures tell you how.
To find the meaning of status codes, type the following at the command prompt:
net helpmsg [error_number]
For example, in the following entry, "error 5" is the error number. Running net helpmsg 5 returns "Access is denied." The error indicates that the problem has to do with permissions.
388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that node 2 has
failed with error 5.
Remember that you still need to study the context of the error to discover its cause.
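When a log contains many such entries, the lookups can be prepared in bulk. The following Python sketch only builds the net helpmsg command lines; it does not run them, so it works on any platform. The name `helpmsg_commands` is illustrative, and treating every "error N" or "status N" phrase as a code worth looking up is an assumption about how entries are worded.

```python
import re

# Assumption: codes appear in the description as "error <n>" or "status <n>".
CODE_RE = re.compile(r"\b(?:error|status)\s+(\d+)\b", re.IGNORECASE)

def helpmsg_commands(entry_text):
    """Return one 'net helpmsg' command line per code found in the entry."""
    return [f"net helpmsg {code}" for code in CODE_RE.findall(entry_text)]

commands = helpmsg_commands(
    "388.4e8::1999/06/09-20:20:57.281 [NM] Received advice that node 2 has "
    "failed with error 5."
)
print(commands)
```

On a Windows computer, each resulting command can then be typed at the command prompt as described in the procedure.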
To find the meaning of state codes
Note
In Windows 2000, the resource name is logged with its GUID when the Cluster service is started. You can expect to find the GUID and the resource name associated in a single entry during resource creation if these log entries have not been overwritten.
You can also find the resource name in the registry, in the resource's subkey, which is identified by the resource's GUID. For more information about finding the resource name in the registry, see "Identifying GUIDs in the Registry" later in this chapter.
For example, in the following entry, "state 129" means ClusterResourceOnlinePending.
388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning.
Resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14, state 129, status 997.
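The state number in such an entry can be translated with a small lookup table. The sketch below is a hypothetical Python helper. The numeric values are given as they are commonly listed for the CLUSTER_RESOURCE_STATE enumeration in the Server Cluster API and should be verified against the Platform SDK header for your version. Only the 129 mapping is confirmed by the text above.

```python
import re
from enum import IntEnum

class ClusterResourceState(IntEnum):
    # Assumption: values as commonly listed for CLUSTER_RESOURCE_STATE;
    # verify against the SDK header before relying on them.
    ClusterResourceStateUnknown = -1
    ClusterResourceInherited = 0
    ClusterResourceInitializing = 1
    ClusterResourceOnline = 2
    ClusterResourceOffline = 3
    ClusterResourceFailed = 4
    ClusterResourcePending = 128
    ClusterResourceOnlinePending = 129
    ClusterResourceOfflinePending = 130

STATE_RE = re.compile(r"\bstate\s+(-?\d+)")

def resource_state(entry_text):
    """Return the named state found in an entry, or None if absent or unknown."""
    match = STATE_RE.search(entry_text)
    if match is None:
        return None
    try:
        return ClusterResourceState(int(match.group(1)))
    except ValueError:
        return None  # a number not present in the table above

state = resource_state(
    "388.398::1999/06/09-18:07:45.295 [FM] FmpRmOnlineResource: Returning. "
    "Resource 254ef0e8-1937-11d3-b3fe-00a0c986aa14, state 129, status 997."
)
print(state.name)
```

The status value in the same entry (997) is a separate code and can be looked up with net helpmsg as in the first procedure.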