4.5  Debugging Strategies

Although the exact steps needed to isolate a device driver problem differ from case to case, several initial debugging procedures provide a starting point. These include interpreting system bug check information, obtaining stack trace information, and checking hang conditions.

Interpreting System Bug Check Information

When Windows NT encounters a condition that compromises safe system operation, the system halts and displays a blue character-mode screen. This screen is commonly referred to as the “blue screen” or STOP screen. Windows NT attempts to display as much information as possible about the current state of the machine when it encountered the error. If crash dumps were enabled on the system, a crash dump file is created that can be used at a later time for more detailed debugging. If a debugger is attached and active, the system causes a breakpoint so the debugger can be used to investigate the crash.

The following example of a STOP screen contains several sections of information that will be discussed in detail. The sections are the STOP information at the top of the screen, information about the loaded modules in the system, basic stack trace information, and information regarding the status of crash dumps and kernel debugger activity.

*** STOP: 0x0000001E (0x80000003,0x80106fc0,0x8025ea21,0xfd6829e8)
Unhandled Kernel exception c0000047 from fa8418b4 (8025ea21,fd6829e8)

Dll Base  Date Stamp - Name               Dll Base  Date Stamp - Name
80100000  2be154c9   - ntoskrnl.exe       80400000  2bc153b0   - hal.dll
80258000  2bd49628   - ncrc710.sys        8025c000  2bd49688   - SCSIPORT.SYS
80267000  2bd49683   - scsidisk.sys       802a6000  2bd496b9   - Fastfat.sys
fa800000  2bd49666   - Floppy.SYS         fa810000  2bd496db   - Hpfs_Rec.SYS
fa820000  2bd49676   - Null.SYS           fa830000  2bd4965a   - Beep.SYS
fa840000  2bdaab00   - i8042prt.SYS       fa850000  2bd5a020   - SERMOUSE.SYS
fa860000  2bd4966f   - kbdclass.SYS       fa870000  2bd49671   - MOUCLASS.SYS
fa880000  2bd9c0be   - Videoprt.SYS       fa890000  2bd49638   - NCR77C22.SYS
fa8a0000  2bd4a4ce   - Vga.SYS            fa8b0000  2bd496d0   - Msfs.SYS
fa8c0000  2bd496c3   - Npfs.SYS           fa8e0000  2bd496c9   - Ntfs.SYS
fa940000  2bd496df   - NDIS.SYS           fa930000  2bd49707   - wdlan.sys
fa970000  2bd49712   - TDI.SYS            fa950000  2bd5a7fb   - nbf.sys
fa980000  2bd72406   - streams.sys        fa9b0000  2bd4975f   - ubnb.sys
fa9c0000  2bd5bfd7   - mcsxns.sys         fa9d0000  2bd4971d   - netbios.sys
fa9e0000  2bd49678   - Parallel.sys       fa9f0000  2bd4969f   - serial.SYS
faa00000  2bd49739   - mup.sys            faa40000  2bd4971f   - SMBTRSUP.SYS
faa10000  2bd6f2a2   - srv.sys            faa50000  2bd4971a   - afd.sys
faa60000  2bd6fd80   - rdr.sys            faaa0000  2bd49735   - bowser.sys

Address  dword dump                                     Dll Base - Name
801afc20 80106fc0 80106fc0 00000000 00000000 80149905 : fa840000 - i8042prt.SYS
801afc24 80149905 80149905 ff8e6b8c 80129c2c ff8e6b94 : 8025c000 - SCSIPORT.SYS
801afc2c 80129c2c 80129c2c ff8e6b94 00000000 ff8e6b94 : 80100000 - ntoskrnl.exe
801afc34 801240f2 80124f02 ff8e6df4 ff8e6f60 ff8e6c58 : 80100000 - ntoskrnl.exe
801afc54 80124f16 80124f16 ff8e6f60 ff8e6c3c 8015ac7e : 80100000 - ntoskrnl.exe
801afc64 8015ac7e 8015ac7e ff8e6df4 ff8e6f60 ff8e6c58 : 80100000 - ntoskrnl.exe
801afc70 80129bda 80129bda 00000000 80088000 80106fc0 : 80100000 - ntoskrnl.exe

Kernel Debugger Using: COM2 (Port 0x2f8, Baud Rate 19200)
Restart and set the recovery options in the system control panel or the
/CRASHDEBUG system start option. If this message reappears, contact
your system administrator or technical support group.

There is a significant amount of information available for identifying the problem that caused the system to halt. This information can be used in conjunction with WinDbg to identify the problem.

At the top of the screen is the bug check code labeled STOP. This code will vary depending on the individual problem that caused the system to halt. The most common bug check codes are:

Bug Check Code   Definition
0x0000000A       IRQL_NOT_LESS_OR_EQUAL
0x0000001E       KMODE_EXCEPTION_NOT_HANDLED
0x0000007F       UNEXPECTED_KERNEL_MODE_TRAP

A bug check code of IRQL_NOT_LESS_OR_EQUAL generally indicates a software failure in a driver or other system component. This code indicates that an attempt was made to execute an operation at an improper IRQL. It is generally caused by calling routines that are invalid at the IRQL of the caller, by passing invalid addresses to system routines, or by attempting to touch pageable memory at raised IRQL.

A bug check code of KMODE_EXCEPTION_NOT_HANDLED indicates that an exception taken in kernel mode was not handled. This bug check can result from any number of different exceptions. Some of the most common are 0xc0000005 (access violation) and 0x80000003 (a breakpoint was encountered without a kernel debugger being attached to the system).

A bug check code of UNEXPECTED_KERNEL_MODE_TRAP indicates that a software condition too serious to continue has been encountered. Examples include divide by zero, a corrupt task state segment, or a fault occurring while processing a fault (this is known as a double fault condition).

The four values following the STOP message are parameters to KeBugCheckEx, the support routine called when the system must halt. For each bug check code, the parameters will vary. The following is a description of what each parameter means for the most common bug check codes:

Code        Parameter 1        Parameter 2          Parameter 3          Parameter 4
0x0000000A  Memory referenced  IRQL                 0 - Read, 1 - Write  Address that referenced
                                                                         the memory
0x0000001E  Exception code     Address where the    Parameter 0 of the   Parameter 1 of the
                               exception occurred   exception            exception
0x0000007F  Trap code          Not used             Not used             Not used
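
As a rough illustration of the table above, the following Python sketch splits a STOP line into its bug check code and four parameters and labels them for the three codes covered in this section. The names parse_stop_line, BUG_CHECKS, and STOP_PATTERN are hypothetical, and the parsing assumes that the line has exactly the layout shown in the sample STOP screen.

```python
import re

# Names and parameter meanings for the three bug check codes described
# in this section. Other codes are reported with generic labels.
BUG_CHECKS = {
    0x0000000A: ("IRQL_NOT_LESS_OR_EQUAL",
                 ["Memory referenced", "IRQL", "0 - Read, 1 - Write",
                  "Address that referenced the memory"]),
    0x0000001E: ("KMODE_EXCEPTION_NOT_HANDLED",
                 ["Exception code", "Address where the exception occurred",
                  "Parameter 0 of the exception",
                  "Parameter 1 of the exception"]),
    0x0000007F: ("UNEXPECTED_KERNEL_MODE_TRAP",
                 ["Trap code", "Not used", "Not used", "Not used"]),
}

# Assumed layout: "*** STOP: <code> (<p1>,<p2>,<p3>,<p4>)", all in hex.
HEX = r"(0x[0-9a-fA-F]+)"
STOP_PATTERN = re.compile(
    r"STOP:\s*" + HEX + r"\s*\(\s*" + r"\s*,\s*".join([HEX] * 4) + r"\s*\)"
)


def parse_stop_line(line):
    """Return (code, name, [(label, value), ...]) for a STOP line."""
    match = STOP_PATTERN.search(line)
    if match is None:
        raise ValueError("not a STOP line: %r" % line)
    code, *params = (int(group, 16) for group in match.groups())
    generic = ["Parameter %d" % i for i in range(1, 5)]
    name, labels = BUG_CHECKS.get(code, ("(not covered here)", generic))
    return code, name, list(zip(labels, params))


if __name__ == "__main__":
    sample = "*** STOP: 0x0000001E (0x80000003,0x80106fc0,0x8025ea21,0xfd6829e8)"
    code, name, params = parse_stop_line(sample)
    print("0x%08X %s" % (code, name))
    for label, value in params:
        print("  %-38s 0x%08x" % (label, value))
```

For the sample screen, this labels 0x80000003 as the exception code, which matches the breakpoint exception described earlier.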

Following the bug check information is information about each loaded driver or base module in the system. This information consists of the base address where each module is loaded, a hexadecimal representation of the date stamp of the binary image, and the name of the driver or base component. The base address information can be interpreted to determine in what image an address is found. This information is useful in an attempt to determine which driver contains the address where, for example, a faulting instruction was executed.
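
The address-to-module lookup described above can be sketched in Python as follows. The sketch reads base addresses and names from module rows copied from the sample screen, then picks the module with the highest base address that does not exceed the address in question. The function names are hypothetical. Because the STOP screen does not show image sizes, the result is only a best guess, and it is only meaningful when every row of the module list is supplied; only four rows are used here to keep the example short.

```python
import bisect
import re

# One module entry: base address, date stamp, "-", name.
MODULE_ENTRY = re.compile(r"([0-9a-fA-F]{8})\s+[0-9a-fA-F]{8}\s+-\s+(\S+)")


def parse_modules(screen_text):
    """Return a sorted list of (base address, name) pairs from module rows."""
    entries = MODULE_ENTRY.findall(screen_text)
    # The screen does not list modules in address order, so sort by base.
    return sorted((int(base, 16), name) for base, name in entries)


def find_module(address, modules):
    """Return the (base, name) pair most likely to contain address, or None."""
    bases = [base for base, _ in modules]
    index = bisect.bisect_right(bases, address) - 1
    return modules[index] if index >= 0 else None


if __name__ == "__main__":
    # A few rows copied from the sample STOP screen.
    rows = """
80100000  2be154c9   - ntoskrnl.exe       80400000  2bc153b0   - hal.dll
80258000  2bd49628   - ncrc710.sys        8025c000  2bd49688   - SCSIPORT.SYS
80267000  2bd49683   - scsidisk.sys       802a6000  2bd496b9   - Fastfat.sys
fa840000  2bdaab00   - i8042prt.SYS       fa850000  2bd5a020   - SERMOUSE.SYS
"""
    modules = parse_modules(rows)
    for address in (0x80106fc0, 0xfa8418b4):
        base, name = find_module(address, modules)
        print("%08x is probably in %s (base %08x)" % (address, name, base))
```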

Following the list of loaded drivers in the system is a brief stack trace. This information indicates what drivers and routines were being executed when the system failed. However, it is important to note that the last calls are not necessarily the cause of the system failure. The last calls on the stack trace can be system routines attempting to handle or process the error condition.

The final layer of information displayed is information regarding kernel debugger and crash dump status. This information indicates if the target side of kernel debugging is now active. If a crash dump file has been generated, that information will be displayed as well.

Debugging a System Bug Check

If a system has bug checked, there are steps to obtain useful information for further debugging. These steps include obtaining the stack trace and getting trap frame information.

When the debugger has connected and the system is accessible from the command window of the host machine, type kv. This command displays a verbose kernel trace from the target machine. The following is a sample trace:

ChildEBP RetAddr  Args to Child
8013ed5c 801263ba 00000000 00000000 e12ab000 NT!_DbgBreakPoint (FPO: [0,0,0])
8013eecc 801389ee 0000000a 00000000 0000001c NT!_KeBugCheckEx+0x194
8013eecc 00000000 0000000a 00000000 0000001c NT!_KiTrap0E+0x256 (FPO: [0,0] TrapFrame @ 8013eee8)
8013ed5c 801263ba 00000000 00000000 e12ab000
8013ef64 00000246 fe551aa1 ff690268 00000002 NT!_KeBugCheckEx+0x194

The information from a kernel trace includes the list of calling routines, the parameters of each call, and, if a routine stored a trap frame, the trap frame address. If a trap frame was stored, the information that caused the trap can be retrieved by using the !trap command. In the preceding example, a trap frame was stored by NT!_KiTrap0E (which is very common) at 8013eee8. After the trap frame has been retrieved, the !kb command generates a stack trace for that trap frame.
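
When a trace is long, the trap frame address can be picked out mechanically. The following Python sketch scans saved kv output for the "TrapFrame @" annotation and returns the routine and address for each match. The function name and the regular expression are hypothetical and assume that the annotation is formatted exactly as in the sample trace.

```python
import re

# Matches "<routine> (FPO: ... TrapFrame @ <address>)" on one trace line.
TRAP_FRAME = re.compile(r"(\S+)\s+\(FPO:[^)]*TrapFrame @ ([0-9a-fA-F]+)\)")


def find_trap_frames(trace_text):
    """Return a list of (routine, trap frame address) pairs found in a trace."""
    frames = []
    for line in trace_text.splitlines():
        match = TRAP_FRAME.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


if __name__ == "__main__":
    # Lines copied from the sample trace in this section.
    sample = """\
8013ed5c 801263ba 00000000 00000000 e12ab000 NT!_DbgBreakPoint (FPO: [0,0,0])
8013eecc 801389ee 0000000a 00000000 0000001c NT!_KeBugCheckEx+0x194
8013eecc 00000000 0000000a 00000000 0000001c NT!_KiTrap0E+0x256 (FPO: [0,0] TrapFrame @ 8013eee8)
"""
    for routine, address in find_trap_frames(sample):
        print("%s stored a trap frame; try: !trap %s" % (routine, address))
```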

Taking a page fault at raised IRQL or at an otherwise inappropriate time is a frequent cause of bug check conditions. Many operations can cause this, including calling support routines at an IRQL higher than permitted for that routine, using incorrect pointers, or touching pageable memory at raised IRQL. A page fault condition generally creates a trap frame and a fault situation that requires additional investigation to find the cause of the problem.

A bug check of 0x0000001E (KMODE_EXCEPTION_NOT_HANDLED) often does not display full stack information and, thus, requires additional investigation. See Debugging Unhandled Exceptions.

Debugging Unhandled Exceptions

A bug check condition of KMODE_EXCEPTION_NOT_HANDLED might require additional investigation if a full stack trace could not be obtained through normal stack trace operations. Use the following procedure to get the information necessary to identify the problem:

1. Look for the first parameter to NT!PspUnhandledExceptionInSystemThread. Use the kb command, which displays parameters in the stack trace, to find this value. For optimized RISC machines, which might not have the parameters on the stack, see Manually Obtaining Call Stack Trace Information on RISC Machines for more information on how to obtain this parameter.

2. The first parameter to NT!PspUnhandledExceptionInSystemThread is a pointer to a structure that contains pointers to an exception record and a context record. A dd command on that address displays the necessary data.

3. The first value retrieved is the address of the exception record, which can be displayed by using the !exr command. The second value retrieved is the address of the context record, which can be displayed by using the !cxr command.

4. After executing the !cxr command, use the !kb command to display a stack trace based on the context record information. This shows the call stack at the time the unhandled exception occurred.

The stack trace obtained in this way is a normal stack trace with parameters, which can then be used to debug the problem and find the offending operation.
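
The bookkeeping in the procedure above can be sketched in Python. Given one line of dd output for the address passed to NT!PspUnhandledExceptionInSystemThread, the sketch takes the first two values as the exception record and context record addresses and lists the follow-up commands. The function names are hypothetical, the sample dd line is made up for illustration, and the sketch assumes the usual dd layout of a dumped address followed by 32-bit values.

```python
def exception_pointers_from_dd(dd_line):
    """Return (exception record address, context record address) as strings.

    Assumes dd_line is the first line of dd output for the structure:
    the dumped address, then the 32-bit values stored there.
    """
    fields = dd_line.split()
    if len(fields) < 3:
        raise ValueError("expected an address and at least two values")
    # fields[0] is the address that was dumped; the next two are the pointers.
    return fields[1], fields[2]


def follow_up_commands(dd_line):
    """List the debugger commands to enter next, in order."""
    exception_record, context_record = exception_pointers_from_dd(dd_line)
    return ["!exr " + exception_record, "!cxr " + context_record, "!kb"]


if __name__ == "__main__":
    # Made-up dd output line; real addresses will differ.
    sample = "fe4f7b00  fe4f7b40 fe4f7b90 00000000 00000000"
    for command in follow_up_commands(sample):
        print(command)
```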

Manually Obtaining Call Stack Trace Information on RISC Machines

On an optimized version of Windows NT running on a RISC system, such as the free build of Windows NT, parameters might not be stored on the stack. Instead, they can be stored in registers for performance. Attempting to debug a problem on a RISC machine presents unique problems, as a kb command might not reveal the correct parameters passed to each function.

To unwind the stack and find where various parameters are stored, it is necessary to understand the assembly code for each platform. The movement of the parameters in and out of registers, as well as the values stored in them, must be tracked. For more information on the assembly language for each platform and how that processor architecture handles call stacks and parameters, see the appropriate processor reference from the manufacturer.

On MIPS machines, parameters are stored in first-to-last order in registers a0 through a3 and, if necessary, registers t0 through t7. On Alpha machines, parameters are stored in first-to-last order in registers a0 through a5 and, if necessary, registers t0 through t7. On PowerPC machines, parameters are stored in first-to-last order in registers r3 through r31.
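
As a quick reference while unassembling, the register assignments in the preceding paragraph can be written as a lookup table. The following Python sketch only restates that paragraph; the table and function names are hypothetical, and the manufacturer's processor reference remains the authority on the calling convention.

```python
# Parameter registers in first-to-last order, as listed in the paragraph above.
PARAMETER_REGISTERS = {
    "MIPS": ["a%d" % i for i in range(4)] + ["t%d" % i for i in range(8)],
    "Alpha": ["a%d" % i for i in range(6)] + ["t%d" % i for i in range(8)],
    "PowerPC": ["r%d" % i for i in range(3, 32)],
}


def parameter_register(architecture, index):
    """Return the register listed for the zero-based parameter index, or None."""
    registers = PARAMETER_REGISTERS[architecture]
    return registers[index] if 0 <= index < len(registers) else None


if __name__ == "__main__":
    for architecture in ("MIPS", "Alpha", "PowerPC"):
        first = parameter_register(architecture, 0)
        fifth = parameter_register(architecture, 4)
        print("%-8s first parameter: %s, fifth parameter: %s"
              % (architecture, first, fifth))
```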

Unassemble the function to a point where the parameter values can be identified in memory, or until a function call on the stack is made without the register being saved. This gives you an indication of where each parameter was stored and what its value was. Do this for each module as you trace downward in the system. Some of the parameters might have been stored on the stack before use, which makes retrieving their values relatively simple.

Checking Hang Conditions

Sometimes a system can hang without breaking into the debugger. The symptoms can vary from no mouse or keyboard response to no video updates and no I/O response.

The following is a starting point for debugging a hang condition:

1. Break into the target system by pressing CTRL+C on the host debugger. This causes the target system to halt as if it had encountered a breakpoint. Use !process to identify the currently running process.

2. The most useful information is the time values, the handle count, and the thread status information. A high time value can indicate that this is a suspect process. If the current process is idle, the machine is either truly idle or in an indeterminate state. If this process does not seem to be the problem, use !process with its other options to get information on other processes in the system or to display more detail.

3. After a suspect process has been identified, use !process <process> 7 to show the kernel stacks for each thread in the process. This can indicate what the problem could be in kernel mode and what the suspect process is calling.

To identify other information about a possible hang condition, the following table contains some additional useful commands:

Command     Usage in debugging hang conditions
!ready      Identifies threads in a ready condition in order of priority.
!locks      Identifies any resource locks.
!vm         Displays virtual memory usage.
!poolused   Checks pool allocation. On a checked build with pool tags,
            this can identify excessive allocations.
!memusage   Checks physical memory status.
!heap       Checks the validity of the heap.
!irpfind    Displays any information about pending IRP requests.

To determine if a single process is causing the machine to hang, set a breakpoint at KiSwapContext. If this breakpoint is hit, then the system is scheduling other processes. If the breakpoint is not hit, then a single process is causing the hang.