How to Track, Isolate, and Exterminate Bugs in Your Windows[TM]-based Applications

Dave Edson


Since the introduction of the Microsoft® Windows™ operating system version 3.0, programming for Windows has moved out of the hacker’s realm and into the mainstream. Programmers are getting used to writing multitasking, resource-sharing, event-driven code. There are excellent books that can teach you how to program in Windows, but not much is available to help you learn how to debug Windows-based applications. I want to share some of the techniques (and gotchas) I’ve learned that will make the Unrecoverable Application Error message go pick on some other app.

All Sorts of Bugs

Fatal Exits. Some developers coding for Windows think fatal exits are the old-fashioned way to validate parameters. Fatal exits are also known as RIPs, for obvious reasons. RIPs are most often caused by an invalid handle. Passing a bogus window handle to a Windows API function or using a bogus memory handle causes the debugging version of Windows to RIP. Doing this in the retail version of Windows usually trashes some internal Windows memory. When your application dies a bit later down the road, you don’t know why.

UAEs. Any Windows 3.0 user who has never experienced an Unrecoverable Application Error (UAE) must only play Solitaire. UAEs are caused by Intel® 80x86 exception traps. The most common causes are reading/writing to a block of memory that does not belong to you, exceeding segment bounds, executing invalid instructions, or dividing by zero. Divide-by-zero UAEs look slightly different because in this one case, Windows is nice enough to tell you that you tried a mathematically impossible task.

Trashed Data. Trashing data is a lot easier than you think. You can trash three types of data: your own, some other application’s, or Windows data.

Trashing Windows data is a real show stopper, causing wonderful screen fireworks such as a UAE box with garbage in it. Windows 3.0 data is easy to trash. Version 3.1’s data is very difficult to trash because it has parameter-validation code.

Usually, it is quite hard to trash another application’s data (unless you are running in real mode, which doesn’t exist in Windows 3.1 anyway, so I’ll just pretend that real mode does not exist for the sanity of this article). Of the 8,192 available selectors in Windows, usually a thousand or so get used. That makes your odds about one in eight for trashing a selector address in use. Of those 1,000 or so selectors, most are read-only, so attempting to store new values in them will cause a UAE instead of trashed data anyway.

Hang/Drop to MS-DOS. If you cause an infinite loop, or trash a critical interrupt vector or memory address, your machine will quietly stop working. Determining what caused the hang can be very tough, since you usually have to give your machine the three-finger salute, erasing all of your data (though it does give CHKDSK a reason to exist). Other times, you may cause a UAE during a critical section. Causing a UAE while Windows is in a critical section usually lands you at the MS-DOS® operating system prompt, which is your cue to reboot immediately (your MS-DOS data is most likely corrupt, and disk I/O may trash your hard disk).

Killing Others. What if another application crashes when you run your application? Who is to blame? Most likely your application (this ensures that you’ll test your application even more thoroughly).

Compiling

The first rule of debugging is: don’t compile your application with any optimizations enabled. Optimized code can’t be easily traced through with a debugger, and you can’t verify if the code is compiled correctly. Once your application is completely and totally ready for release, recompile with optimizations, and then retest everything.

Some optimizations should never be used, in my opinion. Relaxed aliasing is one of them.

BOOL bListBoxLineHasData ( HWND hListBox, int iIndex,
                           LPSTR szReturn )
{
    char szListBoxLine[128];

    *szListBoxLine = 0; /* Clear out the string */
    SendMessage( hListBox,
                 LB_GETTEXT,
                 iIndex,
                 (LONG)(LPSTR)szListBoxLine);
    if (*szListBoxLine)
    {
        lstrcpy( szReturn, szListBoxLine );
        return TRUE;
    }
    return FALSE;
}

This code looks innocent enough, but it will always return FALSE if you compile with aliasing relaxed. You are casting the array pointer szListBoxLine to a long, and a long is passed by value, not by reference. The C compiler therefore assumes that there is no way the data pointed to by szListBoxLine could have changed since the statement *szListBoxLine = 0; and removes the entire if block.

When you compile for debugging in Microsoft C, use the -Zi switch. Don’t use it on the modules you have debugged already, because it will fatten up your EXE file a lot, which could tax your debugger.

When you link, it’s a good idea to use the /MAP switch and run MAPSYM.EXE. This will give you much more information when you get a fatal exit. The techniques used in this article to track down fatal exits assume you have used these tools.

The second major rule of debugging is: always run in the debugging version of Windows while developing an application. Many programs are released that simply won’t run in the debugging version of Windows. (Actually, it’s amazing how some programs are released that won’t run in the retail version either!) Some fatal exits occur only under unusual circumstances, and every one you find while programming is one less your testers need to find. Of course, you should have at least one of your testers running in the retail version. In theory, any application that runs clean in the debugging version of Windows will run in the retail version of Windows. However, because these are separate programs, you must test on both.

Tracking Down Bugs

Now for the fun part. How do you find the bugs that cause these headaches? Let’s start with fatal exits, since they are the most common. Probably ninety-nine percent of fatal exits are caused by passing invalid parameters into Windows. Consider the following code:

hWnd = GetFocus();
PostMessage( hWnd, WM_USER+100, 0, 0L );
while (PeekMessage(&msg, 0, 0, 0, PM_REMOVE ))
{
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}
DestroyWindow( hWnd );

This code assumes that the window was not destroyed during the PeekMessage loop! A lot of things could have happened during that loop. The user might have closed the window. The WM_USER+100 message may have caused the window to close. The hWnd could have been corrupted by a stack overflow. Or, even worse, the hWnd retrieved from GetFocus might not even belong to that application! Who knows what that other application will do with the WM_USER+100?

To fix the code, you should test whether the window is qualified for the WM_USER+100 message after you’ve gotten the hWnd from the GetFocus call. After your code passes this test and after the PeekMessage loop, the IsWindow function should be used to determine if your window can be destroyed.

Other common fatal exits are caused by using a Windows API without its complement, such as a GlobalLock without a GlobalUnlock. Too many GlobalLocks and you RIP. If you create objects without deleting them when finished, you eventually run out of space and RIP.

static HBRUSH hGreenBrush;

case WM_CTLCOLOR:
    hGreenBrush = CreateSolidBrush( RGB (0,255,0));
    return hGreenBrush;

case WM_DESTROY:
    DeleteObject( hGreenBrush );
    break;

This code assumes that the WM_CTLCOLOR message is sent only once. Actually, this message is sent whenever any control needs painting--about a zillion times during an application’s life. This code creates a new brush each time; the old one is left in the GDI heap forever. Eventually the GDI heap will overflow and the system will crash. There are two ways you can fix this code.

static HBRUSH hGreenBrush = NULL;

case WM_CTLCOLOR:
    if (!hGreenBrush)
        hGreenBrush = CreateSolidBrush( RGB(0,255,0));
    return hGreenBrush;

case WM_DESTROY:
    DeleteObject( hGreenBrush );
    hGreenBrush = NULL;
    break;

This method creates the brush once and uses it over and over. Another solution involves creating the brush in the WM_INITDIALOG case and destroying it when you exit the dialog. Either method is OK; the above code keeps the code to create the brush together with the code that uses it.

Trashing USER’s data segment (DS) is another way to cause a RIP. The easiest way to do that is to mix and match DCs, or to use window extra bytes that don’t belong to you. If you declared 4 extra bytes for a window, don’t call SetWindowLong as follows:

SetWindowLong( hWnd, 2, dwValue);

The last two bytes will walk all over something else in USER’s DS, most likely someone else’s extra bytes. Look at this sample code:

hDC = GetDC( GetDlgItem( hDlg, IDOK ));
/* draw draw draw */
ReleaseDC( hDlg, hDC );

This mistake is easy to make, because everybody was told that dialogs use the parent’s DC, so the parent’s DC (from hDlg) should be the same one. Well, this is not true. The DC was obtained for the OK button’s window, not for hDlg, so when you release it this way, USER ignores the request. This may cause unpredictable behavior later on.

Once USER’s DS is trashed, perfectly valid calls can cause fatal exits. Once you locate the point of a fatal exit, check the code. If the code is good, and you have verified everything else, start to look for things that trash USER’s DS (or your own DS). These techniques are discussed below.

Tracking Down Fatal Exits

Getting a fatal exit is good news of sorts: it means that you’ve uncovered a very findable, fixable bug. There are some things you must do in order to reproduce a fatal exit.

Have the debugging version of Windows installed.

Use a debugger (such as the CodeView® debugger for Windows) that redirects the output to its command window. You can have a secondary output device connected to AUX. You can put a dumb terminal on COM1, or you can put a secondary monochrome monitor in your machine and use OX.SYS to redirect to that monitor. (OX.SYS can be found on any MSJ bulletin board.) If you use the secondary monochrome monitor, you will need to cold-boot it after getting a fatal exit (the keyboard is locked up). There is also a program called WINOX.EXE (also on any MSJ bulletin board) that redirects the output to a window, but then you have to allow for the Heisenberg Uncertainty Principle, which says that the accurate measurement of an observable quantity necessarily produces uncertainties in one’s knowledge of the values of other observables. In this context it means that since measuring a value can change the value, you can’t necessarily determine what caused the fatal exit.

Have a bug in your code.

Reading stack traces from a fatal exit can be very difficult or very easy, depending on whether you compiled and linked your application correctly. In Microsoft C, if you compiled with the -Zi switch, linked with the /MAP switch, and ran the MAPSYM program, you will get a stack trace that looks like this:

Fatal Exit code = 0x0007
Stack Trace
USER!_FFFE:SHOWCURSOR+0389
USER!_MSGBOX:08D7
USER!_FFFE:922d
BASE!_TEXT:MainWndProc+0028
USER!_FFFE:DISPATCHMESSAGE+004A
BASE!_TEXT:_DoMain+0030
BASE!_TEXT:WINMAIN+0031
BASE!_TEXT:__astart+0060
Abort, Break or Ignore?

If you compiled with the -Fc switch, you have a combined source/object listing/file (COD). From this, you can see the exact line of the fatal error. The file I compiled was BASEPROG.C. An excerpt from BASEPROG.COD is in Figure 1.

Figure 1 BASEPROG.COD

; Line 43
PUBLIC MAINWNDPROC
MAINWNDPROC PROC FAR
*** 000000 1e push ds
*** 000001 58 pop ax
*** 000002 90 nop
*** 000003 45 inc bp
*** 000004 55 push bp
*** 000005 8b ec mov bp,sp
*** 000007 1e push ds
*** 000008 8e d8 mov ds,ax
ASSUME DS: NOTHING
*** 00000a 81 ec 00 00 sub sp,0
*** 00000e 57 push di
*** 00000f 56 push si
; hWnd = 14
; wMessage = 12
; wParam = 10
; lParam = 6
;|*** switch (wMessage) {
; Line 44
*** 000010 8b 46 0c mov ax,WORD PTR [bp+12] ;wMessage
*** 000013 e9 40 00 jmp $S1177
;|*** case WM_LBUTTONDOWN:
; Line 45
$SC1181:
;|*** MessageBox(ghInstance, "Watch me Fatal Exit!", "Wow!", MB_OK);
; Line 46
*** 000016 ff 36 00 00 push WORD PTR _ghInstance
*** 00001a b8 05 00 mov ax,OFFSET DGROUP:$SG1183
*** 00001d 1e push ds
*** 00001e 50 push ax
*** 00001f b8 00 00 mov ax,OFFSET DGROUP:$SG1182
*** 000022 1e push ds
*** 000023 50 push ax
*** 000024 b8 00 00 mov ax,0
*** 000027 50 push ax
*** 000028 9a 00 00 00 00 call FAR PTR MESSAGEBOX
;|*** break;
; Line 47
*** 00002d e9 38 00 jmp $SB1178
;|***

Now, let’s look back at that fatal exit. Fatal exit 0007H results from an invalid window handle (see Figure 2). The top of the stack trace is the last function called before you died. This stack trace implies that you were in the ShowCursor function when you expired (USER!_FFFE:SHOWCURSOR+0389). This is a lie. In Windows 3.0, the code that checks for invalid window handles is in SetCursor. Probably every fatal error 0007H you’ll ever see in Windows 3.0 and 3.1 comes from SetCursor. Besides, SetCursor is not your code anyway, so the bug could not have happened in response to a call to SetCursor. The next line says that MessageBox was the previous function (USER!_MSGBOX:08D7). At this point it looks like MessageBox called ShowCursor, but it makes no sense to blame ShowCursor since the problem is most likely in your code. Well, now you need to find out who called MessageBox. It was you (BASEPROG!_TEXT:MainWndProc+0028)! Look at Figure 1 and locate the beginning of MainWndProc:

PUBLIC MAINWNDPROC
MAINWNDPROC PROC FAR
*** 000000 1e push ds

The number after the asterisks is the starting address of this function. In this case, it is 000000H, which makes finding this bug easy. Adding 0028H to 000000H equals 000028H (handy tip: if you always put your buggy functions as the first function in a code segment, you won’t need to do any hex addition). Now, look at the last line of assembly under line 46:

*** 000028 9a 00 00 00 00 call FAR PTR MESSAGEBOX

This is the offending MessageBox call. Checking the parameter of the MessageBox, you see that you passed in an hInstance instead of an hWnd. There’s the bug.
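If your suspect function is not the first one in its segment, the hex addition is easy to get wrong by hand. Here is a rough sketch of the lookup in portable C. The table holds a few offsets copied from Figure 1; the names ListingLine, kListing, and find_instruction are made up for this illustration and are not part of any Microsoft tool.

```c
#include <stddef.h>

/* One line of a .COD listing: offset within the segment plus its text. */
typedef struct {
    unsigned long offset;
    const char   *text;
} ListingLine;

/* A few lines copied from Figure 1 (source line 46 of BASEPROG.C). */
static const ListingLine kListing[] = {
    { 0x000016UL, "push WORD PTR _ghInstance" },
    { 0x000024UL, "mov ax,0" },
    { 0x000027UL, "push ax" },
    { 0x000028UL, "call FAR PTR MESSAGEBOX" },
    { 0x00002dUL, "jmp $SB1178" },
};

/* Add the displacement from the stack trace to the function's starting
   offset and return the listing line at that address, or NULL if the
   table has no line there. */
static const ListingLine *find_instruction(unsigned long func_start,
                                           unsigned long displacement)
{
    unsigned long target = func_start + displacement;
    size_t i;

    for (i = 0; i < sizeof kListing / sizeof kListing[0]; i++) {
        if (kListing[i].offset == target)
            return &kListing[i];
    }
    return NULL;
}
```

Calling find_instruction(0x000000UL, 0x0028UL) for MainWndProc+0028 lands on the MESSAGEBOX call, the same line found by hand above.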

Figure 2 Fatal Error Codes

Windows 3.0 Fatal Error Codes

Number	Description

0000H	Invalid handle passed to a GDI function (such as a bad brush)
0001H	Insufficient memory for allocation
0002H	Error reallocating memory
0003H	Memory cannot be freed (probably still locked)
0004H	Memory cannot be locked (probably discarded)
0005H	Memory cannot be unlocked (probably not locked)
0006H	Invalid GDI object
0007H	Window handle not valid
0008H	Cached display contexts are busy (max 5 at a time, most likely caused by forgetting to ReleaseDC somewhere)
000AH	Clipboard already open
000BH	Cached display context not released by window being destroyed
000CH	Mouse module not valid
000DH	Display module not valid
000EH	Unlocked data segment should be locked
000FH	Invalid lock on system queue
0100H	Local memory errors
0140H	Local heap is busy (usually means DS trashed, stack overflow)
0180H	Invalid local handle
01C0H	LocalLock count overflow
01F0H	LocalUnlock count underflow
0200H	Global memory errors
0240H	Critical section problems
0280H	Invalid global handle
02C0H	GlobalLock count overflow
02F0H	GlobalUnlock count underflow
0300H	Task schedule errors
0301H	Invalid task ID
0302H	Invalid exit system call
0303H	Invalid BP register chain
0400H	Dynamic loader/linker errors
0401H	Error during boot process
0402H	Error loading a module
0403H	Invalid ordinal reference
0404H	Invalid entry name reference
0405H	Invalid start procedure
0406H	Invalid module handle
0407H	Invalid relocation record
0408H	Error saving forward reference
0409H	Error reading segment contents
0410H	Error reading segment contents
0411H	Insert disk for specified file
0412H	Error reading nonresident table
04FFH	INT 3FH handler unable to load segment
0500H	Resource management/user profile errors
0501H	Missing resource table
0502H	Bad resource type
0503H	Bad resource name
0504H	Bad resource file
0505H	Error reading resource
0600H	Atom management errors
0700H	Input/output package errors

Windows 3.1 Fatal Error Codes

Number	Description

0001H	Insufficient memory for allocation
0002H	Error realloc memory
0003H	Memory cannot be freed
0004H	Memory cannot be locked
0005H	Memory cannot be unlocked
0006H	Invalid GDI object
0007H	Invalid window handle
0008H	Cached display contexts are busy
0009H	No DefWindowProc present in window procedure
000AH	Clipboard already open
000BH	App did a GetDC and destroyed window without releasing DC
000CH	Invalid keyboard driver
000DH	Invalid mouse driver
000EH	Invalid cursor module
000FH	Unlocked data segment should be locked
0010H	Invalid lock on system queue
0011H	Caret is busy
0013H	One hwnd owns all the DCs (forgot to ReleaseDC?)
0019H	Illegal window style bits were set
001AH	App that registered a global class didn’t unregister it
001BH	Bad hook handle
001CH	Bad hook ID
001DH	Bad hook proc
001EH	Bad hook module
001FH	Bad hook code
0020H	Hook not allowed
0021H	Unremoved property
0022H	Bad property name
0023H	Bad task handle
0027H	Bad negative index for Get/Set/Window
0028H	Bad positive index for Get/Set/Window
0029H	App called DestroyWindow on a DialogBox window
002AH	Dialog control ID not found
002CH	Invalid hMenu
002DH	Invalid metafile pasted into Clipboard
002EH	MessageBox called with no message queue initialized
002FH	DLGWINDOWEXTRA bytes not allocated for dialog
0030H	Intertask send message with tasks locked
0031H	Invalid parameter passed to a function
0033H	Invalid function was called
0034H	LockInput called when input was already locked or when never locked.
0035H	SetWindowLong uses a NULL window procedure
0036H	SetWindowsHook is used to unhook
0037H	PostMessage failed due to full queue
0100H	Local memory manager errors
0200H	Global memory manager errors
0300H	Task scheduler errors
0400H	Dynamic loader/linker errors
0401H	Error booting
0401H	Unable to load a file
0500H	Resource manager errors
0501H	Missing resource table
0502H	Bad resource type
0503H	Bad resource name
0504H	Bad resource file
0506H	Bad parameter to profile routine
0600H	Atom manager errors
0700H	I/O package errors
0800H	Parameter checking RIP

Note:

Windows 3.1 also includes descriptive text with the fatal exit. You should always trust the description over the numbers, because the numbers are often used in multiple places.
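When a code is not listed exactly, the grouping in Figure 2 still tells you which part of Windows complained. The sketch below assumes that, for codes of 0100H and above, the high byte selects the subsystem, which is how the tables are laid out. The function name fatal_exit_category is hypothetical, and the strings are shortened versions of the table headings.

```c
/* Map a fatal exit code to the subsystem heading used in Figure 2.
   Assumption: for codes of 0100H and above the high byte selects the
   subsystem, as the grouping in the tables suggests. Codes below 0100H
   are individual errors and must be looked up one by one. */
static const char *fatal_exit_category(unsigned int code)
{
    switch ((code >> 8) & 0xFFu) {
    case 0x00: return "individual error (look up the exact code)";
    case 0x01: return "local memory manager";
    case 0x02: return "global memory manager";
    case 0x03: return "task scheduler";
    case 0x04: return "dynamic loader/linker";
    case 0x05: return "resource manager";
    case 0x06: return "atom manager";
    case 0x07: return "I/O package";
    case 0x08: return "parameter checking"; /* listed only in the second table */
    default:   return "unknown";
    }
}
```

For example, 0280H (invalid global handle) and 02C0H (GlobalLock count overflow) both report the global memory manager, which points you toward your GlobalAlloc and GlobalLock calls first.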

Sometimes you will get a zillion fatal exits, all streaming one after another, flying right off the screen. The first one is the one you need to catch; subsequent fatal exits are caused by Windows crashing.

When a fatal exit happens on your call to DefWindowProc, things get a bit trickier. This means that you have corrupted something in the Windows internal data structures. A really common cause is an invalid parent handle when the child calls DefWindowProc. Another is trashing USER’s DS so that Windows loses track of things. To find these types of fatal exits, you can use the INI tracker (see "The INI Tracker").

The INI Tracker

I have found the INI tracker to be a big help in debugging applications that only crash when running the retail version of Windows, especially when a whole bunch of other things are going on. The INI tracker also helps with those rare bugs that cause your system to reboot or drop to MS-DOS.

All the INI tracker does is write a line of debug code to an initialization file (INI) you specify. Let’s say you suspect that the bug is in the function UpdateRecords. You use INI tracking to verify this.

void UpdateRecords ( int iRecordNumber, LPSTR szRecord )
{
    WritePrivateProfileString(szAppName, szBugName,
                              "UpdateRecords", szBugIniFileName);

    // Force a flush of the INI file
    WritePrivateProfileString(NULL, NULL, NULL,
                              szBugIniFileName);

    // suspect code here

    WritePrivateProfileString(szAppName, szBugName, NULL,
                              szBugIniFileName);
    WritePrivateProfileString(NULL, NULL, NULL,
                              szBugIniFileName);
}

This code assumes that szAppName points to a string such as MyApp and szBugName points to a string such as Hang. If the suspect code did hang, there would be a line in your INI file:

[MyApp]
Hang=UpdateRecords

If the suspect code was successful, the line would be completely deleted from the INI file. Of course, there is a speed hit involved when you use this tracker and you should disable write-caching on your system to ensure that strings get written to the INI file when you want them to. But it is very easy to program (especially if you write C macros to automate it for you), and it leaves no files open or data structures around. After your machine finishes rebooting, you can check the INI file and quickly find out if the problem happened where you suspected.

This tracking mechanism is most helpful when other methods won’t work, or when the problem only happens on the user’s machine (after all, users find most of the bugs). You can leave this code in your retail version simply by using conditional logic to write the profile information. If a customer is running your application and it hangs, you can instruct them to do some backdoor type of operation to turn on the INI tracker. You can then have them tell you what the INI says, and then you can get to work.
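Here is a rough sketch of what the automating macros might look like. To keep it compilable outside Windows, a plain text file stands in for the INI file and stdio calls stand in for WritePrivateProfileString, so this is an analogy rather than the real thing: it wipes the whole file on success instead of deleting one key. TRACK_FILE, track_enter, track_leave, TRACK_ENTER, and TRACK_LEAVE are all invented names.

```c
#include <stdio.h>

/* All names here are made up for the sketch. A plain text file stands in
   for the INI file, and stdio stands in for WritePrivateProfileString. */
#define TRACK_FILE "bugtrack.ini"

/* Leave a breadcrumb naming the code that is about to run.
   Returns 1 on success, 0 on failure. */
static int track_enter(const char *app, const char *key, const char *where)
{
    FILE *fp = fopen(TRACK_FILE, "w");
    int ok;

    if (fp == NULL)
        return 0;
    ok = fprintf(fp, "[%s]\n%s=%s\n", app, key, where) > 0;
    /* fclose flushes the C library's buffer; it cannot force an
       operating-system disk cache to write the data out. */
    return (fclose(fp) == 0) && ok;
}

/* The suspect code survived: wipe the breadcrumb by truncating the file. */
static int track_leave(void)
{
    FILE *fp = fopen(TRACK_FILE, "w");   /* "w" truncates to zero length */

    if (fp == NULL)
        return 0;
    return fclose(fp) == 0;
}

/* The macros keep each call site down to one line. */
#define TRACK_ENTER(where) track_enter("MyApp", "Hang", (where))
#define TRACK_LEAVE()      track_leave()
```

You would put TRACK_ENTER("UpdateRecords"); at the top of the suspect function and TRACK_LEAVE(); at the bottom. In a real Windows build, the two function bodies would shrink to the WritePrivateProfileString pairs shown earlier, and the macros could expand to nothing when tracking is switched off.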

UAEs

A recent study by Microsoft finds that at least 66 percent of UAEs are caused by passing an invalid parameter to an API. Passing an hWnd instead of an hDC causes one. Using hIcon instead of hBitmap causes one. Passing a pointer instead of a handle causes one. Fortunately, Windows 3.1 has extensive parameter validation that makes most of these UAEs go away. However, programmers must continue to deal with them until everyone has upgraded to version 3.1.

A stack overflow can cause all sorts of problems. In Windows 3.0, if you overflow your stack, quite often your application will just shut down, without warning, without a fatal exit, without a UAE. Other times, your stack overflows just enough to dip into your static data area and corrupt some global variables. These are very easy to find once you implement a variable integrity test (discussed below).

Stack overflows can also cause UAEs. If you have a recursive function, don’t allocate a lot of local variables. Before you know it, your stack overflows and you trash your DS. While your stack pointer clashes with your global variables, you modify a global variable. This in turn trashes your stack frame. Now your local variables mysteriously change and you cause a UAE when you shouldn’t. Also, if you trash the wrong stack variable (the return address of the calling function, for example), you will get a UAE and hang the system rock solid.

Abusing local variables can also cause a UAE. Consider the following example:

BOOL bFunction( HWND hWnd )
{
    LPINT lpInt;    /* never initialized: points nowhere */

    SendMessage( hWnd, LB_GETSELITEMS, 10, (LONG)lpInt );
    return TRUE;
}

This isn’t really a Windows programming error; it is more of a C programming error. There is no memory allocated for those integers, just a pointer to nowhere. This line should cause a UAE immediately. However, imagine that it was changed slightly.

BOOL bFunction( HWND hWnd )
{
    int *pInt;      /* near pointer, still uninitialized */

    SendMessage( hWnd, LB_GETSELITEMS, 10,
                 (LONG)(LPINT)pInt );
    return TRUE;
}

This code will trash your DS. The long pointer you are supplying has a random offset, but the segment/selector has been cast to your own DS. Who knows what gets changed here? If you’re lucky, you will exceed your segment bounds and generate a UAE. In bigger programs, where the DS is close to 64KB in size, there is a greater chance that the offset will fall within segment bounds. So if your program worked fine when it was big, but then a UAE occurred when you removed global variables, there is a chance that you are doing something like this. Now, here is the real gotcha on the above code: since you are trashing your DS, another function may crash and burn, even though its code is perfectly fine. Bummer.

Memory overwrites are the easiest UAEs to find. Here you have an invalid pointer and you are going to dereference it. CodeView usually can put you on the offending line when this happens. Usually the chunk of memory you allocated was not big enough, or you used an integer instead of a word when indexing arrays.
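The integer-versus-word mistake is worth seeing in numbers. In 16-bit code an int tops out at 32,767, so an index beyond that goes negative and the access lands in front of your block instead of inside it. The sketch below uses int16_t and uint16_t to reproduce the 16-bit widths on a modern compiler; the function names are made up. Converting an out-of-range value to a signed type is implementation-defined before C23, but mainstream compilers wrap it modulo 65,536.

```c
#include <stdint.h>

/* What an element index turns into when it is stored in a 16-bit
   signed int, the size of int on 16-bit Windows. */
static long index_as_int(unsigned long index)
{
    int16_t i = (int16_t)index;   /* wraps for values above 32767 */
    return (long)i;
}

/* The same index stored in a 16-bit unsigned word. */
static long index_as_word(unsigned long index)
{
    uint16_t w = (uint16_t)index;
    return (long)w;
}
```

An index of 40,000 survives intact as a word but comes back as -25,536 when held in an int, which is exactly the kind of stray write CodeView ends up pointing at.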

If you are using a huge memory object, make sure always to use huge pointers. If you don’t, the C compiler assumes large data, and does not do the correct segment arithmetic. (Also, huge memory management is slow, so avoid it if you can.)

Forgetting to export a function in the DEF file can cause a UAE. Forgetting to MakeProcInstance in an application can cause a UAE. Everybody makes these errors from time to time.

Trashing the data segments belonging to Windows will cause a UAE also. Here you can usually figure out that a Windows DS was trashed, because you will get one UAE after another until the system hangs. If your application has a UAE, the application usually just goes away. When Windows has a UAE, you need to reboot.

Here’s the third rule of debugging: always, always, always check the return values of your memory-management calls! If you do a GlobalAlloc and/or GlobalLock, make sure it succeeded. GlobalLock likes to return NULL under low memory conditions (or if the memory was discarded), and you will cause a UAE if you don’t react accordingly.

When you get a UAE, the CodeView debugger can often point you to the offending line of code. Unlike fatal exits, you won’t need to read stack traces. However, if the UAE happens in a seemingly useless hunk of assembly-language code, check the Calls menu in CodeView and hope one of your functions is listed. The assembly-language code means that there is no symbolic information, which usually means that the UAE happened in Windows. You can use the INI tracker if you have no idea where the bug occurs. If you have a pretty good idea, you can call OutputDebugString to dump a bunch of information to the secondary monitor. When the UAE shows up, see what happened last.

OutputDebugString is superior to MessageBox for debugging. MessageBox has a great name--it causes lots of extra messages to be sent to your application, which practically ruins your chance of finding the bug (remember the Heisenberg Uncertainty Principle). OutputDebugString is fast and does not affect the flow of your program. It also stays on the screen long after your program dies. If you are using a serial terminal, the information stays on the screen even after you reboot.

Trashed Data

Trashed data causes UAEs, fatal exits, ulcers, and angry users. There are lots of ways to trash your DS. Basing your data on message ordering or return values can burn you. Make sure that you do not count on a message being sent to set up a pointer or structure. Message queues get full, and extra PeekMessage loops can totally randomize the order of message processing. If you depend on another message to have happened, use a flag instead. Better yet, don’t code in these dependencies.

Other applications killing you is a concern. Someone who did not debug their application may be trashing your DS and causing you grief. Unfortunately, there are not a lot of things you can do to prevent this. You could put as much of your data as possible into globally allocated chunks of memory, and use the DOS Protected Mode Interface (DPMI) to protect the selectors (see below), but this won’t protect you from trashing yourself. A better approach is to put some code in to check the integrity of your variables, and have your program bail out if these variables have been corrupted. (You could also call the developers of the offending application and tell them what their program is up to!)

Mismatching parameters is a good way to cause data to get trashed. Passing pointers around can easily get you into trouble. Using strict function prototypes (as defined in WINDOWS.H) can help avoid switching params. Avoiding typecasting wherever you can also helps.

Run-time functions that assume near data will get you into boatloads of trouble. When you see the compiler warning "Segment Lost In Conversion," be careful. This usually means that you are passing a far pointer to a function that expects a near pointer. What does the compiler use for the DS? Yours, of course. The C run-time library can get you into trouble, especially the sprintf function. Your best bet is to avoid the C run-time library whenever you can and use Windows API functions like wsprintf and lstrcpy instead. They were written with the full knowledge of all of the gotchas that exist in Windows.

If your DS gets trashed, there is a high probability that your global variables are going to change value unexpectedly. You can implement a crude device to track some of these violations. Consider the following declaration of global variables:

HWND ghWnd;
HANDLE ghInst;
char szFullName[20];
char szAppName[] = "BADAPP";
char szCaption[] = "Buggy Windows App";
char szMenuName[] = "PlainMenu";
char szDialogTemplateName[] = "MODALDIALOG";

These variables sit next to one another in the DS. If they are getting trashed, you can sandwich them between assertion variables, each preset with a value that should never change. Define two macros like this:

#define ASSERT_VAR(p1) int p1 = 42
#define UNDEF_ASSERT_VAR(p1) int p1

The value 42 is arbitrary; use your favorite number (except 0). Now, change the declarations to look like this:

UNDEF_ASSERT_VAR(asv1);
    HWND ghWnd;
UNDEF_ASSERT_VAR(asv2);
    HANDLE ghInst;
UNDEF_ASSERT_VAR(asv3);
    char szFullName[20];
UNDEF_ASSERT_VAR(asv4);

ASSERT_VAR(asv5);
    char szAppName[] = "BADAPP";
ASSERT_VAR(asv6);
    char szCaption[] = "Buggy Windows App";
ASSERT_VAR(asv7);
    char szMenuName[] = "PlainMenu";
ASSERT_VAR(asv8);
    char szDialogTemplateName[] = "MODALDIALOG";
ASSERT_VAR(asv9);

The indenting of the "real" variables makes it more readable. Uninitialized variables are sandwiched between assertion variables declared using the UNDEF_ASSERT_VAR macro. Initialized variables are sandwiched between assertion variables declared using the ASSERT_VAR macro. This distinction is important because the Microsoft C/C++ compiler version 7.0 does not keep every variable in the order you see here: the preinitialized variables are grouped together and moved to an adjacent area of the DS. Separate macros are needed so that each assertion variable lands in the same area as the program variables it guards.

The InitializeDSSafeGuard function must then be called to initialize the assertion variables that were declared using the UNDEF_ASSERT_VAR macro.
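The body of InitializeDSSafeGuard is not shown here, so the following is only a guess at its shape, written as portable C. A struct pins down the order of the members, which is a departure from the loose globals above, and unsigned short stands in for the 16-bit HWND and HANDLE types. GUARD_VALUE, struct GuardedData, and FirstTrashedGuard are invented names.

```c
#define GUARD_VALUE 42   /* the same arbitrary value used above */

/* Placeholder layout for this sketch. A struct pins the member order,
   which loose globals do not guarantee; unsigned short stands in for
   the 16-bit HWND and HANDLE types. */
struct GuardedData {
    int            asv1;
    unsigned short ghWnd;
    int            asv2;
    unsigned short ghInst;
    int            asv3;
    char           szFullName[20];
    int            asv4;
};

static struct GuardedData g;

/* Give every uninitialized guard its known value. Call once at startup,
   before the first integrity check runs. */
static void InitializeDSSafeGuard(void)
{
    g.asv1 = GUARD_VALUE;
    g.asv2 = GUARD_VALUE;
    g.asv3 = GUARD_VALUE;
    g.asv4 = GUARD_VALUE;
}

/* Return the number of the first damaged guard, or 0 if all are intact. */
static int FirstTrashedGuard(void)
{
    if (g.asv1 != GUARD_VALUE) return 1;
    if (g.asv2 != GUARD_VALUE) return 2;
    if (g.asv3 != GUARD_VALUE) return 3;
    if (g.asv4 != GUARD_VALUE) return 4;
    return 0;
}
```

Skip the initialization call and the very first check reports a false alarm, because static data starts out as zero rather than 42. That is the whole reason the call is mandatory.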

Next, add a function to your program (see Figure 3). AssertTrashedDS puts up a system modal box displaying a history of the last ten values passed to it (see Figure 4). At the beginning of every one of your functions, add the following line:

AssertTrashedDS( UNIQUE_NUMBER);

Make sure you use a unique number for each call. When the sysmodal box comes up, you will have a ten-level history of your application. Hopefully this will give you enough information to find the code that caused that problem. For example, if the numbers displayed on the box looked like,

263,62,92,*,903,62,82,62,203,30

this would indicate that the assertion failed when you passed 92 to it. The asterisk indicates the end of the queue. Therefore, the DS was trashed in the code that executed between the time you called AssertTrashedDS(62) and AssertTrashedDS(92). You can add extra calls to AssertTrashedDS to narrow down the place your DS is getting trashed.
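Reading the history by eye gets tedious after the tenth crash, so here is a small sketch that does it for you. It assumes the same layout as the example: ten slots, with a zero (displayed as the asterisk) marking the next free slot. HISTORY_SIZE and last_two_checkpoints are hypothetical names.

```c
#define HISTORY_SIZE 10

/* Given the ten-slot history, where a 0 marks the next free slot (shown
   as "*" in the message box), report the last two values passed in.
   Returns 1 on success, 0 if no end marker is present. */
static int last_two_checkpoints(const unsigned history[HISTORY_SIZE],
                                unsigned *previous, unsigned *latest)
{
    int i;

    for (i = 0; i < HISTORY_SIZE; i++) {
        if (history[i] == 0) {
            /* Step backwards from the marker, wrapping at slot 0. */
            *latest   = history[(i + HISTORY_SIZE - 1) % HISTORY_SIZE];
            *previous = history[(i + HISTORY_SIZE - 2) % HISTORY_SIZE];
            return 1;
        }
    }
    return 0;
}
```

Fed the ten numbers from the example, with 0 in place of the asterisk, it reports 62 as the previous checkpoint and 92 as the latest, so the trashing happened somewhere between those two calls. Because zero is the end marker, none of your unique numbers should be zero.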

Figure 3 AssertTrashedDS

#define CHECK(p1,p2) \

if (42 != p1) \

{ \

if (!wBadVariable) \

{ \

wBadValue = p1; \

p1 = 42; \

wBadVariable = p2; \

} \

}

void AssertTrashedDS ( WORD wAssertValue )

{

static WORD wQueueValue; // The next open spot in the queue

static WORD wQueue[10]; // The actual queue

WORD wBadVariable;

WORD wBadValue;

// These three lines add the passed in AssertionValue into the queue,

// and zero out the next spot (which will be displayed as an asterisk)

wQueue[wQueueValue] = wAssertValue;

wQueueValue = (wQueueValue+1)%10;

wQueue[wQueueValue] = 0;

// The CHECK macro above will only set the wBadValue and wBadVariable

// values if no assertion violation has occurred. Since it is possible

// that multiple assertion variables could have been trashed at the

// same time, we will continue this loop as long as *any* of the

// assertion variables are trashed. Note that the CHECK macro restores

// the trashed assertion variable once it has detected it. This

// prevents infinite loops.

do

{

// Resetting this variable makes the assumption that no trashing was done

wBadVariable = 0;

// Check all of the variables. Remember that once one trashing has been

// found, the remaining trashed variables will remain trashed. The next

// iteration of the do-loop will find the next trashed variable, and

// so on until all trashed variables have been reported.

CHECK(asv1,1);

CHECK(asv2,2);

CHECK(asv3,3);

CHECK(asv4,4);

CHECK(asv5,5);

CHECK(asv6,6);

CHECK(asv7,7);

CHECK(asv8,8);

CHECK(asv9,9);

CHECK(asv10,10);

CHECK(asv11,11);

// If we actually found a trashing, report it

if (wBadVariable)

{

int i;

char szMsg[255];

char szMsg1[255];

char szMsg2[128];

char szNum[10];

// Generate a string of the queue values, using a "*" for the zero

// value to indicate the end of the queue

*szMsg1 = 0;

for ( i = 0; i < 10; i++ )

{

if (wQueue[i])

wsprintf ( szNum, "%d", wQueue[i]);

else

lstrcpy ( szNum, "*" );

if (i != 9)

lstrcat ( szNum, "," );

lstrcat ( szMsg1, szNum );

}

// Create a cute string showing the programmer/unlucky enduser the list

wsprintf ( szMsg, "%s (%s)",

(LPSTR)"Assertion Failure",

(LPSTR)szMsg1

);

// Create a caption saying which variable was trashed, and its value

wsprintf(szMsg2, "Assertion Variable asv%d=%d", wBadVariable, wBadValue );

// Alert the programmer/unlucky enduser

MessageBox ( NULL, szMsg, szMsg2, MB_SYSTEMMODAL );

} // end if bad value

}

while (wBadVariable); // Keep doing this until assertion variables are clean

} // end function

void InitializeDSSafeGuard(void)

{

asv1 = 42;

asv2 = 42;

asv3 = 42;

asv4 = 42;

}

Figure 4 AssertTrashedDS displays the last ten values passed to it.

This method is not bulletproof. It uses static variables, which themselves can become corrupted. Also, it does not check every single byte of your DS (this would be impossible, since every time you change a global variable, the bytes in the DS change). However, if your DS is getting trashed randomly, putting in this code will eventually point out the cause. Once your DS is trashed, bailing out fast is a great idea.

Using OutputDebugString to display LONG strings that you create in your DS can also help. For example, add a line like this:

OutputDebugString ( "Windows programming makes me drink too much.");

Watch for this message. See if it changes value. If the following message appears on your debugging monitor, you know you have problems:

Windows programm26^%#&!!!!!!!!!!!!!!!! <beep>
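If you would rather not watch the monitor, one way to automate the same check is to take a checksum of the string while it is known to be good and compare it later. This is only a sketch under my own assumptions: the names g_szCanary, Checksum, CanaryInit, and CanaryIsIntact are hypothetical, and the checksum is a plain polynomial sum chosen for brevity, not anything Windows provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical canary: a long string that lives in the data segment. */
static char g_szCanary[] = "Windows programming makes me drink too much.";
static unsigned long g_ulCanarySum;  /* checksum taken while the data is good */

/* Polynomial checksum over n bytes; any stable hash would do here. */
static unsigned long Checksum(const char *p, size_t n)
{
    unsigned long sum = 0;
    size_t i;

    assert(p != NULL);
    for (i = 0; i < n; i++)
        sum = sum * 31UL + (unsigned char)p[i];
    return sum;
}

/* Call once at startup, before anything has a chance to scribble. */
static void CanaryInit(void)
{
    g_ulCanarySum = Checksum(g_szCanary, sizeof g_szCanary);
}

/* Returns 1 if the canary still matches its startup checksum. */
static int CanaryIsIntact(void)
{
    return Checksum(g_szCanary, sizeof g_szCanary) == g_ulCanarySum;
}
```

Calling CanaryIsIntact from the same places you call AssertTrashedDS costs little, and it covers a stretch of the DS that the assertion variables do not.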

Also, you may suspect that a string constant is getting trashed. Look at a call to CreateWindow:

hWnd = CreateWindow ( "edit",

"My Window",

/* all the rest */

);

If the bytes in the DS that hold "edit" get trashed, your CreateWindow call will most likely fail or crash. But CodeView3 won’t let you examine the bytes because there is no symbol for them. Consider adding a global variable like this,

char szEditClassName[] = "edit";

and use it throughout your program. This is a good idea because you only have those characters stored in your DS once, and you can examine them with the CodeView debugger or many other tools.

You can also use the INI tracker to figure out what went wrong. The INI tracker isn’t perfectly suited for this type of debugging, unless a wild write causes a crash. But I’ve been talking it up in every other section, so one more plug couldn’t hurt.

Hangs

When your program hangs, you have committed the perfect crime. You can’t look at any data, you can’t switch to another application, you can’t do anything except reboot. Lots of things cause hangs. Generating a message in response to a message is one. One cute way to cause a hang is to call InvalidateRect after an EndPaint. Your screen will redraw itself into oblivion. Generating a MessageBox in response to a WM_KILLFOCUS could get you, too, because the MessageBox will cause the WM_KILLFOCUS message to happen. Types of hangs where the screen does something are generally easy to find.

An infinite loop is a lot harder to find. Here your application seems to take a really long time to do that recalc, and you notice that you cannot task-switch. The mouse still moves, but nothing else works. Logic errors usually cause infinite loops.

Passing an invalid parameter can cause a hang, but usually something else (a UAE or a fatal exit) happens first.

OutputDebugString is excellent for finding hangs. When your program hangs, look at the secondary monitor. If a bunch of stuff is still flying by, you can easily go to your code and find the loop. If nothing moves on the secondary monitor, then your infinite loop has no OutputDebugString calls in it, or you have killed Windows.

Killing Other Applications

Wild writes are the most common reason another application gets killed. Unfortunately, there is not much you can do to prevent yourself from getting wiped out by another application. You could try to keep as little in your data segment as possible, so there is less chance of it getting hit. You could check your data integrity variables to see if their data changes value when it should not. The best thing to do, though, is to make sure that your application does not kill anything else. After a while, the applications that kill others will be pointed out, and they will be fixed.

Using another application’s window extra bytes is the most common way to trash someone else. Extra bytes are simply a LocalAlloc call made by USER. If you overstep the boundaries, you walk all over something else allocated by USER. That object could be a DC, somebody else’s extra bytes, a menu, who knows? When your buttons start changing their text from "OK" to "(3#!h," chances are very good that USER’s DS is getting trashed.

Using invalid handles can also wipe out another application. If you use the wrong parameters in the DeleteDC call, you could cause this DC to be left around for a very long time.

Not checking return values can cause all kinds of trouble. For example, look at the following code:

case WM_INITDIALOG:

EnableMenuItem( GetSystemMenu(hDlg, FALSE),

SC_CLOSE, MF_GRAYED );

return TRUE;

Innocent enough until you remember that you got rid of the system menu from that dialog box. Who knows what happens inside USER? Hopefully, USER will just ignore it, but it’s a good idea not to make that assumption.
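The safer shape is to keep the returned handle, test it, and only then pass it on. The sketch below shows that shape with stand-in functions so it can run anywhere. StubGetSystemMenu and StubEnableMenuItem are hypothetical stubs of my own, not the real USER functions, and the flag g_hasSystemMenu simply models a dialog that was created without a system menu.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch runs anywhere. These are hypothetical stubs,
   NOT the real USER declarations. */
typedef void *MenuHandle;
static int g_hasSystemMenu = 0;  /* 0 models a dialog with no system menu */
static int g_enableCalls = 0;    /* counts calls that reached the stub */

/* Stub: hands back a null handle when there is no system menu. */
static MenuHandle StubGetSystemMenu(void)
{
    return g_hasSystemMenu ? (MenuHandle)&g_hasSystemMenu : NULL;
}

/* Stub: complains loudly about a null handle instead of guessing. */
static void StubEnableMenuItem(MenuHandle hMenu)
{
    assert(hMenu != NULL);
    g_enableCalls++;
}

/* Gray the Close item only if a menu handle actually came back.
   Returns 1 if the item was changed, 0 if there was no menu to change. */
static int GrayCloseItemChecked(void)
{
    MenuHandle hMenu = StubGetSystemMenu();

    if (hMenu == NULL)
        return 0;   /* never hand a null handle to the next call */
    StubEnableMenuItem(hMenu);
    return 1;
}
```

In a real dialog procedure the same three steps would wrap GetSystemMenu and EnableMenuItem, and the zero return gives you a place to log the problem.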

Not exporting functions in the DEF file is another great way to trash other applications or USER. By not exporting your function, you are not setting up the thunk correctly. So, instead of Windows setting the DS to your application, it uses the DS of . . . of . . . USER! Or, in rare cases, another application’s DS. In any case, it won’t usually be your DS. If you are lucky, the phantom DS you are using will be smaller than yours, so you’ll cause a UAE when you access a variable high in "your" DS (because you’ll exceed their segment bounds).

A method you can use to help ensure that your variables don’t get trashed by somebody else can be found in Figure 5. Each instruction is in its own _asm segment because the C compiler removes the carriage return/linefeed when expanding the macro. This method uses the macros UNWRITEPROTECTsc and WRITEPROTECTsc to call the DPMI_SETACCESSRIGHTS function to set the read/write bit of a selector. This way, if another application tries to write over your precious data it will cause a UAE (versus your application causing a UAE when it accesses corrupted data).

Figure 5 Protecting Your Variables

/* Author: Bryan Woodruff */

#define DPMI_SETACCESSRIGHTS 9

#define WRITEPROTECTsc(wSelector) \

{ \

_asm \

{ \

mov bx, wSelector \

} \

_asm \

{ \

lar cx, bx \

} \

_asm \

{ \

xchg cl, ch \

} \

_asm \

{ \

and cl, 0xFD \

} \

_asm \

{ \

mov ax, DPMI_SETACCESSRIGHTS \

} \

_asm \

{ \

int 0x31 \

} \

}

#define UNWRITEPROTECTsc(wSelector) \

{ \

_asm \

{ \

mov bx, wSelector \

} \

_asm \

{ \

lar cx, bx \

} \

_asm \

{ \

xchg cl, ch \

} \

_asm \

{ \

or cl, 2 \

} \

_asm \

{ \

mov ax, DPMI_SETACCESSRIGHTS \

} \

_asm \

{ \

int 0x31 \

} \

}

Your Precious Variables

typedef struct tagPRECIOUSDATA

{

int iValue;

char szCode[100];

}

PRECIOUSDATA;

typedef PRECIOUSDATA FAR *LPPRECIOUSDATA;

#define precious_iValue (lpPreciousData->iValue)

#define precious_szCode (lpPreciousData->szCode)

LPPRECIOUSDATA lpPreciousData;

WORD wPreciousSel;

Initialization

lpPreciousData = (LPPRECIOUSDATA)GlobalLock(GlobalAlloc ( GHND, sizeof(PRECIOUSDATA)));

wPreciousSel = HIWORD(lpPreciousData);

if (!lpPreciousData)

{

/* kill your app here */

}

else /* this write protects it at init time */

{

WRITEPROTECTsc(wPreciousSel)

}

Shutdown

{

HANDLE hMem = LOWORD(GlobalHandle( wPreciousSel ));

GlobalUnlock(hMem);

GlobalFree(hMem);

}

Usage

UNWRITEPROTECTsc(wPreciousSel)

precious_iValue = 42;

lstrcpy(precious_szCode, "Kirk here");

WRITEPROTECTsc(wPreciousSel)

Parameter Validation

As mentioned, Windows 3.1 validates every parameter upon entry into a function. In fact, it does a very thorough job. If you pass a pointer to a function, Windows checks the LDT to see if that pointer is valid and if that pointer addresses a block of memory large enough to allow the called function to work correctly. If you pass in handles to objects, Windows checks them and if you pass in flags, Windows makes sure you are using the correct combination of flags. If a UAE happens, it is going to happen in your code, not Windows code.

There are three levels of behavior you can observe as a result of parameter validation. The first level is what the Average End User sees. Most likely, this user will not have Dr. Watson running, and will not have ShowInfo=par in the [Dr. Watson] section of WIN.INI. In this case, functions called with invalid parameters just return a failure code, and if the offending program does no error recovery, it merrily goes on its way while the user wonders why his document didn’t print (for example).

The second level is what the Curious End User sees. This user has Dr. Watson running, and has ShowInfo=par in the [Dr. Watson] section of WIN.INI. In this scenario, functions called with invalid parameters notify the kind Dr. Watson, and their medical records are saved in the DRWATSON.LOG file. The user will not only wonder why the document did not print, but also why the Dr. Watson log keeps getting quite large. If you are lucky, he will send you his files and you can fix those bugs.

The third level, where you should always be if you are developing software, is the Perfectionist Developer level. This person has the Doctor running, has WIN.INI set correctly, and has the debug version of Windows installed. In this setup, every single thing you do wrong (and this includes forgetting to clean up objects) will be reported. If you can run your application in the Perfectionist Developer level cleanly, then you are halfway to the road of Software Coolness.

The information Dr. Watson gives you is invaluable. The stack traces have the same format as in Figure 6. You can implement parameter validation in your own functions using ToolHelp. For example, the SetFocus function may look something like this:

HWND SetFocus( HWND hWnd )

{

if (!IsWindow(hWnd))

{

OutputDebugString("Invalid hWnd sent to SetFocus()");

return NULL;

}

/* all the rest */

}

This will make programming Windows a whole lot easier. When you mix up parameters or do things wrong, Windows lets you know about it.

Figure 6 BADAPP's Dr. Watson Log

Dr. Watson 0.80 Failure Report - Fri May 1 01:21:25 1992

BADAPP had a 'Invalid Parameter (6041)' fault at USER 8:11c9

$tag$BADAPP$Invalid Parameter (6041)$USER 8:11c9$param is 45000000$Fri May 1 01:21:25 1992

$param$, Invalid handle passed to USER 8:11c9: 0x0000

Stack Dump (stack)

Stack Frame 0 is USER 8:11c9 ss:bp 33e7:cb8c

Stack Frame 1 is BADAPP 5:0647 ss:bp 33e7:cba8

Stack Frame 2 is USER 23:036b ss:bp 33e7:cbc0

Stack Frame 3 is USER 23:03a4 ss:bp 33e7:cbd8

Stack Frame 4 is USER 1:3a6c ss:bp 33e7:cbf8

Stack Frame 5 is USER 22:08c8 ss:bp 33e7:ccd4

Stack Frame 6 is USER 22:0080 ss:bp 33e7:ccfa

Stack Frame 7 is USER 22:0023 ss:bp 33e7:cd12

Stack Frame 8 is BADAPP 58:0983 ss:bp 33e7:cd5a

Stack Frame 9 is BADAPP 43:0c59 ss:bp 33e7:d0a2

Stack Frame 10 is BADAPP 41:2169 ss:bp 33e7:d54a

Stack Frame 11 is USER 1:3a6c ss:bp 33e7:d56a

Stack Frame 12 is BADAPP 19:2aaa ss:bp 33e7:d6aa

Stack Frame 13 is USER 1:27c4 ss:bp 33e7:d6c0

Stack Frame 14 is BADAPP 14:0e8e ss:bp 33e7:d70c

Stack Frame 15 is BADAPP 1:00ac ss:bp 33e7:d742

If you want to get full parameter validation and full debugging information, be sure to use the debugging version of Windows 3.1, and include the following lines in your WIN.INI:

[Dr. Watson]

ShowInfo=par

Now, let’s cause a parameter validation error. In fact, let’s use the code from above (the bad WM_INITDIALOG code):

case WM_INITDIALOG:

EnableMenuItem( GetSystemMenu(hDlg, FALSE),

SC_CLOSE, MF_GRAYED );

return TRUE;

When you run it, a huge amount of information is put in the Dr. Watson log. The useful information is shown in Figure 6.

Then check the DEF file in Figure 7. The italicized line (FLAGS_TEXT) is the fifth segment in the DEF file, which contains 5:0647. Now look at FLAGS.COD (which was generated using the -Fc compiler switch):

;|***

;|*** EnableMenuItem( GetSystemMenu(hDlg, FALSE),

;|*** SC_CLOSE, MF_GRAYED );

; Line 201

*** 000632 ff 76 0e push WORD PTR [bp+14];hDlg

*** 000635 6a 00 push 0

*** 000637 9a 00 00 00 00 call FAR PTR GETSYSTEMMENU

*** 00063c 50 push ax

*** 00063d 68 60 f0 push -4000

*** 000640 6a 01 push 1

*** 000642 9a 00 00 00 00 call FAR PTR ENABLEMENUITEM

;|*** return TRUE;

; Line 202

As you can see, the call to EnableMenuItem is five bytes of code, which when added to 0642 gives you the 0647. Therefore, you can be assured that the parameter validation problem happened at this line.
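The arithmetic is easy to get wrong by hand when you have a page of stack frames to check, so here is a small sketch of it. It assumes the frame address is a return address (the offset of the call plus the size of the call instruction) and that the offset in the log is hex. Reading the segment number as decimal is my assumption, and the helper names ParseFrameAddress, ReturnOffsetOfFarCall, and FrameMatchesCall are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* A far call is 5 bytes: opcode 9A plus a 4-byte segment:offset operand,
   as the listing above shows. */
#define FAR_CALL_SIZE 5u

/* Hypothetical helper: split a "segment:offset" pair such as "5:0647".
   Assumes the offset is hex and reads the segment number as decimal.
   Returns 1 on success, 0 if the text does not have that shape. */
static int ParseFrameAddress(const char *text, unsigned *pSeg, unsigned *pOff)
{
    assert(text != NULL && pSeg != NULL && pOff != NULL);
    return sscanf(text, "%u:%x", pSeg, pOff) == 2;
}

/* The address in a stack frame is a return address: the offset of the
   call instruction plus the size of that instruction. */
static unsigned ReturnOffsetOfFarCall(unsigned callOffset)
{
    return callOffset + FAR_CALL_SIZE;
}

/* Returns 1 if frameText is the return address of the far call found at
   seg:callOffset in the listing, 0 otherwise. */
static int FrameMatchesCall(const char *frameText, unsigned seg,
                            unsigned callOffset)
{
    unsigned frameSeg, frameOff;

    if (!ParseFrameAddress(frameText, &frameSeg, &frameOff))
        return 0;
    return frameSeg == seg && frameOff == ReturnOffsetOfFarCall(callOffset);
}
```

For the frame above, FrameMatchesCall("5:0647", 5, 0x0642) succeeds, while the GetSystemMenu call at 0637 does not match, which points at EnableMenuItem as the call that was handed the bad parameter.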

Figure 7 BADAPP.DEF

NAME BADAPP

EXETYPE WINDOWS

STUB 'WINSTUB.EXE'

CODE MOVEABLE DISCARDABLE LOADONCALL

DATA MOVEABLE MULTIPLE PRELOAD

HEAPSIZE 4096

STACKSIZE 12288

SEGMENTS

_TEXT MOVEABLE PRELOAD DISCARDABLE

DDE_TEXT MOVEABLE LOADONCALL DISCARDABLE

REPS_TEXT MOVEABLE LOADONCALL DISCARDABLE

OFFICE_TEXT MOVEABLE LOADONCALL DISCARDABLE

FLAGS_TEXT MOVEABLE LOADONCALL DISCARDABLE

BATCH_TEXT MOVEABLE LOADONCALL DISCARDABLE

DATE_TEXT MOVEABLE LOADONCALL DISCARDABLE

MEMORY_TEXT MOVEABLE LOADONCALL DISCARDABLE

HEADERS_TEXT MOVEABLE LOADONCALL DISCARDABLE

GETFONT_TEXT MOVEABLE LOADONCALL DISCARDABLE

STATUS_TEXT MOVEABLE LOADONCALL DISCARDABLE

COLORS_TEXT MOVEABLE LOADONCALL DISCARDABLE

Generating your own parameter validation routines for every function in your Windows-based application is a great idea. If you verified that every function took on legal, valid parameters, you could quickly isolate a lot of problems, and also be assured that a function is completely stable. Life would be easier if there were Windows APIs that determine whether or not pointers are valid. The Windows 3.1 SDK provides six new functions, IsBadReadPtr, IsBadWritePtr, IsBadHugeReadPtr, IsBadHugeWritePtr, IsBadCodePtr, and IsBadStringPtr, to do this. To validate GDI and USER handles, you can use ToolHelp to walk USER and GDI’s local heaps, checking each object until the hHandle field of the LOCALENTRY structure matches your handle. Once these fields match, check the wType field for validity. Note that this is a slow process. At the very least you should verify parameters of your own data types. Look at the following structure type:

typedef struct tagDATA

{

#ifdef PARAMVALIDATION

WORD wUniqueID;

#endif

int iType;

long lNum;

char szName[15];

} DATA;

Whenever you allocate this structure, fill in the wUniqueID with a number reserved for that structure type. You can verify that the wUniqueID is valid when you pass the structure, and you can check each of the normal elements to make sure that iType has a valid number in it (for example, iType may only be allowed to be 0 through 10). Of course, proper prototyping can cause warnings for mismatched parameters, but prototyping cannot verify that 0 ≤ iType ≤ 10.
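A validation routine for this structure might look something like the following sketch. The tag value DATA_UNIQUE_ID and the helpers InitData and IsValidData are names I made up for the illustration, the WORD typedef is a stand-in so the code compiles without the Windows headers, and the 0 through 10 range is just the example range from the text.

```c
#include <assert.h>
#include <string.h>

#define PARAMVALIDATION           /* normally defined only for debugging builds */
#define DATA_UNIQUE_ID 0x4441u    /* hypothetical tag reserved for the DATA type */

typedef unsigned short WORD;      /* stand-in for the Windows typedef */

typedef struct tagDATA
{
#ifdef PARAMVALIDATION
    WORD wUniqueID;
#endif
    int  iType;
    long lNum;
    char szName[15];
} DATA;

/* Hypothetical helper: fill in a new DATA and stamp it with its tag. */
static void InitData(DATA *pData, int iType, long lNum, const char *pszName)
{
    assert(pData != NULL && pszName != NULL);
    memset(pData, 0, sizeof *pData);
#ifdef PARAMVALIDATION
    pData->wUniqueID = DATA_UNIQUE_ID;
#endif
    pData->iType = iType;
    pData->lNum = lNum;
    /* The buffer was zeroed, so copying size-1 bytes keeps it terminated. */
    strncpy(pData->szName, pszName, sizeof pData->szName - 1);
}

/* Returns 1 if pData passes every check, 0 otherwise. */
static int IsValidData(const DATA *pData)
{
    if (pData == NULL)
        return 0;
#ifdef PARAMVALIDATION
    if (pData->wUniqueID != DATA_UNIQUE_ID)
        return 0;                 /* not a DATA, or it was trashed */
#endif
    if (pData->iType < 0 || pData->iType > 10)
        return 0;                 /* outside the example range from the text */
    if (memchr(pData->szName, '\0', sizeof pData->szName) == NULL)
        return 0;                 /* name is not terminated */
    return 1;
}
```

Every function that takes a DATA pointer can then start with a call to IsValidData and bail out (or shout through OutputDebugString) when it returns 0.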

By requiring that all aspects of any parameter are acceptable, you can keep the bugs in the same modules that they originate from.

I have used almost every type of debugging tool during my development career. Although the CodeView debugger for Windows is great, and WDEB386 has its place, building your own tools and debugging techniques can sometimes find bugs you would never have found otherwise. Use your imagination. If you have a really nifty debugging technique that you’d like to share, send it to me in care of this magazine. I’d really like to hear about it.

1For ease of reading, "Windows" refers to the Microsoft Windows operating system. Windows is a trademark that refers only to this Microsoft product.

2For ease of reading, "MS-DOS" refers to the Microsoft MS-DOS operating system. MS-DOS is a trademark that refers only to this Microsoft product.

3For ease of reading, "CodeView" refers to the Microsoft CodeView debugger for Windows. CodeView is a trademark that refers only to this Microsoft product.