Bugslayer, MSJ, October 1998

This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

October 1998

Download Oct98BugSlayer.exe (155KB)

John Robbins is a software engineer at NuMega Technologies Inc. who specializes in debuggers. He can be reached at john@jprobbins.com.

The other day my good friend Jim Austin and I were swapping debugging war stories, trying to outdo each other with our debugging prowess. When Jim started bragging about finding and fixing a nasty bug in my code, I quickly changed the topic: What are the worst problems to debug? It only took a microsecond for us to agree that multithreaded deadlocks were the absolute worst. Even if you think you have planned for every contingency, your application stops dead when you least expect it.
      Jim and I started comparing notes on what techniques have worked for us in the past, and we laughed when we realized that luck was number one. As Jim pointed out, luck does not ship good code on time. We had a few more scientific techniques for the design phase, but during the debugging phase, when the code is actually deadlocking, we only had luck to go on. Jim then threw down the gauntlet and issued this challenge: "You're the Bugslayer guy, figure something out."
      This month I'll discuss a few tricks that have worked for me when doing multithreaded programming. I developed a utility, DeadlockDetection, that tells you where the deadlock occurred in your code. I designed DeadlockDetection with a great deal of extensibility in mind because I want to do more with it in a future column. Also, the techniques I used can be applied to different bugslaying utilities, so don't be surprised if I use some of them in the future. After the tips and tricks, I will discuss the design requirements I had for DeadlockDetection, followed by the key implementation details.

Multithreading Tips and Tricks
      So you want to add multithreading to your application? The first tip is very simple: don't. While I am being somewhat facetious, the idea is to only multithread if you absolutely have to. Starting another thread means that you can easily add a month to your development schedule and complicate your life considerably. Try to exhaust other design approaches before looking at multithreading.
      If you must multithread, try to keep it to very small, discrete pieces. The general rule of thumb I use is to not multithread your user interface. Stick to small pieces of work that generally are devoid of any user interface. For example, printing in the background is a good use of multithreading because the user can still use your application to get things done while the data is printing.
      Put your synchronization methods at the lowest level possible in your code. If you have a critical section for protecting a piece of data, put the EnterCriticalSection and LeaveCriticalSection around just the actual data access. Doing this ensures that you are indeed protecting the item you are supposed to protect. This will help you avoid an inadvertent deadlock because you have functions that are removed from the data-grabbing synchronization objects. One of the nastiest deadlock problems I ever caused was grabbing the synchronization object two functions above where I needed it.
      When you design your multithreaded system, always pretend that each thread is running with real-time priority on its own dedicated CPU. The idea is to find those critical areas where one thread will be waiting on something that another has locked. When the design is close to completion, it is a good idea to have all the developers sit down in a room with each thread assigned to a developer. Like a code walk-through, this is a thread walk-through. Each "thread person" executes their piece of the design. They only pay attention to the code in their thread and indicate when they are blocked and what they are blocked on. Others in the room can keep track of who is blocked and when they blocked, and start looking for deadlocks.
      Keep in mind that the operating system uses synchronization objects inside your process, so you can accidentally deadlock with those as well. The process-critical section and the infamous Win16 mutex are both synchronization objects that the operating system uses in your process. Additionally, message handling can cause deadlocks if you are not careful. If thread A is a UI thread and is waiting for a critical section currently owned by thread B, and if thread B does a SendMessage to an HWND in thread A, you will deadlock.
      Finally, the most important tip of all when it comes to multithreaded programming is to always test thoroughly on multiprocessor machines. This does not mean simply running your application through a few paces. I mean really using it. Even if your application runs perfectly on single-processor machines, a multiprocessor machine will turn up deadlocks you never thought possible.
      The best way to test your application is to have the team's developers running on multiprocessor machines. If you are a manager and you do not have any multiprocessor machines in your shop, stop reading right now and equip half of your developers and quality assurance testers with multiprocessor machines! If you are a developer, show this column to your manager and demand your multiprocessor machine!

DeadlockDetection Requirements
      As you probably noticed, I haven't provided any tips on what to do when you get that deadlock from out of the blue. So, here's a list of requirements for DeadlockDetection:

Show exactly where the deadlock happens in the user's code. A tool that only tells you that EnterCriticalSection is blocked does not help much. You want to get back to the address, and consequently the source and line, where the deadlock occurred so that it can be fixed quicker.
Show what synchronization object caused the deadlock.
Show what Windows® function is blocked and the parameters that were passed to the function. It helps to see timeout values and what other items were passed to the function.
Determine which thread caused the deadlock.
The utility must be lightweight so that it causes the least possible interference with the user's program. I will expand on this shortly.
The information output processing must be extensible. The information that is collected in a deadlock detection system can be processed in many ways, so you want to allow others to extend the information as they see fit.
It must be very easy for users to use the tool on their programs.
      One of the key points to keep in mind with a utility like DeadlockDetection is that it will definitely affect the behavior of the application it is observing. For you physics majors, this is like the Heisenberg uncertainty principle. DeadlockDetection can introduce deadlocks in your programs because the work that it does to gather the information will slow down your threads. I almost defined this as a feature because any time you can cause a deadlock in your code, you've identified a bug, which is the first step toward correcting it—better you, than your customer.

High-level Design Issues
      With these requirements, I had to figure out how to actually implement DeadlockDetection. First, I needed to decide how to get the information from the threading functions. I needed to intercept (or hook) the functions for which I wanted to record information. In my December 1997 column, I had an identical problem with the TraceSrvHelper program where I needed to hook all the calls to OutputDebugString to redirect their output to the machine-tracing system, TraceSrv. Since I had already solved this problem with a function called HookImportedFunctionsByName, I just reused it for the multithreaded function hooking. (HookImportedFunctionsByName is part of the ongoing BugslayerUtil.dll that I have been updating with each column.) Figure 1 lists all the functions that DeadlockDetection hooks.
      HookImportedFunctionsByName hooks the specified functions imported by a module by rewriting the module's Import Address Table to call other functions. Since I took the time back in December to make HookImportedFunctionsByName a little more generic, it can hook multiple functions that are imported from a specific module. While dusting off the function, I fixed a small bug: it originally assumed that the caller had done the work to determine if the module to patch was above the 2GB line. This check only needs to be done because of issues pertaining to Windows 95 and Windows 98. Now HookImportedFunctionsByName checks the module value and returns FALSE.
      Choosing this general approach means that DeadlockDetection can get the necessary information because each of the multithreading and synchronization functions will be calling directly into the code with all the information that I need. This will satisfy requirements 1 through 5. The code for hooking needs to be in the same address space as your program. Consequently, I chose to make DeadlockDetection a DLL that can do all of its work in DllMain, so all you need to do is call LoadLibrary on the DLL. This should be easy enough for everyone, so that takes care of requirement 7.
      Making DeadlockDetection as lightweight as possible is a pretty tough requirement. I tried to be efficient in coding everything, but that can only go so far. Figuring that you would know what types of synchronization objects are being used in your program, I grouped the object types so that you can specify what you want to hook. For example, if you are only concerned about deadlock problems on mutexes, you can process only those functions.
      To make DeadlockDetection even more useful and stir up less trouble, you can specify which sets of synchronization object functions to watch on the fly. Additionally, you can even turn DeadlockDetection on and off as many times as needed. You might even want to give your app an accelerator key or special menu option that toggles the whole DeadlockDetection system. Allowing this narrow scope will meet requirement 5 and help with number 7.
      The only requirement left is 6: making the output processing as extensible as possible. I wanted to make sure that I did not hardcode output because there are many interesting ways to slice and dice it. By keeping the main hooking and processing separate from the output code, I can get better code reuse because the only thing being changed, the output side, is much easier to develop than the core side. I called the output portions DeadlockDetection Extensions, or DeadDetExt for short. A DeadDetExt is simply a DLL that has several exported functions that DeadlockDetection looks for and calls when appropriate. This extensibility is important because I want to take advantage of it in the future.
      Now that I have laid out the requirements and shown you how I plan on meeting them, I want to show you how to use DeadlockDetection. If you understand the requirements and how to use this utility, it will be much easier to see how I implemented it.

Using DeadlockDetection
      The first step in using DeadlockDetection is to get DeadlockDetection.dll, its initialization file, and the appropriate DeadDetExt DLL in the same place. The initialization file is a simple INI file that, at a minimum, must specify the name of the DeadDetExt file to load. The following is a sample DeadlockDetection.ini file that loads the supplied TextFileDDExt.dll:

[Initialization] ; The only mandatory value, the name of the DeadDetExt ; file that will handle the output. ExtDll = "TextFileDDExt.dll" ; If this option is 1, DeadlockDetection will ; initialize in its DllMain so that logging can start ; at the earliest possible time. StartInDllMain = 0 ; If StartInDllMain is 1, this key specifies the ; initial options for DeadlockDetection. This is the ; VALUE of the DDOPT_* flags. ; InitialOpts = 0

      As you can see from some of the INI settings, Dead-lockDetection can initialize just by having LoadLibrary called on it. A good bugslayer idea would be to create a backdoor in your application initialization that calls LoadLibrary on the specified DLL name if it sees a special registry key or environment variable. This way you would not need conditional compilation and you have a means of getting other DLLs into your address space cleanly. Of course, all of this assumes that the DLLs you are loading are smart enough to take care of their initialization in their DllMains.
      If you want to initialize DeadlockDetection yourself, all you need to do is call OpenDeadlockDetection when appropriate. OpenDeadlockDetection takes a single parameter, the initial reporting options. Figure 2 lists all of the DDOPT_ flags. Of course, you will want to call OpenDeadlockDetection before your application starts creating threads so all the key information about the synchronization objects can be recorded.
      At any point, you can change the reporting options by calling SetDeadlockDetectionOptions. This takes the same ORd set of flags as the OpenDeadlockDetection function. To see what the current options are, call GetDeadlockDetectionOptions. You can change the current options as many times as you like during your program's execution. If you want to suspend and resume logging, call the SuspendDeadlockDetection and ResumeDeadlockDetection functions.
      Along with this month's source code, I wrote one DeadDetExt DLL, TextFileDDExt.dll. This is a relatively simple extension that records all the information to a text file. When you run DeadlockDetection with TextFileDDExt.dll, it creates a text file in the same directory as the executable program. The text file will use the name of the executable with a .dd extension. For example, if you run SimpTest.exe, the resulting file will be SimpTest.dd. Some sample output from TextFileDDExt.dll is shown in Figure 3. The output shows the information in this order:
The thread ID of the executing thread.
The return address to indicate which of the user's functions called the synchronization function.
Call or return indicator to see before and after actions on specific functions.
The return value of the function if reporting a function return.
The synchronization function name.
The parameter list for the synchronization function. Those items in brackets are the human-readable data. I primarily did string values, but it would be trivial to add more.
      When you run your application and it deadlocks, kill the process and view the output file to see the last synchronization item called. TextFileDDExt.dll keeps the file up to date by flushing the file buffers each time a WaitFor* function or EnterCriticalSection is called.
      A word of caution: if you turn on full logging of all functions, you can generate some extremely large files in no time. One of the test samples, MTGDI, can generate an 11MB file in a minute or two if you create a couple of threads.

Implementation Highlights
      One of my key goals in implementing DeadlockDetection was to make the whole thing as data and table-driven as possible because, when you step back and look at what the DLL does, it is almost identical for each function in Figure 1. The hooked function gets called, determines if its class of functions is being monitored, calls the real function, and (if logging is on for that class) it logs the information and returns. Since I had to write a bunch of hook functions, I also wanted to make it simple because this is an area where many bugs could creep in.
      The best way to show this simplicity is to talk about writing a DeadDetExt DLL. A DeadDetExt DLL must have three exported functions. The first two, DeadDetExtOpen and DeadDetExtClose, are self-explanatory. The interesting function is DeadDetProcessEvent, which is called by each hook function when there is information to write. DeadDetProcessEvent takes a single parameter, a pointer to a DDEVENTINFO structure:

typedef struct tagDDEVENTINFO { // The function identifier. Each function has a // unique value. eFuncEnum eFunc; // The pre-call or past-call indicator. ePrePostEnum ePrePost; // The return address. This is so the calling // function can be found. DWORD dwAddr; // The thread ID of the calling thread. DWORD dwThreadId; // The return value for post calls. DWORD dwRetValue; // The parameter information. When accessing the // parameters, treat them as read only. DWORD dwParams; } DDEVENTINFO, * LPDDEVENTINFO;

      The entire output for the single function that appears in Figure 3 was constructed with the information in the DDEVENTINFO structure. While most of the fields in DDEVENTINFO are self-explanatory, the dwParams field needs a special mention. This is really a pointer to the parameters as they appear in memory. DeadlockDetection is only intercepting __stdcall functions. If you read my June 1998 column, you should see that dwParams is each of the parameters in left-to-right order. I applied a little creative casting to make it easy to convert dwParams.
      In DeadlockDetection.h, I provide typedefs that describe each of the intercepted function parameters. For example, if eFunc was eWaitForSingleObjectEx, then you would cast dwParams to a LPWAITFORSINGLEOBJECTEX_ PARAM to get the parameters. To see all of this in action, check out the TextFileDDExt.dll code in this month's source code.
      While output processing is relatively easy, the hard part is gathering the information. I wanted DeadlockDetection to hook the synchronization functions in Figure 1, but to appear exactly as if the real function had been called. I also wanted to get the parameters and the return value and to write the hook functions in C/C++ easily. It took quite a while with the debugger and the disassembler before I got it right.
      My initial step was to write all the hook functions so they were just pass-through functions and called the real function directly. This worked great. My next step was to get the parameters and the return value for the functions into local variables. While getting the value returned from the real function was simple, I realized that I did not have a clean way to get the return address with my C/C++ hook function. I needed the DWORD right before the current stack pointer. Unfortunately, in straight C/C++ the function prolog had already done its magic by the time I could get control, so the stack pointer had already moved away from where it needed to be.
      You might think that the stack pointer is just offset by the number of local variables, but that is not always the case. The Visual C++® compiler does a pretty good job of optimizing so that it is not in the same place with different optimization flags set. While you might declare a variable as a local, the compiler can optimize it down to a register so it does not even appear on the stack.
      I needed a guaranteed way to get the stack no matter what optimizations were set. At this point, I started thinking naked—no, not me without clothes, but declaring the hook functions as __declspec(naked) and creating my own prolog and epilog code. With this approach, I would have complete control over ESP no matter what optimizations were used. Additionally, getting the return address and parameters is a snap because they are at ESP + 04h and ESP + 08h, respectively. Keep in mind that I am not doing anything out of the ordinary with the prolog and epilog code, so I still do the usual PUSH EBP and MOV EBP , ESP for prolog and MOV ESP , EBP and POP EBP for epilog.
      One downside to doing my own prolog and epilog is that I need to use inline assembler. Since I do not know Alpha assembler as well as I know Intel assembler, I cannot provide a version of DeadlockDetection that works on the Compaq (née Digital) Alpha. If you can port DeadlockDetection to the Alpha and send me the code, I will report about it in a future column and post the code on my Web site.
      Since each of the hook functions was going to be declared as __declspec(naked), I made a couple of macros to handle the prolog and epilog: HOOKFN_PROLOG and HOOKFN_ EPILOG. I also went ahead and declared some common local variables that all hook functions would need in HOOKFN_PROLOG. These included the last error value, dwLastError, and the event information structure to pass to the DeadDetExt DLL, stEvtInfo. The dwLastError is yet another thing that I needed to watch when intercepting functions.
      The Windows API can return a special error code through SetLastError to provide more information if a function fails. This error code can be a real boon because it tells you why an API failed. For example, if GetLastError returns 122, then you know that the buffer parameter was too small. WINERROR.h contains all the error codes the operating system returns. The problem for hook functions is that they can reset the last error as part of their processing. This can wreak havoc if your application relies on this behavior.
      If you call CreateEvent and want to see if the returned handle was truly created or just opened, CreateEvent sets the last error to ERROR_ALREADY_EXISTS if it just opened the handle. Since the cardinal rule of intercepting functions is that you cannot change the behavior of the application, I needed to call GetLastError immediately after the real function call so my hook function could properly set the last error code that the real function returned. The rule in the hook function is that you need to call GetLastError right after you call the real function, and then call SetLastError as the last thing before leaving the hook function.
      At this point, I thought I was pretty much done except for the testing. The first thing I found was a bug: I wasn't preserving ESI and EDI across the hook call even though the documentation on using the inline assembler explicitly stated this. After I fixed this, everything seemed to work fine. When I started doing register comparisons on before, during, and after cases, I noticed that I was not returning the real functions in the EBX, ECX, EDX, and worse yet, the flags registers. While I did not see any problems and the documentation said that those registers did not need to be preserved, I was concerned that there might be something I had not tested. I declared the REGSTATE structure to hold the register values after the real function call so I could put them in that state when my hook function returned. This meant that I needed to create two additional macros, REAL_FUNC_PRE_CALL and REAL_FUNC_POST_ CALL, that must be used around the real call the hook function makes.
      After a little more testing, I found another problem: in release builds with full optimizations, I would have a crash every once in a while. I finally tracked that down to the effect of the optimizer on some of my hook functions. The optimizer was doing the right thing, but not when I needed it to. I was very careful about the register usage in my hook functions and only used EAX or stack memory directly. I found the _DEBUG build code sequence was

C745E802000000 mov dword ptr [ebp-018h],00000002h C745EC02000000 mov dword ptr [ebp-014h],00000002h

and the optimizer was turning it into the following:

6A02 push 002h 5B pop ebx 895DE4 mov dword ptr [ebp-01Ch],ebx 895DE8 mov dword ptr [ebp-018h],ebx

      It is easy to see in the second snippet that the POP into EBX was trashing the register. To avoid the optimizer doing things like this behind my back, I turned it off for all hook function files by placing a

#pragma optimize( "", off )

at the top of each file. This also made debugging easier because very similar code is generated for both release and debug builds.
       Figure 4 shows the final version of DD_Funcs.h, which is the internal header file where all of the special hook function macros are declared. The comment at the top of the file has a sample hook function where all the macros need to be called. I strongly encourage you to step through the SimpTest example that is part of the source code. Make sure that you watch an entire function call at the assembler level because that is the only way you will see the whole thing work.
      There are a couple of other minor points that I want to make about the implementation. First, I made sure that DeadlockDetection.dll and its supporting DLLs, BugslayerUtil.dll and TextFileDDExt.dll, did not use any runtime library resources from the user's application. Since I need to do some memory allocations, if I were to use the user's heap, I would be affecting the application a little too much and the program could do an overrun into DeadlockDetection's data. All of the DLLs that I use link to their own static versions of the RTL and, if need be, use a separate private heap to handle some memory allocations.
      Second, DeadlockDetection is always called by your application even if it is suspended. Instead of hooking and unhooking on the fly, I leave the functions hooked and look at some internal flags to determine how the hook should behave. This makes it easy to toggle different function-logging on the fly, but it does add a bit to the overhead of your application. It seemed error-prone to allow hooking and unhooking on the fly.
      Finally, DeadlockDetection hooks the functions out of a DLL when brought into your program through LoadLibrary. However, it can only gain control after that DLL's DllMain has executed, so if there are any synchronization objects created or used during DllMain, DeadlockDetection can miss them.

What's Next For DeadlockDetection?
      As I mentioned at the beginning of the column, DeadlockDetection is certainly not finished. The first thing that it needs is far more testing. This is the biggest, most time-consuming, and most difficult utility I have written for the Bugslayer column. Second, DeadlockDetection requires further fine-tuning for performance and reliability. Also, additional elements need to be implemented to make DeadlockDetection more useful and powerful. If there are readers who want to take on any of these challenges, go ahead and implement some of the following ideas and I will post the best solutions on my Web site http://www.op.net/docs/RFCs/ien-137 for those who want to know the real story. Andrew Greene jokingly thought that it was a test to see how many folks would be anal-retentive enough to correct me.
      Finally, it seems I am becoming the keeper of all weird IMAGEHLP.dll symbol engine problems. Gene Brumm and Dave Austin found that IMAGEHLP.dll will not work on PDB files if MSPDB50.dll is not in the path.

Tips
      Tip 13 From Dave Angel: in your June column, you mentioned using the debugger as a code coverage tool. Your debugger makes a useful performance tool as well. Run your application under the debugger and when it starts a time-consuming operation, keep hitting the Debug|Break menu to break into your application at regular intervals, and note which function the debugger stops in. This is the same technique that a sampling profiler uses, albeit at a faster pace.
      If you are doing a multithreaded application, make sure to check the current location for each thread. If one function (or a group of related functions) is your bottleneck, this will become obvious very quickly. Sometimes it's helpful to check the top few levels of the call stack, because even if a single function isn't the culprit, children of the function may be, and some attention to the common logic is warranted.
      The bottom line is that you can use this approach to do a sanity check on your program. If you get a different function each time you halt, or if you get the functions you expect, then you understand the code. But if you see something unusual, it's time for a closer look. That closer look may be with a formal performance tool, or it may be with some custom measurement code you insert yourself.
       Tip 14 Since a bunch of managers are going to be spending a lot of money thanks to this column (because they need to go out and buy multiprocessor machines), here is a tip for them I learned through real-world commercial development. One thing that has worked well for me is having two-week code freezes before milestones. When you limit code freezes to a week, you end up fixing bugs and never actually freeze. If you schedule two-week code freezes, the idea is to spend the first week at absolute feature freeze, but allow your developers to fix bugs. The second week is total code freeze for everyone. When you do this, you are scheduling bug fix time and giving yourself enough space to really put your release through the ringer. When I started doing this, my milestone alphas and betas became significantly more stable and useful for external testers.

Have a tricky issue dealing with bugs? Send your questions or bug slaying tips via email to John Robbins: john@jprobbins.com

From the October 1998 issue of Microsoft Systems Journal.