Under the Hood, MSJ September 1999



Code for this article: Sept99Hood.exe (5KB)
Matt Pietrek does advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site at http://www.tiac.net/users/mpietrek has a FAQ page and information on previous columns and articles.

In recent columns for MSJ (June 1999), I've discussed COM type libraries and database access layers such as ActiveX® Data Objects (ADO) and OLE DB. Longtime readers of my MSJ writings (both of them) probably think I've gone soft. To redeem myself, this month I'll tour part of the Windows NT® loader code where the operating system and your code come together. I'll also demonstrate a nifty trick for getting status information out of the loader, and a related trick you can use in the Developer Studio® debugger.
      Consider what you know about EXEs, DLLs, and how they're loaded and initialized. You probably know that when a C++ DLL is loaded, its DllMain function is called. Think about what happens when your EXE implicitly links to some set of DLLs (for example, KERNEL32.DLL and USER32.DLL). In what order will those DLLs be initialized? Is it possible for one of your DLLs to be initialized before another DLL that you depend on? The Platform SDK has this to say in the "Dynamic-Link Library Entry-Point Function" section:

      Your function should perform only simple initialization tasks, such as setting up thread local storage (TLS), creating synchronization objects, and opening files. It must not call the LoadLibrary function, because this may create dependency loops in the DLL load order. This can result in a DLL being used before the system has executed its initialization code. Similarly, you must not call the FreeLibrary function in the entry-point function, because this can result in a DLL being used after the system has executed its termination code.
      Calling Win32® functions other than TLS, synchronization, and file functions may also result in problems that are difficult to diagnose. For example, calling User, Shell, and COM functions can cause access violation errors, because some functions in their DLLs call LoadLibrary to load other system components.

      Something I've learned firsthand is that the above documentation is still way too vague. For example, reading a registry key is a natural thing you'd want to do inside your DllMain function. It certainly qualifies as initialization. Unfortunately, in the right circumstances ADVAPI32.DLL isn't initialized before your DllMain code, and the registry APIs will just fail.
      Given the stern warning about using LoadLibrary in the documentation, it's especially interesting that the Windows NT USER32.DLL explicitly ignores the preceding advice. You may be aware of a Windows NT-only registry key called AppInit_Dlls that loads a list of DLLs into each process. It turns out that the actual loading of these DLLs occurs as part of USER32's initialization. USER32 looks at this registry key and calls LoadLibrary for these DLLs in its DllMain code. A little thought here reveals that the AppInit_Dlls trick doesn't work if your app doesn't use USER32.DLL. But I digress.
      My point in bringing this up is that DLL loading and initialization is still a gray area. In most cases, a simplified view of how the OS loader works is sufficient. In those oddball 5 percent of cases, however, you can go nuts unless you have a more detailed working model of how the OS loader behaves.

Load 'er Up!
      What most programmers think of as module loading is actually two distinct steps. Step one is to map the EXE or DLL into memory. As this occurs, the loader looks at the Import Address Table (IAT) of the module and determines whether the module depends on additional DLLs. If the DLLs aren't already loaded in that process, the loader maps them in as well. This procedure recurses until all of the dependent modules have been mapped into memory. A great way to see all the implicitly dependent DLLs for a given executable is the DEPENDS program from the Platform SDK.
      Step two of module loading is to initialize all of the DLLs. Stop and ponder this. While the OS loader is mapping the EXE and/or DLLs into memory in step one, it's not calling the initialization routines. The initialization routines are called after all the modules have been mapped into memory. Key point: the order in which DLLs are mapped into memory is not necessarily the same as the order in which the DLLs are initialized. I've seen people look at the DLL mapping notifications as they appear in the Developer Studio debugger and mistakenly assume that the DLLs were initialized in that same order.
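      The gap between mapping order and initialization order can be sketched with a toy model. This is a minimal simulation, not the loader's real algorithm: the ImportGraph type, the function names, and the "initialize imports first" policy are all assumptions made for illustration, since the actual ordering rules aren't documented.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical import graph: module name -> modules it implicitly links to.
using ImportGraph = std::map<std::string, std::vector<std::string>>;

struct LoadOrders {
    std::vector<std::string> mapped;       // step one: order of mapping
    std::vector<std::string> initialized;  // step two: order of entry point calls
};

// Step one: map a module, then recurse into anything it imports.
static void MapRecursive(const ImportGraph& graph, const std::string& name,
                         std::set<std::string>& seen,
                         std::vector<std::string>& mapped) {
    if (!seen.insert(name).second) return;  // already mapped into this process
    mapped.push_back(name);
    auto it = graph.find(name);
    if (it == graph.end()) return;
    for (const auto& dep : it->second) MapRecursive(graph, dep, seen, mapped);
}

// Step two (ASSUMED policy): initialize a module's imports before the module.
static void InitRecursive(const ImportGraph& graph, const std::string& name,
                          std::set<std::string>& done,
                          std::vector<std::string>& initialized) {
    if (!done.insert(name).second) return;  // also stops dependency loops
    auto it = graph.find(name);
    if (it != graph.end())
        for (const auto& dep : it->second)
            InitRecursive(graph, dep, done, initialized);
    initialized.push_back(name);
}

// Runs both steps for one root module and reports the two orders.
LoadOrders SimulateLoad(const ImportGraph& graph, const std::string& root) {
    LoadOrders out;
    std::set<std::string> seen, done;
    MapRecursive(graph, root, seen, out.mapped);
    for (const auto& name : out.mapped)
        InitRecursive(graph, name, done, out.initialized);
    assert(out.mapped.size() == out.initialized.size());
    return out;
}
```

      With FOO.DLL importing BAR.DLL and BAZ.DLL, and BAR.DLL importing BAZ.DLL, this model maps FOO, BAR, BAZ but initializes BAZ, BAR, FOO. Reading the mapping notifications as the initialization order would get it exactly backward here.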
      In Windows NT, the routine that invokes the entry point of EXEs and DLLs is called LdrpRunInitializeRoutines, and it's worth taking a look at here. In my own work, I've stepped through the assembler code for LdrpRunInitializeRoutines many times. However, looking at a ream of assembler code isn't the best way to understand it. Therefore, I rewrote LdrpRunInitializeRoutines from Windows NT 4.0 SP3 in C++-like pseudocode, with the results shown in Figure 1. To be completely accurate, in NTDLL.DBG the routine name is __stdcall mangled to _LdrpRunInitializeRoutines@4. Also, in my pseudocode, unless a variable or structure name is prefixed with an underscore, it was a name I made up.
      LdrpRunInitializeRoutines is the final stop in the Windows NT loader code before an EXE's or DLL's specified entry point is called. (In the following discussion, I'll use "entry point" and "initialization routine" interchangeably.) This loader code executes in the context of the process that loaded the DLL; that is, it's not part of some special loader process. LdrpRunInitializeRoutines is called at least once during process startup to handle implicitly loaded DLLs. LdrpRunInitializeRoutines is also called every time one or more DLLs are dynamically loaded, usually because of a call to LoadLibrary.
      Each time LdrpRunInitializeRoutines executes, it seeks out and calls the entry point of all DLLs that have been mapped into memory, but not yet initialized. In examining the pseudocode, take note of all the extra code that provides trace output, even in the nonchecked builds of Windows NT. I'm referring to all the code that uses the _ShowSnaps variable and the _DbgPrint function. I'll come back to these players later.
      At a high level, the function breaks up into four distinct sections. The first portion of the code calls _LdrpClearLoadInProgress. This NTDLL function returns the number of DLLs that have just been mapped into memory. For example, if you called LoadLibrary on FOO.DLL and FOO had implicit links to BAR.DLL and BAZ.DLL, _LdrpClearLoadInProgress would return 3 since three DLLs were mapped into memory.
      After the number of DLLs to be concerned with is known, LdrpRunInitializeRoutines calls _RtlAllocateHeap (also known as HeapAlloc) to get memory for an array of pointers. In the pseudocode I've called this array pInitNodeArray. Each pointer in pInitNodeArray will eventually point to a structure containing information about the newly loaded (but not yet initialized) DLL.
      In the second part of LdrpRunInitializeRoutines, the code digs into internal process data structures to obtain a linked list containing each of the newly loaded DLLs. As the code iterates through the linked list, it checks to see if the loader has somehow seen this DLL before (not likely). It also checks to ensure that the DLL has an entry point. If both tests are passed, the code appends the module information pointer to the pInitNodeArray. The pseudocode refers to the module information as pModuleLoaderInfo. Note that it's entirely possible for a DLL to not have an entry point (a resource-only DLL, for example). Thus, the number of entries in pInitNodeArray may be fewer than the value returned earlier by _LdrpClearLoadInProgress.
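      The filtering done in the first two sections can be sketched as follows. This is an illustration only: the ModuleLoaderInfo layout, its field names, and BuildInitNodeArray are invented here and don't reflect the real NTDLL structures, and a std::vector stands in for the loader's linked list and heap-allocated array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the loader's per-module record.
struct ModuleLoaderInfo {
    std::string name;
    const void* entryPoint;   // null for, say, a resource-only DLL
    bool alreadyProcessed;    // the loader has seen this module before
};

// Collects the modules whose entry points still need to be called.
// newlyMappedCount plays the role of the value returned by
// _LdrpClearLoadInProgress and is used to size the array up front.
std::vector<const ModuleLoaderInfo*> BuildInitNodeArray(
        const std::vector<ModuleLoaderInfo>& modules,
        std::size_t newlyMappedCount) {
    std::vector<const ModuleLoaderInfo*> initNodes;
    initNodes.reserve(newlyMappedCount);
    for (const auto& module : modules) {
        if (module.alreadyProcessed) continue;       // nothing left to do
        if (module.entryPoint == nullptr) continue;  // no routine to call
        initNodes.push_back(&module);
    }
    // The array can hold fewer entries than were mapped, never more.
    assert(initNodes.size() <= newlyMappedCount);
    return initNodes;
}
```

      Given two newly mapped DLLs where one is resource-only, the returned array has a single entry even though the mapped count is 2.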
      The third (and largest) section of LdrpRunInitializeRoutines is where things really start to happen. The code's mission here is to enumerate through each element in pInitNodeArray and call the entry point. Because of the very real possibility that a DLL's initialization code may fault, the entire third section of code is surrounded by a __try block. This is why a dynamically loaded DLL can fault in its DllMain without bringing the whole process down.
      Iterating through an array and calling an entry point for each node should be a small task. However, some relatively obscure features of Windows NT add to the complexity. For starters, consider whether the process is being debugged by a Win32 debugger such as MSDEV.EXE. Windows NT has an option that allows you to suspend a process and send control to the debugger before a DLL is initialized. This feature is on a per-DLL basis, and is enabled by adding a string value (BreakOnDllLoad) to a registry key with the name of the DLL (for instance, FOO.DLL). See the pseudocode comment above the call to _LdrQueryImageFileExecutionOptions in Figure 1 for more information.
      Another bit of extra code that may execute before a DLL's entry point invocation is the TLS initialization. When you declare TLS variables using __declspec(thread), the linker includes data that causes this condition to be triggered. Right before the DLL's entry point is called, LdrpRunInitializeRoutines checks to see if a TLS initialization is necessary and, if so, calls _LdrpCallTlsInitializers. More on this later.
      The moment of truth finally comes when LdrpRunInitializeRoutines calls the DLL's entry point. I deliberately left this part of the pseudocode in assembly language. You'll see why later. The crucial instruction is CALL EDI. Here, EDI points to the DLL's entry point, which is specified in the DLL's PE header. When CALL EDI returns, the DLL in question has completed its initialization. For DLLs written in C++, this means that the DllMain code has executed its DLL_PROCESS_ATTACH code. Also, note the third parameter to the entry point, normally referred to as pvReserved. In truth, this parameter is nonzero for DLLs that the EXE implicitly links to directly or through another DLL. The third parameter is zero for all other DLLs (that is, DLLs loaded as a result of a LoadLibrary call).
      After the DLL entry point is invoked, LdrpRunInitializeRoutines does a sanity check to make sure the DLL entry point code was defined properly. The loader code looks at the stack pointer (ESP) value from before and after the entry point call. If they're different, something's wrong with the DLL's initialization function. Since most programmers never define the real DLL entry point function, this scenario rarely happens. However, when it does, you're informed of the problem via an onerous dialog (see Figure 2). I had to use a debugger and modify a register value at just the right spot to produce this dialog.

Figure 2 An Invalid DLL Entry Point


      Following the stack check, LdrpRunInitializeRoutines checks the return code from the entry point routine. For C++ DLLs, this is the value returned from DllMain. If the DLL returned 0, it usually means something is wrong and that the DLL doesn't want to remain loaded. When this happens, you get the dreaded "DLL Initialization Failed" dialog.
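      The entry point call and the return code check can be modeled in portable C++. In this sketch the type alias, SampleDllMain, CallEntryPoint, and the marker variable are hypothetical; on Windows the parameters would be HINSTANCE, DWORD, and LPVOID, the return type BOOL, and the routine would use the __stdcall convention. The only behavior taken from the text is that the third argument is nonzero for implicitly linked DLLs and that a zero return signals failure.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a DLL entry point signature.
using EntryPoint = int (*)(void* hModule, std::uint32_t reason, void* reserved);
constexpr std::uint32_t kDllProcessAttach = 1;  // DLL_PROCESS_ATTACH

// Set by SampleDllMain so a caller can see how the module was loaded.
static bool g_loadedImplicitly = false;

// A DllMain-shaped routine that inspects the third parameter.
int SampleDllMain(void* /*hModule*/, std::uint32_t reason, void* reserved) {
    if (reason == kDllProcessAttach)
        g_loadedImplicitly = (reserved != nullptr);
    return 1;  // nonzero: initialization succeeded
}

// Models the loader's call: a nonzero third argument for implicit loads,
// zero for LoadLibrary-style loads. Returns false when the entry point
// reports failure by returning 0.
bool CallEntryPoint(EntryPoint entry, void* hModule, bool implicitlyLinked) {
    assert(entry != nullptr);  // modules without an entry point never get here
    static char marker;        // any non-null address will do for this sketch
    void* reserved = implicitlyLinked ? static_cast<void*>(&marker) : nullptr;
    return entry(hModule, kDllProcessAttach, reserved) != 0;
}
```

      A false result from CallEntryPoint corresponds to the case where the real loader would put up the "DLL Initialization Failed" dialog.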
      The final portion of the third section of LdrpRunInitializeRoutines occurs after all the DLLs have been initialized. If the process EXE itself has TLS data, and if the implicitly linked DLLs are being initialized, the code calls _LdrpCallTlsInitializers.
      The fourth (and final) section of LdrpRunInitializeRoutines is the cleanup code. Remember earlier, when _RtlAllocateHeap created the pInitNodeArray? This memory needs to be freed, which occurs inside a __finally block. Even if one of the DLLs faults in its initialization code, the __try/__finally code ensures that _RtlFreeHeap is called to free pInitNodeArray.
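      The cleanup guarantee has a rough analogue in standard C++. The sketch below swaps in a different technique: it uses a destructor (RAII) and ordinary C++ exceptions rather than the compiler's __try/__finally, which is a Microsoft extension that also covers hardware faults such as access violations. The class name, the function name, and the freed flag are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Owns the scratch array; the destructor plays the part of the __finally block.
class InitNodeArray {
public:
    InitNodeArray(std::size_t count, bool& freedFlag)
        : nodes_(new std::size_t[count]), freed_(freedFlag) { freed_ = false; }
    ~InitNodeArray() { delete[] nodes_; freed_ = true; }
    InitNodeArray(const InitNodeArray&) = delete;
    InitNodeArray& operator=(const InitNodeArray&) = delete;
    std::size_t& operator[](std::size_t i) { return nodes_[i]; }
private:
    std::size_t* nodes_;
    bool& freed_;  // hypothetical flag so a caller can observe the cleanup
};

// Calls each routine in turn. If one throws, the exception propagates to the
// caller, but the scratch array is still released on the way out.
void RunInitRoutines(const std::vector<std::function<void()>>& routines,
                     bool& arrayFreed) {
    InitNodeArray nodes(routines.size(), arrayFreed);
    for (std::size_t i = 0; i < routines.size(); ++i) nodes[i] = i;
    for (std::size_t i = 0; i < routines.size(); ++i) {
        const auto& routine = routines[nodes[i]];
        assert(static_cast<bool>(routine));  // every queued module has an entry point
        routine();
    }
}
```

      If the second of three routines throws, the third is never called, yet the flag still reports that the array was freed.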
      This ends our tour of LdrpRunInitializeRoutines, so let's now look at some side topics that the code presents.

Debugging Initialization Routines
      Every once in a while I come across a problem where a DLL is faulting in its initialization code. Unfortunately, the fault could be from any one of several DLLs, and the operating system doesn't tell me which DLL is the culprit. In these circumstances you can get sneaky and use a debugger breakpoint to narrow down the problem.
      Most debuggers blithely skip past the initialization of statically linked DLLs. They're focused on getting you to the first instruction or first line in your EXE. However, knowing what LdrpRunInitializeRoutines looks like, you can set a breakpoint on the CALL EDI instruction where execution goes to the DLL entry point. Once the breakpoint is set, each time a DLL is about to get its DLL_PROCESS_ATTACH notification you'll stop in NTDLL at the CALL instruction. Figure 3 shows what this looks like in the Visual C++® 6.0 IDE (MSDEV.EXE).

Figure 3 Setting a Breakpoint on CALL EDI


If you choose to step into the CALL, you'll end up at the first instruction of the entry point of the DLL. It's important to understand that this code is almost never code you write. Rather, it's usually code in the runtime library that does its setup work and then calls your initialization code. For example, in a DLL written in Visual C++, the entry point is _DllMainCRTStartup, which is in CRTDLL.C. Without symbol tables or source code, what you'll see in the MSDEV assembly window will look something like Figure 4.

Figure 4 Stepping into the CALL


      Usually my debugging process follows a predictable pattern. Step one is to figure out which DLL is faulting. Do this by setting the aforementioned breakpoint, and make one instruction step into each DLL as it initializes. Using the debugger, figure out which DLL you're in, and write it down. One way to do this is to use the memory window to look on the stack (ESP) and obtain the HMODULE of the DLL you've entered.
      After you know which DLL you've entered, let the process continue (Go). In short order, the breakpoint should be hit again for the next DLL. Repeat this as often as necessary until you identify the problem DLL. You'll recognize the problem DLL because it will be called to initialize, but the process terminates before the initialization code returns.
      Step two is to drill into the faulting DLL. If the offending DLL is one that you have source for, try setting a breakpoint on your DllMain code, then let the process run to see if your breakpoint is hit. If you don't have source, just run the process. Your breakpoint on the CALL EDI instruction should still be in place from before. Keep running until you get to the one where the initialization faults. Step into this entry point, and keep stepping until you can ascertain the problem. This may require stepping through a lot of assembly code! I never said this was easy, but sometimes it's the only way to hunt the problem down.
      Finding the CALL EDI instruction can be tricky (at least with the current Microsoft® debuggers). You can see why I deliberately left this part of the pseudocode in assembler. For starters, you'll definitely need to have the correct NTDLL.DBG in your SYSTEM32 directory, alongside NTDLL.DLL. The debugger should automatically load the symbol table when you begin stepping through your program.
      Using the assembly window in Visual C++, you can (in theory) go to an address using a symbolic name. Here, you'd want to go to _LdrpRunInitializeRoutines@4 and then scroll down until you see the CALL EDI instruction. Unfortunately, the Visual C++ debugger doesn't recognize NTDLL symbol names unless you're already stopped in NTDLL.DLL.
      If you happen to know the address of _LdrpRunInitializeRoutines@4 (for instance, 0x77F63242 in Windows NT 4.0 SP 3 for Intel), you can type that in and the assembly window will happily display it. Heck, the IDE will even show you that it's the start of a function called _LdrpRunInitializeRoutines@4. If you're not a debugger guru, the failure to recognize the symbol name is extremely confusing. If you are a debugger nut like me, it's extremely annoying because you know what's causing the problem.
      WinDBG from the Platform SDK is a little better about recognizing symbol names. Once you've started the target process, you can set a breakpoint on _LdrpRunInitializeRoutines@4 using its name. Unfortunately, the first time you execute the process, execution blows past _LdrpRunInitializeRoutines@4 before you get a chance to set your breakpoint. To remedy this, start WinDBG, make one instruction step, set the breakpoint, stop debugging, and remain in the debugger. You can then restart the debuggee and the breakpoint will be hit on every invocation of _LdrpRunInitializeRoutines@4. This same trick works in the Visual C++ debugger.

What's This ShowSnaps Thing?
      One of the first things that jumped out at me when I looked at the LdrpRunInitializeRoutines code was the _ShowSnaps global variable. Now's a good time to briefly divert to the subject of the GlobalFlag and GFlags.EXE.
      The Windows NT registry contains a DWORD value that influences certain behaviors of system code. Most of these modifications are heap- and debugging-related. The GlobalFlag value of the registry key

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager

is a set of bitfields. Knowledge Base article Q147314 describes most of the bitfields, so I won't go into them here. In addition to this systemwide GlobalFlag value, individual executables can have their own distinct GlobalFlag value. The process-specific GlobalFlag value is found under

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\imagename

where imagename is the name of an executable (for instance, WinWord.exe). All of these documentation-challenged bitfields and highly nested registry keys scream out for a program to simplify it all. Wouldn't you know it, Microsoft has just such a program.

Figure 5 GFlags.EXE


      Figure 5 shows GFlags.EXE, which comes in the Windows NT 4.0 Resource Kit. Near the top-left of the GFlags window are three radio buttons. Selecting either of the top two (System Registry or Kernel Mode) lets you make changes to the global Session Manager value of GlobalFlag. If you select the third radio button (Image File Options), the set of available option checkboxes shrinks dramatically. This is because some of the GlobalFlag options only affect kernel mode code and don't make sense on a per-process basis. It's important to note that most of the kernel mode-only options assume you're using a system-level debugger such as i386kd. Without such a debugger to poke around or receive the output, there's not much use in enabling these options.
      To tie this back to the subject of _ShowSnaps, enabling the Show Loader Snaps option in GFlags causes the _ShowSnaps variable to be set to a nonzero value in NTDLL.DLL. In the registry, this bit value is 0x00000002, which is #defined as FLG_SHOW_LDR_SNAPS. Luckily, this bitflag is one of the GlobalFlag values that can be enabled on a per-process basis. The output can be quite voluminous if you enable it systemwide.
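      Since GlobalFlag is a plain set of bitfields, testing or toggling the loader snaps bit is ordinary bit arithmetic. A minimal sketch follows; the helper function names are invented, and only the 0x00000002 value and the FLG_SHOW_LDR_SNAPS name come from the text. It works on a DWORD-sized value you've already read, and doesn't touch the registry.

```cpp
#include <cstdint>

// Bit value quoted above for the Show Loader Snaps option.
constexpr std::uint32_t FLG_SHOW_LDR_SNAPS = 0x00000002;

// True if the given GlobalFlag value has loader snaps turned on.
constexpr bool ShowLoaderSnapsEnabled(std::uint32_t globalFlag) {
    return (globalFlag & FLG_SHOW_LDR_SNAPS) != 0;
}

// Returns globalFlag with the loader snaps bit set or cleared,
// leaving every other bit untouched.
constexpr std::uint32_t SetShowLoaderSnaps(std::uint32_t globalFlag, bool enable) {
    return enable ? (globalFlag | FLG_SHOW_LDR_SNAPS)
                  : (globalFlag & ~FLG_SHOW_LDR_SNAPS);
}

static_assert(ShowLoaderSnapsEnabled(SetShowLoaderSnaps(0, true)),
              "setting the bit should make the check succeed");
```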

Examining ShowSnaps Output
      Let's take a look at what sort of output enabling Show Loader Snaps produces. It turns out that other parts of the Windows NT loader that I haven't discussed also check this variable and emit additional output. Figure 6 shows some abbreviated output from running CALC.EXE. To get the text, I first used GFlags to turn on Show Loader Snaps for CALC.EXE. Next, I ran CALC.EXE under the control of MSDEV.EXE and captured the output from the debug pane.
      In Figure 6, note that all the output originating in NTDLL is preceded by an LDR: prefix. Other lines in the output (for instance, "Loaded symbols for XXX") were inserted by the MSDEV process. In looking at the LDR: output, there's a wealth of information. For example, as the process starts, the complete path to the EXE is given, along with the current directory and search path.
      As NTDLL loads each DLL and fixes up imported functions, you'll see lines like this:

LDR: ntdll.dll used by SHELL32.dll
LDR: Snapping imports for SHELL32.dll from ntdll.dll

The first line decrees that SHELL32.DLL links to APIs in NTDLL. The second line shows that the imported NTDLL APIs are being "snapped." When an executable module imports functions from another DLL, an array of function pointers resides in the importing module. This array of function pointers is known as the IAT. One of the loader's jobs is to locate the addresses of the imported functions and punch them into the IAT. Hence, the term "snapping" in the LDR: output.
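      Snapping can be pictured with a toy model: one list of imported names, a parallel array of address slots, and an export table to look the names up in. This is a simplified sketch, not the PE import format. The ExportTable type and SnapImports function are hypothetical, and an unresolved import is simply left null here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Address = const void*;
// Hypothetical export table: exported name -> address in the exporting DLL.
using ExportTable = std::map<std::string, Address>;

// Fills each IAT slot with the address of the matching export.
// Returns how many imports were snapped; unresolved slots stay null.
std::size_t SnapImports(const std::vector<std::string>& importNames,
                        std::vector<Address>& iat,
                        const ExportTable& exports) {
    assert(importNames.size() == iat.size());  // one IAT slot per imported name
    std::size_t snapped = 0;
    for (std::size_t i = 0; i < importNames.size(); ++i) {
        auto it = exports.find(importNames[i]);
        if (it == exports.end()) continue;  // no such export in this DLL
        iat[i] = it->second;                // "punch" the address into the IAT
        ++snapped;
    }
    return snapped;
}
```

      Once the slots are filled, calls from the importing module go indirectly through the array, which is why the loader only has to patch one table rather than every call site.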
      Another interesting set of lines in the output shows bound DLLs being handled:

LDR: COMCTL32.dll bound to KERNEL32.dll
LDR: COMCTL32.dll has correct binding to KERNEL32.dll

In previous columns, I've talked about the binding process done by BIND.EXE or the BindImageEx API in IMAGEHLP.DLL. Binding an executable to a DLL is the act of looking up the address of the imported APIs and writing them to the importing executable. This speeds up the loading process since the imported addresses don't have to be looked up at load time.
      The first line in the above output indicates that COMCTL32.DLL is bound against APIs in KERNEL32.DLL. The second line indicates that the bound addresses are correct. The loader does this by comparing timestamps. If the timestamps don't match, the binding is invalid. In this case, the loader has to look up the imported addresses just as if the executable hadn't been bound in the first place.
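      The timestamp comparison boils down to a single equality test. In this sketch the two structs, the enum, and CheckBinding are invented names, and the model covers only the timestamp comparison described above; it doesn't attempt to show where those timestamps are stored in the executable.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical record written into the importing module at bind time.
struct BoundImport {
    std::string dllName;
    std::uint32_t timeStamp;  // timestamp of the DLL that was bound against
};

// Hypothetical description of the DLL actually found at load time.
struct LoadedDll {
    std::string name;
    std::uint32_t timeStamp;
};

enum class ImportResolution { UseBoundAddresses, LookUpAtLoadTime };

// Decides whether the prerecorded addresses can be trusted.
ImportResolution CheckBinding(const BoundImport& bound, const LoadedDll& dll) {
    assert(bound.dllName == dll.name);  // comparing records for the same DLL
    return bound.timeStamp == dll.timeStamp
               ? ImportResolution::UseBoundAddresses
               : ImportResolution::LookUpAtLoadTime;
}
```

      A mismatch isn't an error. It just means the load falls back to the slower path of looking up each imported address.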

TLS Initialization
      I'll finish up this column by showing pseudocode for one other routine. In LdrpRunInitializeRoutines, right before the module's entry point is called, NTDLL checks to see if the module needs TLS initialization. If so, it calls LdrpCallTlsInitializers. Figure 7 shows my pseudocode for this routine.
      The code in LdrpCallTlsInitializers is simple enough. Located in the PE header is an offset (RVA) to an IMAGE_TLS_DIRECTORY structure (defined in WINNT.H). The code calls RtlImageDirectoryEntryToData to obtain a pointer to this structure. Within the IMAGE_TLS_DIRECTORY structure is a pointer to an array of callback function addresses to be invoked. These functions are prototyped as IMAGE_TLS_CALLBACK functions, which are defined in WINNT.H. Not coincidentally, the TLS initialization callbacks happen to look just like a DllMain function. For what it's worth, when using __declspec(thread) variables, Visual C++ emits data that causes this routine to be invoked. However, no actual callbacks are currently defined by the runtime library, so the array of function pointers is a single NULL entry.
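      Walking a null-terminated callback array looks roughly like this. It's a portable sketch rather than the real routine: the TlsCallback alias mirrors the three-parameter, DllMain-like shape described above, while CallTlsInitializers as written here, its return value, and the constant name are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for the DllMain-like TLS callback signature.
using TlsCallback = void (*)(void* dllHandle, std::uint32_t reason, void* reserved);
constexpr std::uint32_t kDllProcessAttach = 1;  // DLL_PROCESS_ATTACH

// Invokes every callback up to the terminating null entry.
// Returns the number of callbacks that were called.
std::size_t CallTlsInitializers(const TlsCallback* callbacks,
                                void* dllHandle, std::uint32_t reason) {
    if (callbacks == nullptr) return 0;  // module has no callback array
    std::size_t called = 0;
    while (*callbacks != nullptr) {
        (*callbacks)(dllHandle, reason, nullptr);
        ++callbacks;
        ++called;
    }
    assert(*callbacks == nullptr);  // stopped on the terminating null entry
    return called;
}
```

      For an array holding nothing but the single NULL entry, the loop body never runs and the function returns 0.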

Conclusion
      This wraps up my coverage of Windows NT module initialization. Obviously, I have skipped or skimmed over a lot of related material. For example, what is the algorithm for determining the order in which the modules will be initialized? The algorithm Windows NT uses has changed at least once, and it would be nice to have a Microsoft technical note that at least gives some guidelines. Likewise, I haven't covered the mirror image topic: module unloading. However, I hope this glimpse into the inner workings of the Windows NT loader has provided you with material for further exploration.

Have a suggestion for Under the Hood? Send it to Matt at mpietrek@tiac.com or http://www.tiac.com/users/mpietrek.

From the September 1999 issue of Microsoft Systems Journal.