This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.


December 1998


Under the Hood

Download dec98hood.exe (94KB)

Matt Pietrek does advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site at http://www.tiac.net/users/mpietrek has a FAQ page and information on previous columns and articles.
 
One of the coolest new features in Visual C++® 6.0 is the /DELAYLOAD linker option. Executables that use the /DELAYLOAD option don't implicitly link to the DLLs that you specify with /DELAYLOAD. Instead, the DLL isn't loaded until the first call to one of its functions. While you can achieve a similar effect by using LoadLibrary and GetProcAddress, the new /DELAYLOAD capability is much more seamless. No source code changes are required; just make a few changes to the linker options in your makefile or project settings and you're finished.
What's the advantage of waiting to load a DLL until the first time it's used? Two reasons immediately come to mind. First, your program may need to run on multiple Win32®-based platforms, and you use a DLL that's not available on one of the platforms. It's not enough to check which platform you're running on and avoid calling the APIs that aren't available. If you implicitly linked with that DLL's import library, normally your program wouldn't run in environments that don't have that DLL.
For instance, consider PSAPI.DLL, which I've written about in previous columns. PSAPI.DLL doesn't exist on Windows® 9x, so even if I didn't call its APIs while running on Windows 9x, the Windows 9x loader would still refuse to load my program since it can't resolve the implicit links to the PSAPI APIs. The /DELAYLOAD option is perfect for this scenario. If you don't call into the DLL, it won't be loaded.
The second obvious benefit of using /DELAYLOAD is reduced load time and initialization. Let's say your program wants to call APIs in some DLL, but only in relatively rare circumstances. If you defer loading those DLLs unless they're actually called, you'll cut down on the loader's work. Likewise, the initialization code in those DLLs won't execute unless the DLL is actually used.
In fact, the Microsoft® linker (LINK.EXE) now makes use of /DELAYLOAD for just this purpose. A little-known feature of the linker is that it can disassemble OBJs and executables (for example, link -dump /disasm foo.exe). LINK.EXE relies on a separate DLL (for instance, MSDIS100.DLL) to do the actual disassembly. The vast majority of the time LINK.EXE isn't used for disassembly. You'll find that LINK.EXE from Visual C++ 6.0 uses the /DELAYLOAD capability to load the disassembly DLL only as needed.
In my September 1998 column, I briefly touched upon /DELAYLOAD and said that it was a feature of Windows NT 5.0. I was wrong (not the first time that's happened). Using /DELAYLOAD requires no operating system support. In fact, I'm surprised that Microsoft or another compiler vendor didn't come up with this technique long ago. What confused me was that a new PE file DataDirectory #define (IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT) was added to WINNT.H. The operating system itself doesn't use this DataDirectory entry. However, this entry is the only way that tools such as DUMPBIN or my PEDUMP program could locate the delay load information reliably.
When I first experimented with executables built using /DELAYLOAD, I was struck by two things. First, to my knowledge this is the first time that the Microsoft linker has actively generated code. As I described in my July 1997 column, the linker typically combines raw data from OBJ and LIB files to produce the sections of the final executable. When using /DELAYLOAD, the linker actually generates small stubs of code and inserts them alongside the compiler-generated code.
The second thing that intrigued me about /DELAYLOAD executables is the similarity to Visual Basic®. When you call an API from Visual Basic, the executable doesn't implicitly link to the target DLL. Rather, the Visual Basic code generator creates data describing the API call name and DLL, then makes a jump to a small stub. The first time the stub is called, it uses the data to call LoadLibrary and GetProcAddress. The address returned by GetProcAddress is stored in the stub so that future calls go directly (almost) to the API's code.
The /DELAYLOAD mechanism is conceptually similar to what Visual Basic does in that it causes a small bit of data and a code stub to be generated. Similarly, the stub calls LoadLibrary and GetProcAddress. However, the actual implementation of the stub is quite different, and subsequent calls to the API are even faster than the Visual Basic method. (Under Visual Basic, the call to an API still goes through a JMP instruction to get to the API's code.)
Having gushed long enough about /DELAYLOAD, let's see how you'd use it in your own projects. To implement /DELAYLOAD, two changes are necessary to your linker options. First, you'll need to add the DELAYIMP.LIB file to your list of libraries. DELAYIMP.LIB contains the code that calls LoadLibrary and GetProcAddress. DELAYIMP.LIB can be found in the Visual C++ 6.0 \LIB directory.
The second change to the linker command line is to add /DELAYLOAD:<DLLNAME>, where <DLLNAME> is the name of the DLL that you want to be loaded, such as:
 /DELAYLOAD:COMCTL32.DLL 
If you need to delay load multiple DLLs, it's fine to use multiple /DELAYLOAD fragments. Just because you're delay-loading the DLL doesn't mean that you can exclude the DLL's LIB file though. The linker needs the information in the DLL's LIB file to resolve the fixups properly and generate the delay-loading stubs. If you want to verify that you did everything correctly, DUMPBIN.EXE now has a /DEPENDENTS option that tells you which DLLs are implicitly linked against and which are delay-loaded.

/DELAYLOAD Nuts and Bolts
In a nice bit of design and implementation, Visual C++ not only makes the /DELAYLOAD mechanism extensible (and even user-replaceable), it also provides the source for the default implementation of the /DELAYLOAD code. The Visual C++ 6.0 \INCLUDE directory contains two files: DELAYHLP.CPP and DELAYIMP.H. In DELAYHLP.CPP, the key function to look at (which takes up about two-thirds of the file) is __delayLoadHelper. The __delayLoadHelper code can look a little intimidating at first glance. A lot of its complexity has to do with error checking and notifications, which I'll explain momentarily.
To help you understand what __delayLoadHelper does, I've distilled its primary operations down to the pseudocode that appears in Figure 1. The pseudocode assumes everything goes off without a hitch—without any errors, and without any notification callouts (that is, hook functions).
In studying __delayLoadHelper, it's essential to understand that the linker creates a pseudo Import Address Table (IAT) and Import Name Table (INT) for each delay-loaded DLL. There's one IAT and INT per referenced DLL. Each IAT and INT entry represents one imported function in that DLL. Note that the IAT and INT data structures also happen to be used by regular DLL imports. In the case of /DELAYLOAD, the operating system doesn't know (or care) that the executable has additional pseudo import tables. Rather, the choice to store the delay import tables in the same format as normal imports is solely a matter of convenience for the linker and __delayLoadHelper function.
In the pseudocode, beyond the calls to LoadLibrary and GetProcAddress, note that the rest of the code manipulates the pseudo IAT and INT. The purpose is to optimize subsequent calls to the API. The first time an API is called, the IAT entry for the API points to linker-generated code. After the stub successfully completes, the IAT entry points directly at the target function (see Figure 2). Another way to picture it is like this: within your executable, a delay-loaded DLL has a pseudo IAT with entries that are patched with the target address on demand.
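The patch-on-first-call idea can be tried out in portable C++ without any DLLs at all. The following is only a model under stated assumptions, not the code in DELAYHLP.CPP: TargetApi, FirstCallStub, ResolveByName, and g_pseudoIat are hypothetical names, and a string comparison stands in for LoadLibrary and GetProcAddress.

```cpp
#include <cassert>
#include <cstring>

using ApiFn = int (*)(int);

// Hypothetical stand-in for an API exported by a delay-loaded DLL.
static int TargetApi(int x) { return x + 1; }

// Plays the role of the linker-generated stub.
static int FirstCallStub(int x);

// One-slot model of a pseudo IAT: it starts out pointing at the stub.
static ApiFn g_pseudoIat[1] = { FirstCallStub };

// Counts how often the slow path (resolve and patch) has run.
static int g_resolveCount = 0;

// Stand-in for LoadLibrary + GetProcAddress: a lookup by name.
static ApiFn ResolveByName(const char* name) {
    return std::strcmp(name, "TargetApi") == 0 ? TargetApi : nullptr;
}

// Runs only on the first call: resolve, patch the slot, forward the call.
static int FirstCallStub(int x) {
    ++g_resolveCount;
    ApiFn real = ResolveByName("TargetApi");
    assert(real != nullptr);
    g_pseudoIat[0] = real;   // later calls no longer reach this stub
    return real(x);
}

// What a call site does: an indirect call through the table slot.
int CallTargetApi(int x) { return g_pseudoIat[0](x); }

int ResolveCount() { return g_resolveCount; }
```

Calling CallTargetApi several times and then checking ResolveCount shows the resolve step running once, which is the behavior the text attributes to the patched pseudo IAT entry.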

Figure 2  Using Pseudo Import Address Table


Now that you've seen the main structure of the __delayLoadHelper function, let's look at the notification hook functions that permeate its code. As __delayLoadHelper executes, it has provisions to call a user-supplied notification function at particular times. The notifications occur on these occasions: when the __delayLoadHelper function begins, before calling LoadLibrary, before calling GetProcAddress, and when the __delayLoadHelper function finishes processing.
If you provide a notification hook function, you can make it return values that short circuit some of the __delayLoadHelper code. For example, by returning a FARPROC address for the pre-GetProcAddress notification, __delayLoadHelper will use your value rather than calling GetProcAddress. Likewise, by returning an HMODULE from the pre-LoadLibrary notification, you'll bypass the LoadLibrary call that would ordinarily occur.
In the normal case that I've just described, notification hook functions won't actually be called because the function pointer used to call the notification hook is NULL. However, you can write a notification function in your code and override the function pointer to point at your notification function. You'll see this in the sample program later.
The notification hook function is prototyped like this:

 FARPROC WINAPI
 PfnDliHook(unsigned dliNotify, PDelayLoadInfo pdli);
The dliNotify parameter is one of the dliXXX enums specified in DELAYIMP.H. The pdli parameter is a pointer to a DelayLoadInfo structure that is also declared in DELAYIMP.H. The DelayLoadInfo structure contains everything you need to know about this particular API, including the API and DLL names. The DelayLoadInfo structure is constructed on the fly in the __delayLoadHelper function.
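To make the short-circuit behavior concrete, here is a portable model of a helper that consults an optional hook. It is a sketch under stated assumptions rather than the DELAYIMP.H interface: the Notify codes, HookInfo, ModelHelper, and LoggingHook are hypothetical names, HookInfo carries far less than the real DelayLoadInfo, and the load step is reduced to its notification.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

using Proc = int (*)();

// Hypothetical notification codes, modeled on the four occasions in the text.
enum Notify { kStart, kPreLoadLibrary, kPreGetProcAddress, kEnd };

// Simplified stand-in for DelayLoadInfo: just the two names.
struct HookInfo {
    const char* dllName;
    const char* procName;
};

using Hook = Proc (*)(Notify, const HookInfo*);

Hook g_notifyHook = nullptr;   // null by default, so no hook is called
std::vector<int> g_log;        // notifications seen by the hook, in order

static int RealProc()    { return 1; }
static int Replacement() { return 2; }

// Stand-in for GetProcAddress: every name resolves to RealProc.
static Proc Lookup(const char*) { return RealProc; }

static Proc CallHook(Notify n, const HookInfo* info) {
    return g_notifyHook ? g_notifyHook(n, info) : nullptr;
}

// Simplified helper: notify, let the hook override the lookup, notify again.
Proc ModelHelper(const char* dll, const char* proc) {
    HookInfo info{dll, proc};
    CallHook(kStart, &info);
    CallHook(kPreLoadLibrary, &info);
    Proc p = CallHook(kPreGetProcAddress, &info);
    if (p == nullptr) p = Lookup(proc);   // hook declined: normal lookup
    CallHook(kEnd, &info);
    return p;
}

// Example hook: record each notification and substitute one procedure.
Proc LoggingHook(Notify n, const HookInfo* info) {
    g_log.push_back(n);
    if (n == kPreGetProcAddress && std::strcmp(info->procName, "Special") == 0)
        return Replacement;
    return nullptr;   // returning 0 means "carry on as usual"
}
```

With g_notifyHook left null the helper never calls out. Once it points at LoggingHook, each ModelHelper call produces four log entries, and a request for "Special" receives the hook's substitute instead of the looked-up procedure.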
Beyond normal notification hooks, the delay-load code can also call a second function in the event of an error (for example, if the DLL wasn't found). This error hook function takes exactly the same arguments as the regular notification hook function. The only difference is that the dliXXX enums indicate failure states (dliFailLoadLib or dliFailGetProc).
As with the notification hook functions, there is no default implementation for the failure hooks. To provide one, write your own failure function, then declare a global variable named __pfnDliFailureHook that points to your failure function. The linker will use your __pfnDliFailureHook variable rather than the one in DELAYIMP.LIB, which contains a NULL pointer. By making your failure hook function return appropriate values, you can recover from the failure gracefully. For instance, if you receive a dliFailLoadLib failure notification, you might prompt the user for the location of the DLL and call LoadLibrary yourself. The resultant HMODULE would then be used as the failure hook return value.
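The recovery path can be modeled the same way. This sketch assumes a toy loader in which only "present.dll" exists; Module, TryLoad, LoadOrRecover, and UseFallbackCopy are hypothetical names, and what the real helper does when nothing recovers is deliberately not modeled.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical module handle: a tag string, null meaning "not loaded".
using Module = const char*;

// Hypothetical failure codes, modeled on dliFailLoadLib and dliFailGetProc.
enum Failure { kFailLoadLib, kFailGetProc };

using FailureHook = Module (*)(Failure, const char* dllName);

FailureHook g_failureHook = nullptr;   // null unless the program installs one

// Stand-in for LoadLibrary: only "present.dll" can be found.
static Module TryLoad(const char* name) {
    return std::strcmp(name, "present.dll") == 0 ? "present" : nullptr;
}

// Load step of a simplified helper: on failure, give the hook a chance to
// hand back a module it obtained some other way.
Module LoadOrRecover(const char* name) {
    Module m = TryLoad(name);
    if (m == nullptr && g_failureHook != nullptr)
        m = g_failureHook(kFailLoadLib, name);
    return m;   // still null if nothing recovered
}

// Example failure hook: pretend the user pointed us at another copy.
Module UseFallbackCopy(Failure why, const char*) {
    return why == kFailLoadLib ? TryLoad("present.dll") : nullptr;
}
```

Without a hook, LoadOrRecover("missing.dll") returns null. After g_failureHook is set to UseFallbackCopy, the same call returns the module the hook supplied, mirroring the idea of using the hook's HMODULE as the result.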

The Linker Gets into the Act
Earlier, I mentioned that when using /DELAYLOAD, the linker gets into the code-generation business. However, everything I've described so far with the __delayLoadHelper function and the hook functions doesn't require any assistance or code from the linker. So what does the Visual Studio® 6.0 linker do that actually makes this whole thing possible?
When you use /DELAYLOAD on a given DLL, the linker generates two different types of stubs. The first of these stubs is the "per-API" stub. The linker generates one of these stubs for each API called in the imported DLL. The linker assigns a name in the form of __imp_load_XXX for the stub, where XXX is the API name. For example, the per-API stub for a call to GetDesktopWindow looks like this:

 __imp_load__GetDesktopWindow@0:
 PUSH  ECX
 PUSH  EDX
 PUSH  __imp__GetDesktopWindow@0
 JMP   __tailMerge_USER32
This small snippet of stub code is worth scrutinizing carefully. The first two instructions (PUSH ECX and PUSH EDX) preserve the values of the ECX and EDX registers on the stack. The next instruction (PUSH __imp__GetDesktopWindow@0) pushes the address of the pseudo IAT entry for the GetDesktopWindow function. When I described the __delayLoadHelper code earlier, I mentioned that it patched the pseudo IAT entry before returning. This PUSH instruction is where __delayLoadHelper gets the location to patch. Before a delay load API is called for the first time, the pseudo IAT entry points to this per-API stub. This is how control gets to this stub rather than to the target API.
The final instruction of a per-API stub jumps to the second type of linker-generated stub ("per-DLL"). Thus, no matter how many functions you delay load from USER32.DLL and COMCTL32.DLL, there will still only be two per-DLL stubs: one for USER32.DLL and the other for COMCTL32.DLL. The linker names these stubs __tailMerge_XXX, where XXX is the name of the DLL. For example, the stub for USER32 looks like this:
 __tailMerge_USER32:
 PUSH  __DELAY_IMPORT_DESCRIPTOR_USER32
 CALL  ___delayLoadHelper@8
 POP   EDX
 POP   ECX
 JMP   EAX
The first instruction of the per-DLL stub pushes the address of a data structure that the linker has included elsewhere in the executable. This struct is of type ImgDelayDescr, defined in DELAYIMP.H. The ImgDelayDescr struct contains pointers to the DLL name, pointers to the DLL's pseudo IAT and INT, and various other items needed by the __delayLoadHelper function. This is the data structure pointed at by the IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT slot in the executable's DataDirectory.
Here's an important side note to people writing PE file utilities. All the pointer values in an ImgDelayDescr are virtual addresses (that is, normal linear addresses that can be used as pointers). The use of virtual addresses is in sharp contrast to the relative virtual addresses (RVAs) used by the IMAGE_IMPORT_DESCRIPTOR structure for regular imports. This use of virtual addresses rather than RVAs is unprecedented. I assume that this is because the notification hooks are passed a pointer to the ImgDelayDescr, so it wouldn't do to have hook implementors using RVAs.
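For a tool that reads the file on disk and otherwise works in RVAs, the practical consequence is a subtraction of the image's preferred load address. A minimal sketch, assuming 32-bit images and an imageBase value taken from the ImageBase field of the optional header; VaToRva and RvaToVa are hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Convert a virtual address stored in a delay-load descriptor to an RVA.
// imageBase is the preferred load address recorded in the PE optional header.
std::uint32_t VaToRva(std::uint32_t va, std::uint32_t imageBase) {
    assert(va >= imageBase);   // a VA below the base would be malformed
    return va - imageBase;
}

// The inverse conversion, as used for structures that store RVAs.
std::uint32_t RvaToVa(std::uint32_t rva, std::uint32_t imageBase) {
    return imageBase + rva;
}
```

For example, with an image base of 0x00400000, a descriptor pointer of 0x00405010 corresponds to RVA 0x5010, which can then be mapped to a file offset through the section table in the usual way.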
The next instruction of the per-DLL stub makes the actual call to __delayLoadHelper. The __delayLoadHelper returns the address of the target API in the EAX register. The next two stub instructions restore the ECX and EDX registers that were put on the stack by the per-API stub. The last instruction JMPs to the address in the EAX register (that is, the value returned by __delayLoadHelper function). Since __delayLoadHelper patched the pseudo IAT entry, the per-API stub only executes once. Subsequent calls go through the pseudo IAT directly to the target API. Likewise, the per-DLL stub only executes once for each delay-loaded API.
Figure 3  DelayLoadDemo Stubs


In Figure 3, I've shown what the stubs look like for the sample code I'll describe next. There are three per-API stubs: two for USER32 APIs, and one for a COMCTL32 API. You'll see that the per-API stubs in turn JMP to one of the two per-DLL stubs (one for USER32, the other for COMCTL32.DLL).
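The division of labor between the two kinds of stubs can also be modeled in portable C++. This is a reduced sketch with one DLL and two APIs, not the linker's output: ModelGetTopWindow, ModelGetDesktopWindow, TailMerge, and g_iat are hypothetical names, a table lookup stands in for __delayLoadHelper, and passing a slot pointer stands in for the PUSH of the pseudo IAT entry's address.

```cpp
#include <cassert>

using Api = int (*)();

// Hypothetical stand-ins for two APIs in one delay-loaded DLL.
static int ModelGetTopWindow()     { return 10; }
static int ModelGetDesktopWindow() { return 20; }

static int StubTop();
static int StubDesktop();

// Model pseudo IAT: each slot starts out pointing at its per-API stub.
Api g_iat[2] = { StubTop, StubDesktop };

// What the helper's lookup would find for each slot.
static const Api g_targets[2] = { ModelGetTopWindow, ModelGetDesktopWindow };

int g_tailCalls = 0;   // how many times the shared per-DLL tail has run

// Per-DLL tail: receives the slot to patch, resolves it, patches it,
// then transfers to the target (the JMP EAX step).
static int TailMerge(Api* slot) {
    ++g_tailCalls;
    Api real = g_targets[slot - g_iat];
    *slot = real;
    return real();
}

// Per-API stubs: each only identifies its own slot, then shares the tail.
static int StubTop()     { return TailMerge(&g_iat[0]); }
static int StubDesktop() { return TailMerge(&g_iat[1]); }

// What a call site does: an indirect call through the slot.
int CallSlot(int i) { return g_iat[i](); }
```

Calling slot 0 three times and slot 1 once leaves g_tailCalls at 2: the shared tail runs once per API, and every later call goes straight through the patched slot.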

The DelayLoadDemo Program
Figure 4 shows the code for the DelayLoadDemo program. There are two distinct parts to the program. Function main makes three calls to GetTopWindow, then a call to GetDesktopWindow, and finally calls InitCommonControls. This code does nothing useful in and of itself. However, it serves the purpose of setting up a scenario where you can see multiple per-API stubs and multiple per-DLL stubs.
In Figure 5, note that I've used /DELAYLOAD on both USER32.DLL and COMCTL32.DLL. This makes all of the API calls in function main delay loaded. To really see the concepts I've described, build DelayLoadDemo with debug information and step through it in a debugger using mixed source and assembly language. You'll see that the first time an API is called, control goes through the stubs I've described. I included two additional calls to GetTopWindow to show that on subsequent calls the pseudo IAT has been modified and the CALL instruction goes directly to the GetTopWindow code.
The DelayLoadDemo code beyond function main shows an implementation of a notification hook. The MSJCheezyDelayLoadHook code simply displays the basic information passed to it (specifically, the notification type, the DLL name, the load address of the DLL, and the name of the target API). Note that in all cases, I return 0 from the function. This causes __delayLoadHelper to proceed as if the notification hook hadn't been called.
Figure 6 shows the results of running DelayLoadDemo. Notice that for each of the three APIs, there's a dliStartProcessing, dliNotePreLoadLibrary, dliNotePreGetProcAddress, and dliNoteEndProcessing notification. More importantly, note that the set of notifications only occurs once for GetTopWindow, even though it's called three times in function main. This proves that subsequent calls bypass the __delayLoadHelper code.
If I had simply included MSJCheezyDelayLoadHook in DelayLoadDemo.cpp, it wouldn't have been called. I had to take the extra step of declaring a global variable pointer named __pfnDliNotifyHook and initializing it with the address of the MSJCheezyDelayLoadHook function. You can see this on the last line of the DelayLoadDemo.cpp code. By doing this, I overrode the __pfnDliNotifyHook variable that resides in DELAYIMP.LIB and contains a NULL pointer.

Warnings and Caveats
For starters, just because you use delay loading on a DLL doesn't mean that some other DLL isn't implicitly linking to it. It's easy to outsmart yourself here. Consider what would happen if my sample code had called InitCommonControls first, before GetTopWindow. After calling InitCommonControls, USER32.DLL would be loaded. Why? Because COMCTL32.DLL implicitly links to USER32.DLL. The subsequent call to GetTopWindow causes __delayLoadHelper to call LoadLibrary on USER32.DLL, even though it's already been loaded. The __delayLoadHelper code takes this scenario into account, but you still need to carefully consider the dependencies to see if using /DELAYLOAD is worthwhile.
Don't bother using /DELAYLOAD on KERNEL32.DLL. It simply won't work. Why not? The __delayLoadHelper function calls LoadLibrary and GetProcAddress, which live in KERNEL32.DLL, so KERNEL32.DLL must be implicitly linked in the normal manner.
If a DLL that you delay load performs per-process initializations in DllMain, you're likely to have problems. Often, DllMain assumes that no significant code in the process has executed when the DLL_PROCESS_ATTACH notification occurs. Likewise, you'll probably have problems if the delay-loaded DLL uses static Thread Local Storage—that is, __declspec(thread) variables. Static Thread Local Storage assumes that the DLL has been linked implicitly and has therefore seen every thread creation.
There are other factors to be aware of when using /DELAYLOAD in certain offbeat circumstances. The Visual C++ documentation goes into much more detail than I can here. Nonetheless, /DELAYLOAD is a much-needed feature, and hopefully you can use it to maximum advantage in future Win32 projects.

Have a suggestion for Under the Hood? Send it to Matt at mpietrek@tiac.com or http://www.tiac.com/users/mpietrek.

From the December 1998 issue of Microsoft Systems Journal.