Under the Hood, MSJ April 1998

Download apr98hoodcode.exe (10KB)

Matt Pietrek is the author of Windows 95 System Programming Secrets (IDG Books, 1995). He works at NuMega Technologies Inc., and can be reached at mpietrek@tiac.com or at http://www.tiac.com/users/mpietrek.

Here's a problem nearly every C++ programmer has encountered. In your code, you've made a call to a function in some DLL and the linker complains that it can't find the symbol. It usually doesn't take too long to figure out that you need to add another library (.LIB) file to the linker's command line. The only problem is, which .LIB file?
      From day one, certain formats have remained relatively constant in the Microsoft® Win32® development tools, and many tools have sprung up around them. For example, the Microsoft DUMPBIN utility can be used to display the contents of both Portable Executable files and COFF (Common Object File Format) .OBJ files. (For users of Visual Basic® 5.0, the command line
LINK -dump
is functionally the same as DUMPBIN.) However, when it comes to libraries, there seems to be a real dearth of tools that can intelligently tell you the contents of a COFF format .LIB file. All 32-bit Microsoft tools use COFF.
      Perhaps you need to know if a function is imported by name versus ordinal value. DUMPBIN isn't much help here. Sure, DUMPBIN has a few obscure command options for .LIB files (/ARCHIVEMEMBERS and /LINKERMEMBER, for example). But they just provide raw output of portions of the .LIB file. A few gurus can cast the runes of DUMPBIN's output to figure out what they're after. However, to really see what's in a .LIB file, you need either a good understanding of .LIB file structures or a tool that displays the .LIB contents in a meaningful manner. In this column, I'll provide some relief on both counts.
      While mucking about inside .LIB files might appear forbidding, they're really not complicated. Essentially, a .LIB file is just a collection of COFF format .OBJ files strung sequentially together. A table of contents at the beginning tells the linker where things are. Actually, there are two tables of contents, but this detail isn't important for the ensuing discussion.
      In my July 1997 column, I described the basic principles of how a linker works. The important factoid for this column is that a linker is responsible for resolving symbols between compilation units. For example, if MyFile1.CPP calls function FooBar in another source file, the linker has to locate the .OBJ file containing FooBar's binary code and include it in the finished executable. From the linker's perspective, a .LIB file is just a collection of .OBJ files. The table of contents in a .LIB file is a list of all the symbols from all the .OBJs contained in the library. For each symbol, the table of contents also indicates which .OBJ file the symbol came from. This mapping of a symbol name to an .OBJ file allows the linker to quickly bring in just the .OBJ from the .LIB file that it needs, while ignoring the rest of the library.
      You might be thinking, "What about import libraries? Aren't they special?" Under the Win32 COFF format, the answer is no. The linker resolves calls to DLL functions the same way as it does for internal (static) functions. The only real difference is that when you call a DLL function, the .OBJ file in the import library provides data for the executable's import table rather than code for the actual function.
      The data that an import library provides for an imported API is kept in several sections whose names all begin with .idata (for instance, .idata$4, .idata$5, and .idata$6). The .idata$5 section contains a single DWORD that, when the executable loads, contains the address of the imported function. The .idata$6 section (if present) contains the name of the imported function. When loading the executable into memory, the Win32 loader uses this string to do what amounts to a GetProcAddress call for the imported function.
      As I described in the July 1997 column, the linker lumps together sections that have the same name up to, but not including, the $. The portion after the $ is used to order the sections. Thus, all the .idata$4 sections are put in the executable contiguously, followed by all the .idata$5 sections, and finishing with all the .idata$6 sections. The linker's combining and sorting of sections is what builds the import address table (IAT) and other parts of the imports table in a finished executable. Not surprisingly, an executable's imports table is usually in a section that is named .idata.
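      The grouping rule is easy to model. The following is only a rough sketch, not linker code: the function names SplitSectionName and SortContributions are made up for illustration, and the sketch orders section names only, whereas a real linker also moves the section contents. Ordering between different groups is alphabetical here purely for convenience.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Split "name$suffix" into {"name", "suffix"}; a name without '$' gets an
// empty suffix. (Hypothetical helper, not part of any linker.)
std::pair<std::string, std::string> SplitSectionName(const std::string& name)
{
    const std::string::size_type dollar = name.find('$');
    if (dollar == std::string::npos)
        return { name, std::string() };
    return { name.substr(0, dollar), name.substr(dollar + 1) };
}

// Group contributions by the part before '$', then order them by the part
// after it. stable_sort keeps contributions with identical names in the
// order they arrived.
std::vector<std::string> SortContributions(std::vector<std::string> sections)
{
    std::stable_sort(sections.begin(), sections.end(),
        [](const std::string& a, const std::string& b)
        {
            return SplitSectionName(a) < SplitSectionName(b);
        });
    return sections;
}
```

      Feeding it .idata$6, .idata$4, .idata$5, and .idata$4 (as might arrive from several import .OBJs) yields both .idata$4 entries first, then .idata$5, then .idata$6.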
      If you've used OLE, COM, or ActiveX®, you probably remember that there are also .LIB files that are used for predefined class IDs (CLSIDs) and interface IDs (IIDs). Both CLSIDs and IIDs are forms of GUIDs, which are 16-byte unique values. If you poke around in one of these libraries (for instance, UUID.LIB), you'll see that the GUID values are stored in a section called .rdata. The linker takes all the referenced .rdata sections in the .LIB file and creates the .rdata section in the executable. Put differently, every GUID that you reference in your program reserves 16 bytes in the final executable.

The COFF .LIB File Structure
      Before I explain how a tool can provide an intelligent display of a .LIB file's contents, it's helpful to have a basic understanding of how COFF .LIBs are constructed. The first thing you'll need to tuck away in your memory banks is that in COFF the words "archive" and "library" are used interchangeably. The second tidbit to remember is that components of a .LIB file are referred to as members. Thus, a .LIB file is really just a series of contiguous archive members. With two exceptions that I'll get to momentarily, each archive member corresponds to an .OBJ file.
      All COFF .LIB files begin with an 8-byte header, which reads "!<arch>\n" when viewed as ASCII text. You can see this in WINNT.H as the #define for IMAGE_ARCHIVE_START. Following this header is the first of potentially many archive members. Each archive member begins with a structure called an IMAGE_ARCHIVE_MEMBER_HEADER, which is also defined in WINNT.H. This structure contains information such as the member's name and size. Interestingly, one of the strings in an archive member header is in the octal number format. Yes, these throwbacks to computing's infancy continue to rattle around in today's supercharged barn-burners.
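      To make the header layout concrete, here is a minimal sketch that checks the archive signature and decodes two of the text fields. The struct is a local copy of the IMAGE_ARCHIVE_MEMBER_HEADER layout so that the code builds without the Windows headers. The helper names (IsArchive, FieldToNumber, MemberSize, MemberMode) are hypothetical, and treating Mode as the octal field and Size as decimal follows Microsoft's PE/COFF documentation rather than anything stated above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Same layout as IMAGE_ARCHIVE_MEMBER_HEADER in WINNT.H: 60 bytes, every
// field is space-padded ASCII text.
struct ArchiveMemberHeader
{
    char Name[16];
    char Date[12];
    char UserID[6];
    char GroupID[6];
    char Mode[8];       // octal text
    char Size[10];      // decimal text: size of the member data that follows
    char EndHeader[2];
};
static_assert(sizeof(ArchiveMemberHeader) == 60, "unexpected padding");

// The 8-byte signature that WINNT.H calls IMAGE_ARCHIVE_START.
static const char kArchiveStart[] = "!<arch>\n";

bool IsArchive(const void* file, std::size_t size)
{
    return size >= 8 && std::memcmp(file, kArchiveStart, 8) == 0;
}

// Convert a fixed-width, space-padded text field to a number. Copying into a
// std::string first guarantees a terminator for strtoul.
unsigned long FieldToNumber(const char* field, std::size_t width, int base)
{
    const std::string text(field, width);
    return std::strtoul(text.c_str(), nullptr, base);
}

unsigned long MemberSize(const ArchiveMemberHeader& h)
{
    return FieldToNumber(h.Size, sizeof(h.Size), 10);
}

unsigned long MemberMode(const ArchiveMemberHeader& h)
{
    return FieldToNumber(h.Mode, sizeof(h.Mode), 8);
}
```

      A tool that walks the whole archive would use MemberSize to step from one member header to the next.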
      The first two archive members in a COFF .LIB file are special. Instead of .OBJ files, they act as a table of contents to the other archive members (that is, to the .OBJs). These are called linker members (see the IMAGE_ARCHIVE_LINKER_MEMBER #define in WINNT.H). These members map a symbol name (for instance, _CreateProcessA@40) to the offset of the archive member containing the code or data associated with that symbol. The two special linker members both contain the same information. The only difference is in how the symbol names are sorted.
       Figure 1 shows the format of a names linker member. Following the IMAGE_ARCHIVE_MEMBER_HEADER is a DWORD with the number of symbols in the library. Next is an array of DWORD offsets to other archive members in the library. Following the DWORD array is a series of null-terminated symbol name strings. Each successive entry in the DWORD array corresponds to the next string in the string table.
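      The layout just described can be walked with very little code. The sketch below is not the LibDump source; ParseFirstLinkerMember and ReadBigEndian32 are hypothetical names. It assumes the caller has already skipped the member header, and it assumes the count and offsets are stored big-endian, which matches the ConvertBigEndian discussion later in this column.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Read a 32-bit big-endian value one byte at a time (works on any host).
std::uint32_t ReadBigEndian32(const unsigned char* p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// 'data' points just past the member header of the first linker member and
// 'size' is the length of the member data. Returns (symbol name, archive
// member offset) pairs; stops early if the data looks truncated.
std::vector<std::pair<std::string, std::uint32_t>>
ParseFirstLinkerMember(const unsigned char* data, std::size_t size)
{
    std::vector<std::pair<std::string, std::uint32_t>> symbols;
    if (size < 4)
        return symbols;

    const std::uint32_t count = ReadBigEndian32(data);
    if (count > (size - 4) / 4)         // the offset array must fit
        return symbols;

    const unsigned char* offsets = data + 4;
    const char* names =
        reinterpret_cast<const char*>(offsets + 4 * std::size_t(count));
    const char* end = reinterpret_cast<const char*>(data + size);

    // The i-th offset belongs to the i-th string in the string table.
    for (std::uint32_t i = 0; i < count && names < end; ++i)
    {
        const void* nul = std::memchr(names, '\0', end - names);
        if (!nul)
            break;                      // unterminated string
        const char* terminator = static_cast<const char*>(nul);
        symbols.emplace_back(std::string(names, terminator),
                             ReadBigEndian32(offsets + 4 * std::size_t(i)));
        names = terminator + 1;
    }
    return symbols;
}
```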

Figure 1 Names Linker Member

Figure 2 Archive Member

      The format of the other non-names archive members is even simpler. It's just an archive member header, followed by an .OBJ file. If you're not familiar with the layout of an .OBJ file, it consists of an IMAGE_FILE_HEADER followed by one or more IMAGE_SECTION_HEADER structures, one for each code or data section. Next comes the raw code and data for the sections. Bringing up the rear is the symbol table, which correlates symbol names to specific locations in the .OBJ's code and data. All of these data structures are the same as those used in executable files, and are described in WINNT.H. Figure 2 shows the layout of one of these .OBJ-based archive members.

Inside LibDump
      If you really understand everything I just described, you could use DUMPBIN with the /ALL option to figure out anything you might want to know about a .LIB file. For example, if you needed to know what the import ordinal for the CreateUpDownControl API is, you'd run DUMPBIN /ALL on COMCTL32.LIB. In the beginning of DUMPBIN's output, you'd find the string "CreateUpDownControl". On the same line would be the offset of the matching .OBJ file. You'd then search the dump output for the archive member at that file offset. Somewhere within the information for that .OBJ, you'd locate the raw data for .idata$5, which reads:

RAW DATA #5
  00000000  10 00 00 80

      Converting the four bytes into a DWORD and accounting for the little Endian nature of Intel CPUs, you have a value of 0x80000010. Removing the high bit (which means the symbol is exported by ordinal only) gives an export ordinal of 0x10, or 16 decimal. What a major pain! This is where the LibDump program jumps in and does all the hard work of interpreting .LIB files for you.
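      The same arithmetic, written out as a small sketch. The function names are hypothetical; the flag constant has the same value that current versions of WINNT.H define as IMAGE_ORDINAL_FLAG32.

```cpp
#include <cstdint>

// Assemble a DWORD from four bytes stored least significant byte first.
std::uint32_t ReadLittleEndian32(const unsigned char* p)
{
    return  std::uint32_t(p[0])        | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// High bit set means "by ordinal".
const std::uint32_t kOrdinalFlag = 0x80000000u;

bool IsImportByOrdinal(std::uint32_t thunk)
{
    return (thunk & kOrdinalFlag) != 0;
}

// The ordinal itself lives in the low WORD.
std::uint16_t OrdinalFromThunk(std::uint32_t thunk)
{
    return std::uint16_t(thunk & 0xFFFFu);
}
```

      For the bytes 10 00 00 80, ReadLittleEndian32 gives 0x80000010, IsImportByOrdinal is true, and OrdinalFromThunk returns 16.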
      LibDump's mission statement is simple: for each symbol name in the .LIB file's first names member, LibDump tries to figure out what type of symbol it is and prints out all the relevant data. For example, LibDump determines if a symbol is imported by ordinal or by name. If it's by ordinal, LibDump shows you the ordinal value. If by name, LibDump shows the actual symbol name and the name that appears in the imports table (for instance, the symbol _CreateProcessA@40 is imported as CreateProcessA). If the symbol appears to be a GUID, LibDump displays the 16 bytes of GUID data as you'd expect to see it. If all else fails, LibDump just displays the symbol name. This would be the case for static functions and variables.
      LibDump is a console-mode program that takes one command-line argument, the name of the .LIB file to work with. Function main in LibDump.CPP (see Figure 3) opens a memory-mapped file using the command-line argument as the file name. I've wrapped all memory-mapped file code in a C++ class called MEMORY_MAPPED_FILE, which is implemented in MemoryMappedFile.H and MemoryMappedFile.CPP (see Figure 4). I'm simply reusing the MEMORY_MAPPED_FILE class from a previous column, so I won't describe the class methods here.
      Once function main has mapped the .LIB file into memory successfully, it verifies that the file begins with the expected string that starts an archive. Following that string is the first archive member, which (as I described earlier) contains the names of the symbols in the .LIB, along with the offset of the matching .OBJ file. The second half of function main simply locates and iterates through both the member offset array and the matching string table. For nearly every symbol, function main passes the name and archive member offset to the DisplayLibInfoForSymbol function. (I'll explain why a few symbols are excluded later.)
      The DisplayLibInfoForSymbol function is where all the action occurs for figuring out what type of symbol it is. The code doesn't bother with the IMAGE_ARCHIVE_MEMBER_HEADER at all. Instead, it immediately skips to the IMAGE_FILE_HEADER that begins the .OBJ. For LibDump's purposes, the important thing in the IMAGE_FILE_HEADER is how many IMAGE_SECTION_HEADERs follow it. LibDump uses this number to loop through each entry in the array of IMAGE_SECTION_HEADERs.
      As the code loops through the IMAGE_SECTION_HEADERs, it's looking for sections with specific names. The presence or absence of a particular section gives a good indication of what type of symbol LibDump is working with. Any .OBJ that contains .idata$5 or .idata$6 sections is probably for an imported API. In my experience, you'll always see an .idata$5 section for an imported API. If the import is via ordinal, the DWORD value in the .idata$5 section has the high bit set, and the low WORD is the ordinal value. Otherwise, the .idata$5 DWORD is zero.
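      A section search along these lines might look like the following. This is a sketch rather than the code from Figure 3: FindSection is a hypothetical name, the two structs are local copies of the IMAGE_FILE_HEADER and IMAGE_SECTION_HEADER layouts so the code builds without WINNT.H, and the headers are copied out with memcpy so that unaligned members and truncated files are handled safely. It assumes a little-endian host, as LibDump's targets were.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct FileHeader                       // layout of IMAGE_FILE_HEADER
{
    std::uint16_t Machine;
    std::uint16_t NumberOfSections;
    std::uint32_t TimeDateStamp;
    std::uint32_t PointerToSymbolTable;
    std::uint32_t NumberOfSymbols;
    std::uint16_t SizeOfOptionalHeader;
    std::uint16_t Characteristics;
};

struct SectionHeader                    // layout of IMAGE_SECTION_HEADER
{
    char          Name[8];              // null-padded, not always terminated
    std::uint32_t VirtualSize;
    std::uint32_t VirtualAddress;
    std::uint32_t SizeOfRawData;
    std::uint32_t PointerToRawData;     // offset from the start of the .OBJ
    std::uint32_t PointerToRelocations;
    std::uint32_t PointerToLinenumbers;
    std::uint16_t NumberOfRelocations;
    std::uint16_t NumberOfLinenumbers;
    std::uint32_t Characteristics;
};

static_assert(sizeof(FileHeader) == 20, "unexpected padding");
static_assert(sizeof(SectionHeader) == 40, "unexpected padding");

// Search the .OBJ image at 'obj' for a section called 'name' (at most eight
// characters). On success, copies its header into 'found' and returns true.
bool FindSection(const unsigned char* obj, std::size_t size,
                 const char* name, SectionHeader& found)
{
    FileHeader fh;
    if (size < sizeof(fh))
        return false;
    std::memcpy(&fh, obj, sizeof(fh));

    // Section headers follow the file header (plus the optional header,
    // which is normally empty in an .OBJ).
    std::size_t pos = sizeof(fh) + fh.SizeOfOptionalHeader;
    for (unsigned i = 0; i < fh.NumberOfSections; ++i, pos += sizeof(SectionHeader))
    {
        if (pos + sizeof(SectionHeader) > size)
            return false;               // truncated image
        SectionHeader sh;
        std::memcpy(&sh, obj + pos, sizeof(sh));
        if (std::strncmp(sh.Name, name, sizeof(sh.Name)) == 0)
        {
            found = sh;
            return true;
        }
    }
    return false;
}
```

      Once FindSection succeeds for .idata$5, the DWORD at obj + found.PointerToRawData is the value whose high bit distinguishes an ordinal import from a named one.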
      If the symbol is imported by name, you'll also find a .idata$6 section in the .OBJ. The first WORD of this section is the "hint" ordinal, which is essentially useless these days. Immediately following the hint ordinal is the null-terminated ASCII string with the name of the API as it will appear in the executable's import table.
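      Decoding that section takes only a few lines. In this sketch, ParseImportByName and the ImportByName struct are hypothetical names (the layout itself corresponds to IMAGE_IMPORT_BY_NAME in WINNT.H), and the caller is assumed to have already located the raw bytes of the .idata$6 section.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

struct ImportByName
{
    std::uint16_t hint;   // the "hint" ordinal
    std::string   name;   // name as it appears in the import table
};

// 'data' and 'size' describe the raw bytes of one .idata$6 section.
// Returns false if the section is too small or the name is unterminated.
bool ParseImportByName(const unsigned char* data, std::size_t size,
                       ImportByName& out)
{
    if (size < 3)                       // WORD hint plus at least a terminator
        return false;

    out.hint = std::uint16_t(data[0] | (data[1] << 8));   // little-endian WORD

    const char* text = reinterpret_cast<const char*>(data + 2);
    const void* nul = std::memchr(text, '\0', size - 2);
    if (!nul)
        return false;

    out.name.assign(text, static_cast<const char*>(nul));
    return true;
}
```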
      If the .OBJ file doesn't have any .idata$ sections, but does have an .rdata section, then the symbol may be a GUID symbol (for example, _IID_IDispatch). In the LibDump code, I took the cheesy approach of checking the size of the .rdata section. If it's exactly 0x10 bytes long, and if there's only one .rdata section in the .OBJ, LibDump assumes it's a GUID symbol and displays it as such.
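      The size check and the GUID formatting can be sketched as follows. LooksLikeGuidSection and FormatGuid are hypothetical names, not LibDump's own. The formatting assumes the usual in-memory GUID layout on Intel, in which the first three fields (a DWORD and two WORDs) are little-endian and the remaining eight bytes are stored in order.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// The heuristic from the text: exactly one .rdata section, exactly 16 bytes.
bool LooksLikeGuidSection(unsigned rdataSectionCount, std::size_t rdataSize)
{
    return rdataSectionCount == 1 && rdataSize == 0x10;
}

// Format 16 raw bytes in registry style: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.
std::string FormatGuid(const unsigned char b[16])
{
    char text[39];                      // 38 characters plus the terminator
    std::snprintf(text, sizeof(text),
        "{%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X}",
        b[3], b[2], b[1], b[0],         // Data1, byte-swapped
        b[5], b[4],                     // Data2, byte-swapped
        b[7], b[6],                     // Data3, byte-swapped
        b[8], b[9],                     // Data4[0..1], as stored
        b[10], b[11], b[12], b[13], b[14], b[15]);
    return text;
}
```

      For the 16 bytes of IID_IDispatch (00 04 02 00 00 00 00 00 C0 00 00 00 00 00 00 46), FormatGuid returns {00020400-0000-0000-C000-000000000046}.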
      This trick for GUID symbol detection works great for some .LIBs, including the all-important UUID.LIB. However, many COM import libraries lump multiple GUIDs into a single .rdata section. LibDump isn't smart enough to catch these cases. To make it do this would require LibDump to read the COFF symbol table at the end of the .OBJ file. While it could be done, it would add quite a bit to the LibDump code and detract from the "small is beautiful" approach I took.
      If you've noticed the ConvertBigEndian function, good catch! It turns out that in COFF format .LIBs and .OBJs, certain fields are stored in the big Endian format. Intel CPUs use the opposite format, known as little Endian. In a big Endian number, the most significant bytes are at lower addresses. In my conversion function, I could have used the Intel BSWAP instruction to convert from big Endian to little Endian format, but then the code wouldn't run on DEC Alpha-based systems.
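      A portable byte swap needs nothing more than shifts and masks. The sketch below shows one common way to write it; SwapBytes32 is a hypothetical name, and this is not necessarily how ConvertBigEndian in Figure 3 is implemented.

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit value without any CPU-specific
// instructions, so the same source works on any processor.
std::uint32_t SwapBytes32(std::uint32_t v)
{
    return  (v >> 24)                 |
           ((v >> 8)  & 0x0000FF00u)  |
           ((v << 8)  & 0x00FF0000u)  |
            (v << 24);
}
```

      On a little-endian machine, applying this to a DWORD read straight from a big-endian field gives the intended value; for example, SwapBytes32(0x12345678) is 0x78563412.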
      The IsRegularLibSymbol function is just a convenient way to filter the output so that certain symbols don't appear in the output. The brief synopsis is that the .LIB file contains some symbols that are used by the linker but have no connection to the user's code. For example, the symbol __IMPORT_DESCRIPTOR_COMCTL32 appears in COMCTL32.LIB. Rather than cluttering up the LibDump output with these symbols, the IsRegularLibSymbol function looks for certain patterns in the symbol names and returns FALSE if it looks like the symbol didn't originate from user code.
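      A filter in this spirit could be as simple as the sketch below. Only the __IMPORT_DESCRIPTOR_ pattern comes from the text above. The other two patterns are assumptions based on symbols that commonly appear in import libraries, and the function name LooksLikeUserSymbol is made up so as not to suggest that this is the actual IsRegularLibSymbol code.

```cpp
#include <string>

static bool StartsWith(const std::string& s, const std::string& prefix)
{
    return s.compare(0, prefix.size(), prefix) == 0;
}

static bool EndsWith(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns false for symbols that exist only for the linker's benefit.
bool LooksLikeUserSymbol(const std::string& name)
{
    if (StartsWith(name, "__IMPORT_DESCRIPTOR_"))   // named in the text
        return false;
    if (name == "__NULL_IMPORT_DESCRIPTOR")         // assumption
        return false;
    if (EndsWith(name, "_NULL_THUNK_DATA"))         // assumption
        return false;
    return true;
}
```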

Running LibDump
      To wrap up, let's look at the results of running LibDump on a couple of standard .LIB files. Figure 5 is a snippet of LibDump's output for COMCTL32.LIB (the import library for the common controls DLL). Note in the Type column how some APIs are imported by ordinal (ORDN), like _CreateStatusWindowA@16. Other APIs are imported by name (NAME), such as _CreateToolbarEx@52. Also, note that when importing by name, LibDump shows the name that will actually appear in the imports table (using the prior example, CreateToolbarEx).
      Figure 6 shows a portion of the results from running LibDump on UUID.LIB (the library that holds most of the standard GUIDs). Each line indicates that it's for a GUID, and then shows the GUID formatted as you'd see it under the HKEY_CLASSES_ROOT key of the registry. In the LibDump output I've selected, you'll find category IDs (CATID_xxx), class IDs (CLSID_xxx), GUIDs, and IIDs (IID_xxx).
      LibDump packs quite a lot of useful functionality into a small amount of code. If you're up to it, there are several things I can think of to improve it. One idea would be to make it scan for a particular symbol across multiple .LIB files. Another would be to add a real GUI to it. If you were to add code to read the symbol table at the end of the .OBJ members, you could significantly expand the interpretation for a given symbol. Even without these features, I hope LibDump will be a welcome addition to your toolset for Win32-based programming.

Have a question about programming in Windows? Send it to Matt at mpietrek@tiac.com

From the April 1998 issue of Microsoft Systems Journal.