Under the Hood, MSJ November 1999


Code for this article: Nov99Hood.exe (31KB)

Matt Pietrek does advanced research for the NuMega Labs of Compuware Corporation, and is the author of several books. His Web site at http://www.tiac.net/users/mpietrek has a FAQ page and information on previous columns and articles.

Recently I attended a customer council meeting at the lab where I work. During the lunch break I got into a conversation with two customers, both with huge applications. They were bemoaning the fact that it's so difficult to figure out where their program's memory is going.
     In the Mem Usage column of the Windows NT® Task Manager, you can see that the memory used by a process may rise or fall drastically. A common developer problem is that an application's memory usage as shown by TaskMan goes up, but your typical memory-tracking tools don't show corresponding memory or resource leaks. The underlying reason for this discrepancy is that most tools focus on heap-allocated memory. This is a very narrow view of the world!
     It's easy for people to forget that memory consumed by a process is much more than just its calls to malloc or new. Generally speaking, almost all the memory used by a process falls into one of these categories:
Executable code in a loaded module
Read-only data in a loaded module (including resources)
Writable memory in a module (for example, the .data section)
Win32® heaps (including the default heap)
Suballocated heaps (for example, from the Visual C++® runtime library)
VirtualAlloced memory
Memory-mapped files
Thread stacks
Environment
System data structures (including the Thread Information Block and page tables)
     In its normal course of execution, Windows® pages memory in and out of the process address space. For instance, pages making up the code and data areas of a loaded DLL don't use any physical memory until something references them. When a reference occurs, only the touched page is mapped in. Likewise, when system memory gets tight, Windows can swap out pages of code and data. In this very dynamic situation, limiting yourself to watching heap allocations quickly leads to frustration.
MemDiff: A First Stab
     Pondering this problem, it came to me that viewing the process memory space at the page level is a more logical approach than monitoring heaps. Thus was born the MemDiff library. MemDiff is a crude yet effective way to see how your process's use of memory changes between two points in your code. Although I would have liked to have made MemDiff work on Windows 95 and Windows 98, too many APIs I needed are only available in Windows NT.
     MemDiff consists of three simple functions. The first is MDTakeSnapshot, which takes a process handle and returns a snapshot handle. A snapshot is a relatively compact representation of your process's address space at the time you call it. In the simplest usage scenario, you'll make two calls to MDTakeSnapshot, once before the target section of code, and again after the target code has executed.
     The second function, MDCompareSnapshot, compares the two snapshots to report how much and where your address space has changed. In the report, logically related pages (for instance, adjacent pages in a heap) are lumped together. In addition, it attempts to provide a meaningful description of the pages.
     MDCompareSnapshot writes its output to a standard Win32 file handle, which you provide. This gives you flexibility for where the report goes. MDCompareSnapshot has an optional verbose parameter. If it's not specified, the default is a nonverbose report. This mode is usually easier to work with. More on this later.
     The third function is MDFreeSnapshot. After calling MDCompareSnapshot, you'll want to pass the snapshot handle to MDFreeSnapshot. As you'd expect, it frees the associated memory, which can be a nontrivial amount. A snapshot is good for only one comparison. As I'll show later, the act of comparing two snapshots destroys some of the data. I could have worked around this at the expense of additional memory and code complexity. I chose the easy way out, and put safeguards in the code to prevent erroneous results from using a snapshot more than once.
     To have the fewest side effects on your code, the downloadable sources build MemDiff to a static .LIB file that you link into your application or DLL. The .LIB file is built with the static multithreaded C++ runtime library, LIBCMT(D).LIB. If you want to use MemDiff from a language that doesn't support static .LIBs (such as Visual Basic®), feel free to play with the project settings to make the MemDiff project compile as a DLL. If you do this, be sure to export the three functions mentioned previously.
Interpreting MemDiff Results
     To create results for discussing here, I wrote a small sample program that exercises the MemDiff code. Figure 1 shows the code for MemDiffDemo.cpp, which uses all three MemDiff APIs. Intermixed with the two MDTakeSnapshot calls, the program plays around with heap memory, loads and unloads DLLs, VirtualAllocs memory, and opens a memory-mapped file. The goal is to create an address space scenario where MDCompareSnapshot has a variety of things to display.
     For output, MemDiffDemo writes to stdout. This lets me see the results in a console window or redirect the output to a file with the > redirection operator. In your own code, you'll probably want to use CreateFile or some other API that returns a handle WriteFile can work with. Heck, go nuts and use named pipes to see MemDiff results from a process running on another machine. Who says I haven't embraced n-tier computing?
     Figure 2 shows the normal, nonverbose output from running MemDiffDemo. The first line shows the net difference in memory between the first and second snapshots. A negative value means less memory was being used when Snapshot2 was taken. The next line, Memory allocated, is how much newly paged-in memory was in Snapshot2. Line three, Memory freed, tells how much memory was present during Snapshot1, but not during Snapshot2.
     The fourth line breaks down the new memory in Snapshot2 into private and shared amounts. Private memory is memory used solely by your process. This includes heap memory and normal (non-shared) data sections in EXEs or DLLs. Shared memory is potentially usable by more than just your process. When looking for what's leaking in your application, the private memory is usually more likely to be of concern.
     Prime examples of shared memory are code pages and read-only data sections. Because these pages of memory don't change, the operating system can use the CPU hardware to map the physical pages of RAM into multiple address spaces. For example, every process uses code in NTDLL.DLL, but the same physical pages of RAM holding NTDLL.DLL's code are shared between all processes. Remember though, just because the memory can be shared doesn't mean that your process isn't the only one using it.
     After the initial four lines, the remainder of the MemDiff output is an accounting of memory that's unique to each snapshot. Although internally MemDiff is working with a raw list of physically present pages, it tries to coalesce related blocks. Under Snapshot2 in Figure 2, you'll see a handful of memory blocks of different sizes. For example, starting at address 0x00910000 are four pages of memory (16KB) that belong to a Win32 heap. A bit later, you'll see 8KB used by USER32.DLL, 20KB used by GDI32.DLL, and so on. Stop and think about this. Various amounts of memory are being reported for DLLs. The DLLs listed were all loaded before Snapshot1 took place, yet Snapshot2 shows that they're using additional memory. How can this be the case?
     What you're seeing is the demand page loading of Windows NT in action. The code and data of an EXE or DLL aren't assigned to physical RAM until the EXE or DLL is read from, written to, or executed. Whatever I did between the two MemDiffDemo snapshots caused additional pages from USER32, KERNEL32, GDI32, and NTDLL to be mapped in.
     In this case, the additional DLL pages were mapped in because MemDiffDemo loaded WININET.DLL, and then unloaded it. Note that no pages from WININET.DLL show up in the report, since it's completely gone from memory. However, whatever WININET.DLL did in its DllMain caused pages from other system DLLs to be mapped in. This is a great example of how your reported memory usage can go up without you doing anything wrong.
     Figure 3 shows a snippet of the verbose variation of MemDiff's results. To get this version of the output, set the final parameter of MDCompareSnapshot to true. The primary difference in the report is that all pages in the coalesced blocks must be contiguous and have exactly the same page attributes.
     Since coalesced blocks in verbose mode must have the same attribute, you'll often see multiple blocks for a given DLL (although not in the case of Figure 3). For instance, in a single DLL you may see one block reported for the code pages, another block for the resources, and a third block for the writable data section. Since all pages in a verbose report block have the same attributes, I included the attributes in the output. Refer to the QueryWorkingSet documentation for a description of the possible page attributes.

The MemDiff Source
     Figure 4 shows the primary guts of MemDiff. MDTakeSnapshot uses QueryWorkingSet from PSAPI.DLL, which I've included in the download files (Nov99Hood.exe (31KB)) in case it's not on your system. QueryWorkingSet returns an array of DWORDs, with each DWORD representing the address and page attributes of a page mapped into the address space. Since I don't know in advance how much memory QueryWorkingSet needs for the entire address space, I call VirtualAlloc in a loop until I get enough memory to hold all the page DWORDs.
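     To make the "address and page attributes" packing concrete, here is a minimal sketch of pulling one entry apart. It assumes the 32-bit layout given in the QueryWorkingSet documentation (protection in bits 0-4, share count in bits 5-7, the shared flag in bit 8, and the virtual page number in bits 12-31). The struct and function names are invented for illustration and are not taken from the MemDiff sources.

```cpp
#include <cstdint>

// Illustrative decoded form of one working set entry (names are assumptions).
struct WorkingSetEntry {
    std::uint32_t pageAddress;  // page-aligned virtual address
    unsigned protection;        // raw 5-bit protection code
    unsigned shareCount;        // 3-bit share count (maximum value 7)
    bool shared;                // true if the page is shareable
};

// Split a raw DWORD into its fields, assuming the documented 32-bit layout.
inline WorkingSetEntry DecodeWorkingSetEntry(std::uint32_t raw) {
    WorkingSetEntry e;
    e.pageAddress = raw & 0xFFFFF000u;           // bits 12-31
    e.protection  = raw & 0x1Fu;                 // bits 0-4
    e.shareCount  = (raw >> 5) & 0x7u;           // bits 5-7
    e.shared      = ((raw >> 8) & 0x1u) != 0;    // bit 8
    return e;
}
```

     The low 12 bits are free for attributes because every page address is a multiple of 4KB on x86.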
     The snapshot handle that MDTakeSnapshot returns is just a pointer to the VirtualAlloced memory. At the beginning of the snapshot memory is a MEMDIFF_SNAPSHOT structure. Immediately following the MEMDIFF_SNAPSHOT is the array of DWORDs that QueryWorkingSet fills in. The MEMDIFF_SNAPSHOT members let me verify the validity of snapshots passed to the other MemDiff functions, determine how many pages are in the snapshot, and perform other housekeeping duties.
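     The header-plus-array layout described above can be sketched roughly as follows. The field names, the signature value, and the use of calloc in place of VirtualAlloc are all assumptions made for illustration. The real MEMDIFF_SNAPSHOT definition is in the downloadable sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical header; the field names are assumed, not the article's.
struct SnapshotHeader {
    std::uint32_t signature;   // marks the memory as a snapshot
    std::uint32_t pageCount;   // number of page entries after the header
    bool alreadyCompared;      // set once the snapshot has been consumed
};

const std::uint32_t kSnapshotSignature = 0x4D444946u;  // arbitrary tag value

// Total bytes needed: header plus one DWORD per page.
inline std::size_t SnapshotBytes(std::uint32_t pageCount) {
    return sizeof(SnapshotHeader) + pageCount * sizeof(std::uint32_t);
}

// The page array starts immediately after the header.
inline std::uint32_t* SnapshotPages(SnapshotHeader* snap) {
    return reinterpret_cast<std::uint32_t*>(snap + 1);
}

// Stand-in for the VirtualAlloc call; calloc also returns zero-filled memory.
inline SnapshotHeader* AllocateSnapshot(std::uint32_t pageCount) {
    void* mem = std::calloc(1, SnapshotBytes(pageCount));
    if (!mem) return nullptr;
    SnapshotHeader* snap = static_cast<SnapshotHeader*>(mem);
    snap->signature = kSnapshotSignature;
    snap->pageCount = pageCount;
    snap->alreadyCompared = false;
    return snap;
}

// A handle is usable only if it looks like a snapshot and hasn't been compared.
inline bool IsUsableSnapshot(const SnapshotHeader* snap) {
    return snap != nullptr
        && snap->signature == kSnapshotSignature
        && !snap->alreadyCompared;
}
```

     A check along the lines of IsUsableSnapshot is one way to enforce the "good for only one comparison" rule: the comparison code sets the flag, and any later call with the same handle is rejected.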
     The meat of the code is called from the MDCompareSnapshot function. The code first verifies that both snapshot handle parameters are valid. Next, the FilterOutCommonPages function throws out all pages that shouldn't be in the reports that follow. It does this by sorting and comparing both snapshots. A page that's in both snapshots is automatically thrown out. In addition, the pages holding the snapshot data itself are thrown out. Throwing out a page means setting its value in the snapshot array to 0, which is why a snapshot is only good for one comparison. Feel free to improve on this quick-and-dirty algorithm.
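     The sort-and-compare idea can be sketched in a few lines of portable C++. This is not the FilterOutCommonPages code from Figure 4. It assumes entries are matched by page address only (ignoring the attribute bits), holds the page arrays in std::vector for brevity, and leaves out the step that discards the pages holding the snapshot data itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint32_t kPageAddressMask = 0xFFFFF000u;

// Order entries by page address, ignoring the attribute bits.
inline bool PageLess(std::uint32_t a, std::uint32_t b) {
    return (a & kPageAddressMask) < (b & kPageAddressMask);
}

// Sort both snapshots, then zero every entry whose page appears in both.
// Afterward, the nonzero entries are the pages unique to each snapshot.
inline void FilterCommonPages(std::vector<std::uint32_t>& first,
                              std::vector<std::uint32_t>& second) {
    std::sort(first.begin(), first.end(), PageLess);
    std::sort(second.begin(), second.end(), PageLess);

    std::size_t i = 0, j = 0;
    while (i < first.size() && j < second.size()) {
        std::uint32_t a = first[i] & kPageAddressMask;
        std::uint32_t b = second[j] & kPageAddressMask;
        if (a < b) {
            ++i;                 // page only in the first snapshot
        } else if (b < a) {
            ++j;                 // page only in the second snapshot
        } else {
            first[i++] = 0;      // page in both: throw it out of each
            second[j++] = 0;
        }
    }
}
```

     Because the filtering overwrites entries in place, the original snapshot contents are gone afterward, which is exactly why a snapshot can't be reused.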
     After filtering down the snapshot data, MDCompareSnapshot then calls the SummaryReport and DetailedReport functions. Both functions write their output to the file handle you designate. The summary report is the simple stuff at the beginning of the output. It simply spins through both snapshots, counting the different types of pages as it goes. A little math and voilà, your summary results!
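     The "little math" might look something like the sketch below. It assumes the snapshots have already been filtered so that discarded pages are 0, that pages are 4KB, and that the shared flag sits in bit 8 of each entry as in the QueryWorkingSet documentation. The Summary struct and Summarize function are illustrative names, not the article's SummaryReport code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

const long kPageBytes = 4096;               // x86 page size
const std::uint32_t kSharedFlag = 0x100u;   // bit 8 of a working set entry

struct Summary {
    long allocatedBytes;    // pages only in the second snapshot
    long freedBytes;        // pages only in the first snapshot
    long netBytes;          // allocated minus freed; negative means usage dropped
    long newPrivateBytes;   // private portion of allocatedBytes
    long newSharedBytes;    // shared portion of allocatedBytes
};

// Count the surviving (nonzero) entries in each filtered snapshot.
inline Summary Summarize(const std::vector<std::uint32_t>& first,
                         const std::vector<std::uint32_t>& second) {
    Summary s = {0, 0, 0, 0, 0};
    for (std::size_t i = 0; i < first.size(); ++i) {
        if (first[i] != 0) s.freedBytes += kPageBytes;
    }
    for (std::size_t i = 0; i < second.size(); ++i) {
        if (second[i] == 0) continue;
        s.allocatedBytes += kPageBytes;
        if (second[i] & kSharedFlag) s.newSharedBytes += kPageBytes;
        else                         s.newPrivateBytes += kPageBytes;
    }
    s.netBytes = s.allocatedBytes - s.freedBytes;
    return s;
}
```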
     The DetailedReportHelper function is much more complicated since it has the onerous task of identifying where a block of pages comes from. First, though, the function has to coalesce related ranges of pages. This is where the fVerbose parameter to MDCompareSnapshot comes in. With fVerbose set to false, the code lumps together all pages that have the same allocation base as reported by VirtualQuery. With fVerbose set to true, all pages in a coalesced block must be contiguous, have the same attributes as reported in the QueryWorkingSet bits, and must share the same allocation base.
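     The two coalescing rules can be sketched as follows. To keep the example portable, the allocation base that VirtualQuery would report is passed in as a field rather than queried, and the input is assumed to be sorted by address. PageInfo, Block, and CoalescePages are hypothetical names. The real logic is in DetailedReportHelper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct PageInfo {
    std::uint32_t address;         // page-aligned address
    std::uint32_t attributes;      // attribute bits from the working set entry
    std::uint32_t allocationBase;  // what VirtualQuery would report for this page
};

struct Block {
    std::uint32_t start;           // address of the first page in the block
    std::uint32_t allocationBase;
    std::size_t pageCount;
};

// Group pages into blocks. The input must be sorted by address.
inline std::vector<Block> CoalescePages(const std::vector<PageInfo>& pages,
                                        bool verbose) {
    const std::uint32_t kPageBytes = 4096;
    std::vector<Block> blocks;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        const PageInfo& page = pages[i];
        bool extend = false;
        if (i > 0) {
            const PageInfo& prev = pages[i - 1];
            // Both modes require the same allocation base.
            extend = (page.allocationBase == prev.allocationBase);
            // Verbose mode also requires adjacency and identical attributes.
            if (verbose) {
                extend = extend
                      && page.address == prev.address + kPageBytes
                      && page.attributes == prev.attributes;
            }
        }
        if (extend) {
            ++blocks.back().pageCount;
        } else {
            Block b = {page.address, page.allocationBase, 1};
            blocks.push_back(b);
        }
    }
    return blocks;
}
```

     With the same input, the verbose rule can only produce the same number of blocks or more, which matches the multiple-blocks-per-DLL behavior described for the verbose report.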
     Once coalesced, DetailedReportHelper first tries to identify the block by calling GetModuleFileName. This easily finds pages that belong to the memory image of a loaded EXE or DLL. If the page isn't in that category, I next check to see if the block is in one of the process Win32 heaps. This includes the default heap (GetProcessHeap) and any heaps created by HeapCreate. Suballocator-style heaps, which use VirtualAlloc and partition the memory themselves, won't be detected. The current Visual C++ new and malloc fall into this category.
     The heap identification code is written as a class (CProcessHeaps), and implemented in ProcessHeaps.cpp and ProcessHeaps.h. This code, which can be found at the link at the top of this article, was written to be fast rather than excruciatingly accurate. It's possible for some blocks to slip by and not be identified. Feel free to fix my implementation to use the HeapWalk API at the expense of CPU time. While you're at it, you could add code to search whatever suballocator-style heaps are present. Happy hunting!
     Meanwhile back at the block type identification code, Marlin has a few words on insurance. Sorry, wrong storyline! If the block isn't identified by GetModuleFileName, or by the heap code, the options are dwindling. PSAPI.DLL has the GetMappedFileNameA API, which tells you if the page is from a memory-mapped file. If it's not a memory-mapped file, you're out of luck (at least in this episode). The memory could be plain old VirtualAlloced memory, additional stack pages, or who knows what. If you have a good algorithm for identifying what an arbitrary page of memory is, by all means try it out. If it works well, let me know.
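     The overall identification order (module, then Win32 heap, then memory-mapped file, then give up) is a simple chain of fallbacks. The sketch below shows only that control flow. The BlockIdentifier type and DescribeBlock function are invented for illustration, and each identifier would wrap one of the real checks (GetModuleFileName, the heap lookup, GetMappedFileNameA), none of which is called here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// An identifier returns true and fills in `description` if it recognizes
// the block that starts at `address`. (Hypothetical interface.)
typedef bool (*BlockIdentifier)(std::uint32_t address, std::string& description);

// Try each identifier in order; the first one that succeeds wins.
inline std::string DescribeBlock(std::uint32_t address,
                                 const std::vector<BlockIdentifier>& identifiers) {
    for (std::size_t i = 0; i < identifiers.size(); ++i) {
        std::string description;
        if (identifiers[i](address, description)) {
            return description;
        }
    }
    return "<unknown>";  // nothing recognized the block
}
```

     Keeping the checks in an ordered list makes it easy to experiment with additional identifiers, such as one for thread stacks or for a suballocator-style heap.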

MemDiff Caveats
     I've already mentioned some of the restrictions on the MemDiff library as currently written. It's not easy to use from Visual Basic or other languages that don't support static .LIBs. The solution is to rebuild the code as a DLL and be aware of the side effects. And of course, snapshots are only valid for one comparison.
     Beyond these usage restrictions, darker things lie. The big one is the potential for misidentifying a block of memory, particularly in the detailed report for the first snapshot. When a snapshot is taken, MemDiff doesn't attempt to identify each block in the snapshot. To do so would take extra time and memory. When the snapshots are compared, the identification code only has the process state at the time of the comparison to lean on. Since the process state can change dramatically between snapshot and comparison time, there's plenty of opportunity for error.
     You can easily construct scenarios where a page of memory used for one thing at the time of a snapshot is used for something else entirely at the time of identification (inside MDCompareSnapshot). The simplest example is a DLL that's in place during Snapshot1 and then unloads, after which another DLL loads at the same address by the time of Snapshot2. The block identification code is likely to identify Snapshot1 pages as belonging to the wrong DLL. Alternatively, pages may go away entirely between snapshots. In this case, the identification code fails completely and reports <unknown>.
     A final note on strange results from MemDiff. You may occasionally see pages with addresses above 2GB. Usually these addresses are something like 0xC017F000. These are pages that Windows NT is using for storing the process memory map. If the process memory space grows enough, the kernel-mode code that manages the process memory space needs to allocate another page of memory to hold page table entries. Note that the discussion here assumes a "normal" Windows NT-based system with a maximum user-mode address of 2GB. It's possible to boot Windows NT in the /3GB mode where specially marked processes can use addresses up to 0xC0000000.
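     As a rough worked example of why such addresses appear, the arithmetic below maps a page-table page back to the user-mode range it describes. It assumes the classic non-PAE x86 arrangement: page tables mapped starting at 0xC0000000, 4-byte page table entries, and 4KB pages, so each page of entries covers 4MB of address space. The function name is made up for this sketch.

```cpp
#include <cstdint>

// Assumed base of the page table mapping on non-PAE x86 Windows NT.
const std::uint32_t kPageTableBase = 0xC0000000u;

// Given the address of a page that holds page table entries, return the
// first virtual address those entries describe.
inline std::uint32_t FirstAddressMappedBy(std::uint32_t pageTablePage) {
    std::uint32_t firstEntryIndex = (pageTablePage - kPageTableBase) / 4u;  // 4 bytes per entry
    return firstEntryIndex * 4096u;                                          // 4KB per entry
}
```

     Under those assumptions, the page at 0xC017F000 holds the entries for the 4MB region starting at 0x5FC00000, so seeing it show up in Snapshot2 suggests the process touched memory in that region for the first time between the two snapshots.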

Some Final Words
     The MemDiff library has the potential to help in many circumstances. However, the code is still pretty rough. Given the restrictions of time and space, I leaned toward simplicity. I'd expect MemDiff to be tweaked and modified by others to better report on their specific scenarios.
     Among ideas for improvement, you could identify and tag memory pages at the time of a snapshot, look for and include compiler runtime heaps, and identify thread stack memory. (Remember, there might be multiple threads in the process.) If your code VirtualAllocs memory and has identifiable patterns to the data, it's easy enough to add code to look for your specific data.
     Finally, a big thank you to Osiris Pedroso at Autodesk for helping me test MemDiff. Aside from just making suggestions, Osiris ran MemDiff on AutoCAD itself. The output files he sent back made me knuckle down to come up with better algorithms than I originally wrote. MemDiff is significantly better for his help.

From the November 1999 issue of Microsoft Systems Journal.
Have a suggestion for Under the Hood? Send it to Matt at mpietrek@tiac.com or http://www.tiac.com/users/mpietrek.
