Murali R. Krishnan
Microsoft Corporation
February 1999
Summary: Discusses common heap performance problems and how to protect against them. (9 printed pages)
Are you a happy-go-lucky user of dynamically allocated C/C++ objects? Do you use Automation extensively for communicating back and forth between modules? Is it possible that your program is slow because of heap allocation? You are not alone. Almost all projects run into heap issues sooner or later. The common tendency is to say, "It's the heap that's slow and my code is really good." Well, that is partially right. It pays to understand more about the heap, how it is used, and what can go wrong.
(If you already know what a heap is, you can jump ahead to the section "What Are Common Heap Performance Problems?")
A heap is used for allocating and freeing objects dynamically for use by the program. Heap operations are called for when:
- The number and size of objects needed by the program are not known ahead of time.
- An object is too big to fit into a stack allocator.
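For example (a minimal illustration; the function name is invented), a buffer sized at run time cannot be a fixed-size stack array and has to come from the heap:

#include <cstddef>

// count is known only at run time, so the storage must be
// allocated dynamically from the heap.
double* AllocSamples(std::size_t count) {
    return new double[count];   // the caller releases it with delete[]
}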
A heap uses parts of memory outside of what is allocated for the code and stack during run time. The different layers of heap allocators, from the API level down to virtual memory, are described below.
GlobalAlloc/GlobalFree: Microsoft® Win32® heap calls that talk directly to the per-process default heap.
LocalAlloc/LocalFree: Win32 heap calls (for compatibility in Microsoft Windows NT®) that talk directly to the per-process default heap.
COM's IMalloc allocator (or CoTaskMemAlloc / CoTaskMemFree): These functions use the default per-process heap. Automation uses the Component Object Model (COM) allocator, so its requests also land on the per-process heap.
C/C++ Run-time (CRT) allocator: Provides malloc() and free() as well as the new and delete operators. (Languages such as Microsoft Visual Basic® and Java also offer a new operator, but rely on garbage collection rather than explicit frees.) The CRT creates its own private heap, which resides on top of the Win32 heap.
In Windows NT, the Win32 heap is a thin layer surrounding the Windows NT run-time allocator. All APIs forward their requests to the NTDLL.
The Windows NT run-time allocator provides the core heap allocator within Windows NT. It consists of a front-end allocator with 128 free lists (127 fixed-size lists for blocks of 8 to 1,024 bytes, plus a grab-bag list for larger blocks) and a back-end allocator that uses virtual memory to reserve and commit pages.
At the bottom of this stack is the Virtual Memory Allocator, which reserves and commits pages used by the OS. All allocators use the virtual memory facility for accessing the data.
Shouldn't it be simple to allocate and free blocks? Why would this take a long time?
Traditionally, the operating system and run-time libraries come with an implementation of the heap. At the beginning of a process, the OS creates a default heap called Process heap. The Process heap is used for allocating blocks if no other heap is used. Language run times also can create separate heaps within a process. (For example, C run time creates a heap of its own.) Besides these dedicated heaps, the application program or one of the many loaded dynamic-link libraries (DLLs) may create and use separate heaps. Win32 offers a rich set of APIs for creating and using private heaps. See the MSDN Platform SDK node for an excellent tutorial on Heap Functions.
When applications or DLLs create private heaps, these live in the process space and are accessible process-wide. Any data allocated from a given heap must be freed back to that same heap. (Allocating from one heap and freeing to another makes no sense.)
The heap sits on top of the operating system's Virtual Memory Manager in all virtual memory systems. The language run-time heaps reside on top of the virtual memory, as well. In some cases, these are layered on OS heaps, but the language run-time heap performs its own memory management by allocating large blocks. Bypassing the OS heap to use the virtual memory functions may enable a heap to do a better job of allocating and using blocks.
Typical heap implementations consist of front-end and back-end allocators. The front-end allocator maintains a free list of fixed-sized blocks. On an allocation call, the heap attempts to find a free block from the front-end list. If this fails, the heap is forced to allocate a large block from the back end (reserving and committing virtual memory) to satisfy the request. Common implementations have per-block allocation overhead, which costs execution cycles and also reduces available storage.
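Here is a minimal sketch of that front-end/back-end split (an illustration only, not any real heap's implementation; it handles a single block size, takes no lock, and never returns memory to the system):

#include <windows.h>

struct FreeBlock { FreeBlock* next; };

const size_t BLOCK_SIZE  = 64;          // one fixed block size, for simplicity
const size_t REGION_SIZE = 64 * 1024;   // back-end allocation granularity

FreeBlock* g_freeList = NULL;           // the front-end free list

void* Alloc() {
    if (g_freeList) {                   // front end: reuse a freed block
        FreeBlock* b = g_freeList;
        g_freeList = b->next;
        return b;
    }
    // Back end: reserve and commit a fresh region and split it into blocks.
    char* region = (char*)VirtualAlloc(NULL, REGION_SIZE,
                                       MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (region == NULL) return NULL;
    for (size_t off = BLOCK_SIZE; off < REGION_SIZE; off += BLOCK_SIZE) {
        FreeBlock* b = (FreeBlock*)(region + off);
        b->next = g_freeList;           // thread the rest onto the free list
        g_freeList = b;
    }
    return region;                      // hand out the region's first block
}

void Free(void* p) {                    // freed blocks simply rejoin the list
    FreeBlock* b = (FreeBlock*)p;
    b->next = g_freeList;
    g_freeList = b;
}

A real heap must also take the lock described below on every one of these operations, which is where contention enters the picture.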
The Knowledge Base article Q10758, "Managing Memory with calloc() and malloc()" (search on article ID number), contains more background on these topics. Also, a detailed discussion on the heap implementations and designs can be found in "Dynamic Storage Allocation: A Survey and Critical Review" by Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. In International Workshop on Memory Management, Kinross, Scotland, UK, September 1995 (http://www.cs.utexas.edu/users/oops/papers.html).
Windows NT's implementation (Windows NT version 4.0 and later) uses 127 free lists of 8-byte aligned blocks ranging from 8 to 1,024 bytes and a grab-bag list. The grab-bag list (free list[0]) holds blocks greater than 1,024 bytes in size. Each free list links its blocks together in a doubly linked list. By default, the Process heap performs coalescing operations. (Coalescing is the act of combining adjacent free blocks to build a larger block.) Coalescing costs additional cycles but reduces fragmentation of heap blocks.
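The size-to-list mapping can be pictured roughly as follows (a guess at the arithmetic based on the description above, not actual Windows NT source; the exact boundary handling is an assumption):

#include <stddef.h>

// Blocks are 8-byte aligned; small requests map to one of the 127 fixed-size
// lists, and anything at or above 1,024 bytes falls into the grab-bag list[0].
int FreeListIndex(size_t size) {
    size_t rounded = (size + 7) & ~(size_t)7;   // round up to an 8-byte multiple
    return (rounded >= 1024) ? 0 : (int)(rounded / 8);
}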
A single global lock protects the heap against multithreaded usage. (See the first commandment in "Server Performance and Scalability Killers" by George Reilly on the MSDN Online Web Workshop at http://msdn.microsoft.com/workshop/server/iis/tencom.asp.) This lock is essential to protecting the heap data structures from random access across multiple threads. This lock can have an adverse impact on performance when heap operations are too frequent.
The most common obstacles you will encounter when working with the heap are the cost of the allocation and free operations themselves and, above all, contention over the heap lock.
Contention usually leads to context switches between threads and processes. Context switches are very costly, but even more costly is the loss of data from the processor cache and the cost of rebuilding that data when the thread is later brought back to life.
Contention is what slows down both allocation and free operations. Ideally, we would like a heap with no contention and fast alloc/free. Alas, such a general-purpose heap does not exist yet, though it might sometime in the future.
In all the server systems (such as IIS, MSProxy, DatabaseStacks, Network servers, Exchange, and others), the heap lock is a BIG bottleneck. The larger the number of processors, the worse the contention.
Now that you understand the problems with the heap, don't you want the magic wand that can eliminate these problems? I wish there were one. But there is no magic to make the heap go faster, so don't expect to make things faster in the last week before shipping the product. Plan your heap strategy early, and you will be far better off. Altering the way you use the heap, and reducing the number of heap operations, is a solid strategy for improving performance.
How do you reduce the use of heap operations? You can reduce the number of heap operations by exploiting locality inside data structures. Consider the following example:
struct ObjectA {
    // data for objectA
};

struct ObjectB {
    // data for objectB
};

// Use of ObjectA and ObjectB together.

//
// Use pointers
//
struct ObjectB {
    struct ObjectA * pObjA;   // ObjectA is allocated and freed separately
    // data for objectB
};

//
// Use embedding
//
struct ObjectB {
    struct ObjectA objA;      // ObjectA lives inside ObjectB: one allocation
    // data for objectB
};

//
// Aggregation - use ObjectA and ObjectB inside another object
//
struct ObjectX {
    struct ObjectA objA;
    struct ObjectB objB;
};
Chunking (combining several objects into a single allocation, as in the embedding and aggregation examples above) reduces the number of alloc/free calls. It is also processor cache-friendly, especially for the L1 cache, because of the increased locality it offers, not to mention that some of the data blocks end up in the same virtual page.
The savings you will gain by using the preceding techniques will vary with object types, sizes, and workload, but you will always gain in performance and scalability. On the downside, the code will be a bit specialized, but it remains manageable if well thought out.
The sections that follow describe a few more techniques for enhancing speed: taking advantage of the improved Windows 2000 heap, allocation caches, the MP heap package, and rethinking your data structures and Automation usage.
Thanks to the efforts and hard work of several folks, a few significant improvements went into the Microsoft Windows® 2000 heap in early 1998, most notably a sharp reduction in heap lock contention.
These improvements eliminate the need for allocation caches, but do not preclude other optimizations. Evaluate your code with the Windows NT 5.0 heap; it should be close to optimal for blocks of less than 1,024 bytes (blocks served by the front-end allocator). GlobalAlloc() and LocalAlloc() build on the same heap and are common mechanisms to access the per-process heap. Use the Heap* APIs to access the per-process heap, or to create your own heaps for allocation operations, if high localized performance is desired. You can also use VirtualAlloc() / VirtualFree() directly for large block operations.
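For instance, a private heap and a direct virtual memory allocation look like this (error handling omitted):

#include <windows.h>

void Demo() {
    HANDLE heap = HeapCreate(0, 0, 0);        // initial size 0, growable
    void*  obj  = HeapAlloc(heap, 0, 128);    // small block from the private heap
    HeapFree(heap, 0, obj);
    HeapDestroy(heap);                        // releases everything at once

    // Large blocks can bypass the heap entirely:
    void* big = VirtualAlloc(NULL, 1 << 20,   // 1 MB
                             MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    VirtualFree(big, 0, MEM_RELEASE);
}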
The described improvements made their way into Windows 2000 beta 2 and Windows NT 4.0 SP4. After the changes, the rate of heap lock contention fell significantly. This benefits all direct users of Win32 heaps. The CRT heap is built on top of the Win32 heap, but it uses its own small-block heap and hence does not benefit from the Windows NT changes. (Visual C++ version 6.0 also has an improved heap allocator.)
An allocation cache allows you to cache allocated blocks for future reuse. This can reduce the number of alloc/free calls to the process heap (or global heap), as well as allow maximum reuse of the blocks once allocated. In addition, allocation caches permit you to gather statistics to gain a better understanding of object usage at the higher level.
Typically, a custom heap allocator is implemented on top of the process heap. The custom heap allocator behaves much like the system heap. The major difference is that it provides a cache on top of the process heap for the allocated objects. The caches are designed for a fixed set of sizes (for example, 32 bytes, 64 bytes, 128 bytes, and so on). This is a good strategy, but this sort of custom heap allocator misses the semantic information related to the objects that are allocated and freed.
In contrast to the custom heap allocators, "Alloc-caches" are implemented as a per-class allocation cache. They can retain a lot of semantic information, in addition to providing all the goodies of the custom heap allocator. Each allocation cache handler is associated with one object in the target binary. It can be initialized with a set of parameters indicating the concurrency level, size of the object, number of elements to keep in the free list, and so on. The allocation cache handler object maintains its own private pool of freed entities (not exceeding the specified threshold) and uses a private lock for protection. Together, the allocation cache and private locks reduce the traffic to the main system heap, thus providing increased concurrency, maximum reuse, and higher scalability.
A scavenger is required periodically to check the activity of all the allocation cache handlers and reclaim unused resources. If and when no activity is found, the pool of allocated objects can be freed, reclaiming unused memory.
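A minimal sketch of such a per-class allocation cache, including a hook for the scavenger, might look like the following (a hypothetical class, not from any SDK; it omits the statistics gathering described next and assumes the cached type is at least pointer-sized):

#include <windows.h>

template <class T, unsigned MaxFree>
class AllocCache {
    struct Node { Node* next; };
    Node*            m_freeList;
    unsigned         m_freeCount;
    CRITICAL_SECTION m_lock;        // private lock: keeps traffic off the heap lock
public:
    AllocCache() : m_freeList(NULL), m_freeCount(0) {
        InitializeCriticalSection(&m_lock);
    }
    ~AllocCache() { Scavenge(); DeleteCriticalSection(&m_lock); }

    void* Allocate() {              // raw memory for one T; construct in place
        EnterCriticalSection(&m_lock);
        Node* n = m_freeList;
        if (n) { m_freeList = n->next; --m_freeCount; }
        LeaveCriticalSection(&m_lock);
        return n ? n : HeapAlloc(GetProcessHeap(), 0, sizeof(T));
    }

    void Free(void* p) {            // call after destroying the T
        EnterCriticalSection(&m_lock);
        if (m_freeCount < MaxFree) {             // keep it for reuse
            Node* n = (Node*)p;
            n->next = m_freeList;
            m_freeList = n;
            ++m_freeCount;
            p = NULL;
        }
        LeaveCriticalSection(&m_lock);
        if (p) HeapFree(GetProcessHeap(), 0, p); // over threshold: hand it back
    }

    void Scavenge() {               // periodic reclaim of the idle pool
        EnterCriticalSection(&m_lock);
        Node* n = m_freeList;
        m_freeList = NULL;
        m_freeCount = 0;
        LeaveCriticalSection(&m_lock);
        while (n) {
            Node* next = n->next;
            HeapFree(GetProcessHeap(), 0, n);
            n = next;
        }
    }
};

A handler such as AllocCache<Request, 64> would then front all allocations of the (hypothetical) Request class, with the private lock and free list absorbing most of the traffic that would otherwise hit the process heap.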
Each alloc/free activity can be audited. The first level of information includes a total count of objects, allocations, and free calls made. You can derive the semantic relationship between various objects by looking at their statistics. The relationship can be used to reduce memory allocation using one of the many techniques just explained.
Allocation caches also act as a debugging aid in helping you track down the number of objects that are not properly cleaned up. You can even find the exact callers at fault by looking at dynamic stack back traces and signatures, in addition to the objects that were not cleaned up.
MP heap is a package for multiprocessor-friendly distributed allocation and is available in the Win32 SDK (Windows NT 4.0 and later). Originally implemented by JVert, the heap abstraction here is built on top of the Win32 heap package. MP heap creates multiple Win32 heaps and attempts to distribute allocation calls to different heaps with the goal of reducing the contention on any single lock.
This package is a good step: a sort of improved, MP-friendly custom heap allocator. However, it offers no semantic information and lacks statistics. A common way to use the MP heap is as an SDK library. You can benefit greatly if you create a reusable component with this SDK. If you build this SDK library into every DLL, however, you will increase your working set.
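The core distribution trick can be sketched as follows (an illustration of the idea only, not the SDK package's actual interface):

#include <windows.h>

const int NUM_HEAPS = 4;            // for example, one per processor
HANDLE g_heaps[NUM_HEAPS];

void InitHeaps() {
    for (int i = 0; i < NUM_HEAPS; ++i)
        g_heaps[i] = HeapCreate(0, 0, 0);   // growable, serialized heaps
}

void* MpAlloc(size_t size, HANDLE* fromHeap) {
    // Hash the calling thread onto a heap; contention now splits NUM_HEAPS ways.
    *fromHeap = g_heaps[GetCurrentThreadId() % NUM_HEAPS];
    return HeapAlloc(*fromHeap, 0, size);
}

void MpFree(HANDLE fromHeap, void* p) {
    HeapFree(fromHeap, 0, p);       // a block must go back to the heap it came from
}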
To scale on multiprocessor machines, algorithms, implementation, data structures, and hardware have to scale dynamically. Look at the data structures most commonly allocated and freed. Ask yourself, "Can I get the job done with a different data structure?" For example, if you have a list of read-only items that is loaded at application initialization time, this list does not need to be a linear-linked list. It can very well be a dynamically allocated array. A dynamically allocated array would reduce heap blocks in memory, reduce fragmentation, and therefore give you a performance boost.
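For example (a hypothetical read-only table):

struct Entry {
    // read-only item data
};

// Linked-list version: one heap block per item, plus a link field apiece.
struct EntryNode {
    Entry      data;
    EntryNode* next;
};

// Array version: a single heap block for all items, contiguous in memory.
Entry* LoadEntries(unsigned count) {
    return new Entry[count];        // one allocation at initialization time
}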
Reducing the number of small objects needed reduces the load on the heap allocator. For example, we used five different objects in our server's critical processing path, each separately allocated and freed. Caching the objects together reduced heap calls from five to one and dramatically reduced the load on the heap, especially when we were processing more than 1,000 requests per second.
If you make extensive use of Automation structures, consider factoring out Automation BSTRs from your mainline code, or at least avoid repeated operations on BSTR. (BSTR concatenation leads to excessive reallocs and alloc/free operations.)
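For example, rather than growing a BSTR piece by piece, size the result once; the sketch below assumes three fragments and trims error handling:

#include <windows.h>
#include <oleauto.h>
#include <string.h>

// One SysAllocStringLen call replaces a chain of concatenations, each of
// which would allocate a new BSTR and free the old one.
BSTR ConcatOnce(const OLECHAR* a, const OLECHAR* b, const OLECHAR* c) {
    UINT lenA = lstrlenW(a), lenB = lstrlenW(b), lenC = lstrlenW(c);
    BSTR result = SysAllocStringLen(NULL, lenA + lenB + lenC);  // single alloc
    if (result) {
        memcpy(result, a, lenA * sizeof(OLECHAR));
        memcpy(result + lenA, b, lenB * sizeof(OLECHAR));
        memcpy(result + lenA + lenB, c, lenC * sizeof(OLECHAR));
    }
    return result;  // SysAllocStringLen provides the terminating null
}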
Heap implementations tend to stay general-purpose across platforms, and hence carry heavy overhead. Every program's code has its own specific requirements, but its design can incorporate the principles discussed in this article to reduce heap interaction.
Murali Krishnan is a lead software design engineer with the Internet Information Server (IIS) team. He has worked on IIS since version 1.0 and has successfully shipped versions 1.0 through 4.0. Murali formed and led the IIS performance team for three years (1995-1998), and has influenced IIS performance from day one. He holds an M.S. from the University of Wisconsin-Madison and a B.S. from Anna University, India, both in Computer Science. Outside work, he reads, plays volleyball, and cooks at home.