Performance Considerations in Memory Management

How an application uses its chosen memory allocators is just as important as choosing them in the first place. An application that spends a significant portion of its time in the heap allocator is not well designed; it should use custom allocators that create objects with less overhead.

Multithreaded server applications can use the following features of HeapAlloc to improve their performance:

Give a thread its own heap (locality) if its dynamic memory needs warrant this.
Pass the HEAP_NO_SERIALIZE flag to the Win32 HeapAlloc function to eliminate critical section waits and improve scalability. If you use this flag, however, you must serialize access to the heap yourself or risk corrupting it.
Eliminate the need to free individual items by creating a temporary heap and destroying the whole heap when done. This avoids the overhead of freeing each item and improves locality, thus improving performance.
Eliminate the overhead of checking for memory allocation failure by using the HEAP_GENERATE_EXCEPTIONS flag to HeapAlloc, and handling memory allocation failure as an exception.

To maximize performance by increasing locality, it may still be necessary for the application to create custom allocators for various types of objects. Using HeapAlloc alone may not suffice.
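
The temporary-heap and custom-allocator ideas above can be sketched in portable C as a simple arena allocator. This is an illustration only, not the Win32 heap API: the names Arena, arena_create, arena_alloc, and arena_destroy are hypothetical, the 16-byte rounding is an assumed alignment policy, and the sketch does no locking, so it assumes each arena is used by one thread at a time (the same trade-off that HEAP_NO_SERIALIZE implies).

```c
#include <stddef.h>
#include <stdlib.h>

#define ARENA_ALIGN 16u  /* assumed rounding granularity for each item */

/* Hypothetical arena: one block obtained up front, carved out by bumping an offset. */
typedef struct Arena {
    unsigned char *base;  /* start of the backing block, or NULL */
    size_t capacity;      /* total bytes in the backing block */
    size_t used;          /* bytes handed out so far */
} Arena;

/* Returns 1 on success, 0 if the backing block could not be allocated. */
int arena_create(Arena *a, size_t capacity)
{
    a->base = (unsigned char *)malloc(capacity);
    a->capacity = a->base ? capacity : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Bump allocation: no per-item header, no free list, no lock.
   Returns NULL when the arena is exhausted or was never created. */
void *arena_alloc(Arena *a, size_t size)
{
    size_t rounded = (size + (ARENA_ALIGN - 1)) & ~(size_t)(ARENA_ALIGN - 1);
    void *p;

    if (a->base == NULL)
        return NULL;
    /* rounded < size catches wrap-around for very large requests */
    if (rounded < size || rounded > a->capacity - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* Releases every item at once; individual items are never freed. */
void arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

A request handler might create one arena per transaction, allocate all of its short-lived objects from it, and call arena_destroy once at the end. Items allocated together also sit next to each other in memory, which is the locality benefit described above.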

Another C runtime function, _alloca, should be used instead of the heap allocator when the item to be allocated is of function scope, requires only short-term memory use, and its size can vary. The _alloca function simply extends the function’s stack frame, so it is very fast. When the function returns, the allocation evaporates.
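
As a minimal sketch of this pattern, the function below uses a stack allocation for a scratch buffer whose size is known only at run time. The STACK_ALLOC macro and the function name are hypothetical; the macro maps to _alloca with the Microsoft C runtime and, as an assumption for other platforms, to alloca from <alloca.h>. The sketch also assumes the input is short enough to fit comfortably in the thread's stack.

```c
#include <stddef.h>
#include <string.h>

#if defined(_MSC_VER)
#include <malloc.h>               /* declares _alloca in the Microsoft C runtime */
#define STACK_ALLOC(n) _alloca(n)
#else
#include <alloca.h>               /* assumed equivalent on glibc-style systems */
#define STACK_ALLOC(n) alloca(n)
#endif

/* Returns 1 if text reads the same in both directions once spaces are removed. */
int is_palindrome_ignoring_spaces(const char *text)
{
    size_t len = strlen(text);
    size_t n = 0;
    size_t i;
    /* The scratch buffer lives in this function's stack frame.
       The +1 avoids a zero-size request for an empty string. */
    char *scratch = (char *)STACK_ALLOC(len + 1);

    /* Copy the non-space characters into the scratch buffer. */
    for (i = 0; i < len; ++i) {
        if (text[i] != ' ')
            scratch[n++] = text[i];
    }

    /* Compare the buffer against its own mirror image. */
    for (i = 0; i < n / 2; ++i) {
        if (scratch[i] != scratch[n - 1 - i])
            return 0;
    }
    return 1;  /* no free() call: the buffer goes away when the function returns */
}
```

Because the buffer belongs to the stack frame, it must never be returned to the caller or stored anywhere that outlives the function.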

It is not unusual for applications that run well on single-processor systems to degrade in performance on a multiprocessor system. This is almost always the result of excessive contention for locks, particularly for critical sections. The Win32 GlobalAlloc function, the OLE IStorage and IStream functions, and similar allocator functions all use critical sections for synchronization. On a single-processor system, contention on these critical sections is very rare, since the thread that owns the critical section is unlikely to be preempted. On a multiprocessor system, such contention is much more likely, since more than one thread is running simultaneously.

The easiest way to determine whether an application is having trouble with contention is to run the application first on a single-processor machine and then on a multiprocessor machine and compare the context switches per second and the system calls per second. Both will increase when there is contention. When there is no contention on a critical section, acquiring it is very fast and requires no context switches or system calls. When there is contention (one thread tries to acquire the critical section while another thread holds it), additional system calls and context switches are required so that the requesting thread can wait until the owning thread has released the critical section.

Other useful statistics for uncovering a performance deterioration caused by an allocation problem are the number of context switches per transaction and the number of system calls per transaction. If these numbers change dramatically with the number of processors or the amount of load on the system, the application probably has an allocation problem.
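
The per-transaction comparison can be sketched as a small calculation. Everything here is illustrative: the Sample structure, the function names, and the factor threshold are assumptions, and the counter values are expected to come from whatever monitoring tool is already used to collect context switches, system calls, and completed transactions over a measurement interval.

```c
/* Hypothetical sample: counter totals over one measurement interval. */
typedef struct Sample {
    unsigned long long context_switches;
    unsigned long long system_calls;
    unsigned long long transactions;
} Sample;

/* Normalizes a counter by the number of transactions completed. */
double per_transaction(unsigned long long count, unsigned long long transactions)
{
    return transactions ? (double)count / (double)transactions : 0.0;
}

/* Returns 1 if either per-transaction rate in test exceeds factor times
   the corresponding rate in baseline. The choice of factor is an assumption;
   the text only says the numbers should not change dramatically. */
int looks_contended(const Sample *baseline, const Sample *test, double factor)
{
    double base_cs = per_transaction(baseline->context_switches, baseline->transactions);
    double base_sc = per_transaction(baseline->system_calls, baseline->transactions);
    double test_cs = per_transaction(test->context_switches, test->transactions);
    double test_sc = per_transaction(test->system_calls, test->transactions);

    return test_cs > factor * base_cs || test_sc > factor * base_sc;
}
```

For example, a baseline taken on one processor could be compared against a run on four processors under the same workload. Normalizing by transactions keeps a higher throughput from being mistaken for contention.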