DR. GUI Online
December 15, 1997
Dr. GUI
MSDN Online
Tired of J/Direct yet? So's the doctor. It's wonderful stuff, but four (!) weeks of it (see "Going Native with J/Direct") is quite enough, huh? Dr. GUI must say, however, that he greatly enjoyed the opportunity to see how J/Direct bridges the sometimes nontrivial gaps between Java and Windows—even if you don't program in Java, studying it helps your understanding of Windows.
Anyway, Dr. GUI was reading his e-mail one day when he came across an interesting document from some folks on the Windows NT team. The document contains a set of guidelines for making applications load and execute more quickly under Microsoft Windows NT 5.0. While the tips are for Windows NT 5.0, many of them apply equally well to earlier versions of Windows NT and Windows 95.
The good doctor thought that some of you could use these prescriptions, so he asked the folks at Windows NT if he could edit them up a bit and publish them. They gave permission, so here's the list.
This list isn't comprehensive; rather, the intent is to provide developers with a short list of "do's" and "don'ts." You'll certainly know other tricks for improving your program's performance.
Many of the tips attempt to reduce either your working set size or the number of pages your application touches. Reducing both of these makes your application faster (perhaps much faster in low-memory situations) because the operating system will have to do less swapping—if the applications are smaller, you don't need as much virtual memory.
The first recommendation is kind of hard-core, but if you spend some effort following it for your programs, you'll get some improvements in both speed and load time. If the effort needed to implement it for your program is high, be sure to do some quick tests to convince yourself that the performance gain is worth the pain.
Instead of C Runtime Library routines, use the routines exported by NTDLL. There is no performance hit for using these routines, since NTDLL is already in the address space of every process on the system anyway. Linking with MSVCRT.LIB is okay. However, for better efficiency, you shouldn't link with LIBC.LIB or LIBCMT.LIB.
In general, using Win32 routines instead of C Runtime routines is a great idea for all Win32 platforms because it reduces your working set size, making your application faster because of reduced swapping. This is especially important if your application consists of more than one module, since each module would get its own copy of each statically linked C Runtime routine it uses. (Using a common copy in a DLL is much more efficient in terms of working set.) The only tradeoff is that such code is somewhat less portable than it might be otherwise.
Loading and linking a DLL is expensive (time consuming). And the DLLs that you link with implicitly are normally loaded and linked at the worst possible time: before any of your code executes—in other words, while your user is sitting and waiting for your application to start.
So if you link to a DLL in which you only use a few functions—or if you rarely actually use the DLL—then it makes sense to postpone loading and linking it. Today, on all Win32 systems, you can do this by loading it at runtime using LoadLibrary() and GetProcAddress(). Or if you're using Windows NT 5.0, note that the Windows NT 5.0 linker has a new "Delay Load" capability.
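Runtime linking on any Win32 system can be sketched as follows. This is a minimal illustration, not a full delay-load implementation: VERSION.DLL and GetFileVersionInfoSizeA are real Win32 names, but the wrapper function GetVersionInfoSizeLazily is hypothetical, invented for the example.

```c
/* Sketch: postpone loading a rarely used DLL until its first use.
   The wrapper name is illustrative; the DLL and export names are real. */
#include <windows.h>

typedef DWORD (WINAPI *PFN_GETSIZE)(LPCSTR lpFilename, LPDWORD lpdwHandle);

DWORD GetVersionInfoSizeLazily(LPCSTR filename)
{
    /* Load the DLL only when the feature is first used, not at startup. */
    static HMODULE hVersion;
    static PFN_GETSIZE pfnGetSize;

    if (hVersion == NULL) {
        hVersion = LoadLibrary("VERSION.DLL");
        if (hVersion == NULL)
            return 0;   /* DLL not available; degrade gracefully */
        pfnGetSize = (PFN_GETSIZE)GetProcAddress(hVersion,
                                                 "GetFileVersionInfoSizeA");
    }
    if (pfnGetSize == NULL)
        return 0;       /* export not found */

    DWORD dwHandle = 0;
    return pfnGetSize(filename, &dwHandle);
}
```

The static handle means the LoadLibrary cost is paid once, on first use—users who never touch the feature never pay it at all.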
For any Win32 DLL, the DllMain function in your DLL will be called numerous times with a "reason" parameter set to a value that indicates why it was called. First, it's called with DLL_PROCESS_ATTACH once for each process that attaches to the DLL. In this situation, you'll often want to initialize some of the DLL's data structures. It's also called with DLL_PROCESS_DETACH when the process detaches from the DLL. Although it's less common, it's possible to use this opportunity to do cleanup, such as freeing memory or closing file handles and other system resources.
However, DllMain is also called for each thread that attaches to (DLL_THREAD_ATTACH) or detaches from (DLL_THREAD_DETACH) the DLL.
Most DLLs do not need to be notified of thread create/destroy operations. You could simply return for thread attaches/detaches, but that would mean paging in DllMain each time a thread attaches or detaches, thereby increasing your working set. And remember, a bigger working set means more disk thrashing, which means slower performance.
So if your DLL doesn't need thread notifications, call DisableThreadLibraryCalls() during process attach (DLL_PROCESS_ATTACH). If you must do work during a thread attach/detach, take care to minimize the number of pages referenced to keep the working set as small as possible.
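A DllMain along these lines might look like the following sketch. DisableThreadLibraryCalls is the real Win32 call; the initialization and cleanup work is left as placeholder comments.

```c
/* Sketch: a DllMain that opts out of per-thread notifications. */
#include <windows.h>

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
{
    switch (fdwReason) {
    case DLL_PROCESS_ATTACH:
        /* We do no per-thread work, so tell the loader never to call us
           with DLL_THREAD_ATTACH or DLL_THREAD_DETACH. */
        DisableThreadLibraryCalls(hinstDLL);
        /* ... initialize per-process data structures here ... */
        break;

    case DLL_PROCESS_DETACH:
        /* ... free memory, close handles, and so on ... */
        break;
    }
    return TRUE;
}
```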
If you tell the compiler that a variable is constant by using the const keyword, two things can happen. First, if the address of the constant is never taken, it can be eliminated entirely from your variable space. Second, even if the variable's address is taken, the compiler will put it in a read-only page of memory. Either of these means that there will be fewer read/write (more properly, copy-on-write) pages—and that those pages will be more densely packed with memory that's actually changing over time. This can speed up paging tremendously because read-only pages don't have to be written to disk—they can simply be discarded and reloaded from the original image file when they're needed.
In DLLs, since every process gets its own set of read/write and copy-on-write pages, properly declaring constants saves copy-on-write pages in proportion to the number of processes that use the DLL, literally multiplying your savings. (In a DLL, the read/write pages are initially marked read-only and are shared among all processes that use the DLL. It's only when you write to a page that a new copy of the page is made for the process that did the write. Copying the page is time-consuming, so it's best to minimize it.) This applies to all Win32 applications, but especially to DLLs.
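As a small portable illustration of the idea (the table and function here are made up for the example), declaring both the pointer array and the characters it points to as const lets the compiler place the whole table in read-only pages:

```c
/* Sketch: a lookup table that never changes.
   Without const, g_days would sit in a read/write (copy-on-write) data
   page in every process. With const on both the pointers and the strings,
   the compiler can put the whole table in a read-only section, whose
   pages are shared and simply discarded under memory pressure instead
   of being written to the swap file. */
static const char *const g_days[] = {
    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
};

const char *day_name(int day)
{
    return (day >= 0 && day < 7) ? g_days[day] : "???";
}
```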
Many global variables are set during DLL initialization to values that could well have been set at compile time. Since your initialization writes to the pages, the pages will have to be copied on the first write—a rather expensive process. To avoid this, statically initialize the global variables to the correct values and be sure not to reinitialize them later on. Again, this applies to all Win32 DLLs.
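A before-and-after sketch of the point (the table and values are invented for the example)—the "bad" version dirties its pages at load time in every process, while the "good" version bakes the values into the image:

```c
/* Bad: every process pays a copy-on-write fault for each page this
   loop touches at startup:

   int g_scale[4];
   void init(void) { for (int i = 0; i < 4; i++) g_scale[i] = i * 4; }
*/

/* Good: the values are computed at compile time and live in the image.
   Since they never change afterwards, const makes the page read-only
   and shareable as well. */
static const int g_scale[4] = { 0, 4, 8, 12 };

int scaled(int i)
{
    return g_scale[i & 3];
}
```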
This tip applies only to Windows NT services. Each service should have no more than one remote procedure call (RPC) endpoint/protocol. If you're merging two services into one, use named pipe aliases to allow old clients to connect. Service start/stop code should simply register/unregister RPC interfaces.
It's fairly common to set up your own private heaps for particular routines or data structures. While doing this can help you touch fewer pages as you access the data structure (since the data structure and nothing else will be on this heap), keep in mind that indiscriminate use of private heaps will result in many poorly utilized heap pages. Be sure to test different strategies before you commit to a design. This applies to all Win32 programs.
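One common pattern is to give a single data structure its own heap, sketched below with the Win32 heap functions. HeapCreate, HeapAlloc, and HeapDestroy are real Win32 calls; the Node type and wrapper names are illustrative.

```c
/* Sketch: a private heap so that list nodes cluster on their own pages. */
#include <windows.h>

typedef struct Node { struct Node *next; int value; } Node;

static HANDLE g_nodeHeap;

BOOL InitNodeHeap(void)
{
    /* flags 0, initial size 0, max size 0 => growable heap, default options */
    g_nodeHeap = HeapCreate(0, 0, 0);
    return g_nodeHeap != NULL;
}

Node *AllocNode(void)
{
    return (Node *)HeapAlloc(g_nodeHeap, HEAP_ZERO_MEMORY, sizeof(Node));
}

void DestroyNodeHeap(void)
{
    /* Frees every node in one call--no need to walk the list. */
    HeapDestroy(g_nodeHeap);
}
```

A side benefit of this layout: tearing down the whole structure is a single HeapDestroy instead of a free per node.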
This is a huge problem for all Win32 programs.
By and large, most Win32 systems are memory-bound, meaning that the systems are often running in a state where the amount of memory needed by the programs running exceeds the amount of physical memory present on the machine. In other words, they use a lot of virtual (disk) memory.
Each time the system needs a page of memory and none is available, it has to pick a page to swap out. If that page is a read/write page, it has to be written to disk so it can be restored later. So we'll have one disk operation to write the page, and one to read it in later.
But even if the page can be discarded and loaded from the original file later, you have one disk operation to read the page when it's needed next.
If you stop to consider that each disk operation takes thousands of times longer than a simple memory access, it becomes clear why it's so important to work so hard to avoid large working sets and swapping. It's also clear why the quickest and best performance improvement for most Windows machines is simply adding more RAM.
By the way, when we have a paging system, we can't really call it RAM anymore, can we? RAM, as you'll recall, stands for "random access memory." Random access means that you can access any address in exactly the same amount of time. But that's not true for a paging system: the amount of time varies radically depending on whether the page is present or not. In fact, even systems that have caches are not really random access, since the access time depends on whether the data being accessed is in the cache or in RAM (or on disk, if you're using paging and caching, as most computers do).
But it's not only the number of pages you access that makes a difference, it's also the pattern you use to access them. For instance, let's say that we have to write a value to four megabytes (1024 4K pages) of memory. And let's say that there are only 512 pages free.
Compare what happens if we write each page completely before moving to the next to what happens if we write one DWORD in the first page, one DWORD in the second page, and so forth. In the first case, we'd generate 512 page faults. In the second, we'd generate about a million (nearly 1024 squared) page faults. If each page fault takes 10 milliseconds (ms) to service, the running time of the program goes from about five seconds (512 times 10 ms is 5120 ms) to about 2.75 HOURS (1,000,000 times 10 ms is 10,000,000 ms, or 10,000 seconds; that is, 167 minutes). Now Dr. GUI may not have used quite accurate numbers here, and this is the worst possible case, but you can see that minimizing how many times each page is accessed can make a huge difference in your program's running time. (Or you could have added more memory to avoid paging.) So the key to minimizing page swaps is to have good "locality of reference," meaning that accesses during a short period of time tend to be bunched on a small group of pages rather than scattered over many pages.
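The two access patterns can be made concrete with a small portable sketch (the buffer is scaled down from the 1024-page example above, and the function names are made up). Both functions store exactly the same data; only the order of the stores differs:

```c
/* Sketch: same work, two very different page-access patterns. */
#include <stdlib.h>

#define PAGE_BYTES 4096
#define NPAGES     64    /* small stand-in for the 1024 pages above */
#define TOTAL      (NPAGES * PAGE_BYTES / sizeof(unsigned))

/* Good locality: finish each page before moving to the next, so under
   memory pressure each page faults in (at most) once. */
void fill_sequential(unsigned *buf)
{
    for (size_t i = 0; i < TOTAL; i++)
        buf[i] = 1;
}

/* Bad locality: write one word per page, then wrap around and do it
   again. Every store lands on a different page, so with fewer free
   pages than NPAGES, nearly every store can fault. */
void fill_strided(unsigned *buf)
{
    size_t words_per_page = PAGE_BYTES / sizeof(unsigned);
    for (size_t pass = 0; pass < words_per_page; pass++)
        for (size_t page = 0; page < NPAGES; page++)
            buf[page * words_per_page + pass] = 1;
}
```

On a machine with plenty of free memory the two run at similar speeds; the difference only shows up when the working set exceeds physical memory—which, as noted above, is the common case.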
So what does this mean to you? Well, one thing: trees and linked lists tend to have very bad locality of reference, especially if they're not allocated carefully. For instance, let's say we allocate one node in a linked list, then a 4K string, then the next node, then another 4K string, and so on. Each node will land in a different page, so traversing the list will be very much like the bad case above. Again, this is a degenerate case, but it makes the point: the list traverses most efficiently when its nodes are stored in order on pages that contain nothing but list nodes.
With a little bit of planning, you can improve this situation. The simplest method is to allocate nodes in chunks (n elements at a time, perhaps with a scheme for suballocations). When you do this, at least each chunk will be in as few pages as possible—and doing this can give you considerable savings in page faults and therefore execution time.
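A chunked allocator along these lines might be sketched as follows (the Node type, chunk size, and function names are all illustrative). Each malloc grabs room for 128 nodes at once, so consecutively allocated nodes sit next to each other in memory—and therefore share pages:

```c
/* Sketch: allocate list nodes in chunks to improve locality. */
#include <stdlib.h>

#define CHUNK 128

typedef struct Node { struct Node *next; int value; } Node;

static Node *g_free_list;   /* nodes carved out but not yet in use */

Node *alloc_node(void)
{
    if (g_free_list == NULL) {
        /* One allocation covers CHUNK nodes, packed contiguously;
           thread them onto the free list. */
        Node *block = malloc(CHUNK * sizeof(Node));
        if (block == NULL)
            return NULL;
        for (int i = 0; i < CHUNK - 1; i++)
            block[i].next = &block[i + 1];
        block[CHUNK - 1].next = NULL;
        g_free_list = block;
    }
    Node *n = g_free_list;
    g_free_list = n->next;
    n->next = NULL;
    return n;
}

void free_node(Node *n)
{
    /* Return the node to the free list for reuse. */
    n->next = g_free_list;
    g_free_list = n;
}
```

(A real allocator would also track the chunks themselves so they can eventually be released; that bookkeeping is omitted here for brevity.)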
Certainly there are many other methods for speeding up your applications, but these are important—and sometimes subtle. Try a few of these and see if you can't notice a big difference—particularly on a memory-starved system (like the 16-megabyte (16-MB) systems running Microsoft Internet Explorer version 4.0 that all your users are using).
Dr. GUI offers his thanks to: Michael Fortin, Mario Goertzel, Gurdeep Singh Pall, Jerry Shea, and Richard B. Ward.