October 1996
Matt Pietrek is the author of Windows 95 System Programming Secrets (IDG Books, 1995). He works at NuMega Technologies Inc., and can be reached at 71774.362@compuserve.com. QWay back in your July 1994 column, you wrote about shrinking the size of a 16-bit executable for Borland C++ 4.0. Do you have any suggestions for doing the same for version 5 with a 32-bit executable written in straight C? You can imagine my horror when I compiled a program consisting only of a MessageBeep call and discovered it was approximately 32KB! Fred Bulback Via CompuServe AIn my 32-bit Liposuction article this month, I describe some techniques for reducing the size of executable (EXE and DLL) files. However, Fred's question addresses a different issue. Specifically, even the smallest of C/C++ programs ends up bigger than necessary because of the overhead imposed by the runtime library (RTL). Often a simple program doesn't need all of the compiler's RTL functions to run and you end up with unneeded code in your executables. For these programs, it's possible to replace various pieces of the RTL code and save quite a bit of space. In this column I'll show where many 32-bit executables are bigger than they have to be, and show you a way to slash their sizes dramatically. I don't mean to be hard on the runtime libraries. They do a large amount of grunt work for you. My point in this column is to show that there's nothing magical about the runtime library. It's just code, and as such, you can replace it with other code. By writing your own custom initialization code, you can remove references to many of the routines that are linked into all C or C++ executables. When you remove the references to those functions, the linker in turn won't include them in the final executable, and your executable will be smaller as a result. Whenever I charge the windmills of executable file bloat, someone inevitably tells me that executable size doesn't matter. After all, hard-drive space is cheap. More importantly, on most hard drives the minimum space that a file takes up is 8KB or 16KB. Thus (as the logic goes), it's not worth trying to make your programs smaller than 8KB. I have three rebuttals to these arguments. First, in small programs most of this space is taken up by RTL code. If you don't need all this code, why waste the time to load and execute it? Second, while hard drives have large cluster sizes, floppies don't. If you distribute your code on floppy disk, cutting down the size of your executables may let you ship on fewer disks. Third, more and more software is being distributed via the Internet. Why force users with 14.4K modems to wait for useless code to download? Let's look at the simplest possible C program and see where unnecessary code is being pulled in. The smallest possible program is arguably the following: First, let's compile the program with Visual C++¨ 4.1 and try to make it as small as possible. The command line should produce an executable that's optimized for size (the /01 switch) and that doesn't contain any debug information or padding from incremental linking. The /Fm switch tells the compiler to generate a MAP file. The resultant EXE from this compile is 11,264 bytes. If you were to examine TEST.OBJ, you'd see that function main only contributes three bytes of code. Obviously, something else is taking up 11KB in the file. Figure 1 shows a portion of TEST.MAP, which gives you a clue what's included in TEST.EXE and using up 11KB. I won't describe everything in TEST.MAP, but a couple of space wasters are worth noting. All the functions with "unwind" in the name are related to C++ exception handling or Win32¨ structured exception handling. The same goes for the __XcptFilter and __except_handler3 functions. I didn't see any exception handling in my function main, did you? Also in TEST.MAP, you'll find _malloc, _free, and various other heap routines. Did you see any memory allocation in function main? I didn't think so. The __crtGetEnvironmentStringsA function retrieves the process's environment and breaks it into individual components (for instance, "PATH=..."). Once again, function main didn't do anything with environment strings, yet code to manipulate them ended up in the EXE. Hmm.... Let's look at Borland C++ 5.0 and see if it's any better. On the same TEST.C program, let's use this command line: This tells the compiler to produce an executable optimized for size, without any debug information, and with a MAP file. This time, the resultant TEST.EXE is 32,768 bytes. Wow! Figure 2 shows just a sampling of lines from the TEST.MAP created with Borland C++. Looking at Borland C++'s TEST.MAP, you'll see several of the same space wasters that I found in the Visual C++ executable, like routines for exception handling and memory allocation. What's more, the Borland C++ executable includes code for C++ runtime type information (RTTI), as well as many STDIO.H "FILE *" routines (fputs, fflush, and so on). Also, there is code for sprintf-style string formatting and floating-point math support. With all these routines, it's not hard to see why the executable is 32KB. The bigger question is, "Why do you need all this runtime library code for such a simple program?" Some of you may be thinking "Why not use the runtime library DLL that comes with the compiler?" Microsoft names this MSVCRTxx.DLL, while Borland's is named CW32xxxx.DLL. The xs correspond to the particular version number of the DLL. These DLLs remove the RTL code from your executables and put it in a DLL shared by other executables. While using the RTL DLL is certainly a viable solution for medium to large programs, it's overkill for tiny programs. For starters, code is code. It doesn't matter where you put the RTL code, it's still going to load and execute and take time to do so. More importantly, by linking to the RTL DLL, your executable won't run if the DLL isn't around. If you plan to distribute your program, you'll almost certainly need to also distribute this large RTL DLL. You can't assume that the DLL will be present on everybody's system. "Wait," you say, "the standard RTL code that my compiler supplies performs many important tasks like calling static constructors and destructors. It also sets up things so that you can use functions like malloc transparently. If I remove that initialization code, won't critical RTL functions stop working?" Yes. However, this isn't necessarily the end of the world. Let's break up the RTL code into categories to see why not. Internal initialization and shutdown functions. In some cases (such as argc/argv processing), you can replace the standard code with your own smaller version that doesn't reference other RTL functions. In other cases, you may not need a particular language feature. For instance, if you don't have any static constructors or destructors, it doesn't matter if your replacement startup code doesn't call them. User-callable functions that require initialization before they can be used, but are easily replaced. For instance, malloc assumes that the startup code has initialized the heap. However, the malloc function can easily be replaced with the Win32 HeapAlloc function, which doesn't require any initialization. Large routines that have direct (or nearly direct) equivalents in the Win32 API. For example, in many cases the Win32 wsprintf function can be used as a replacement for the C sprintf function, saving quite a bit of space in your executable. The heap functions fall into this category as well as the preceding category. Relatively small RTL functions that don't reference other functions. An example of such a function would be strchr, which finds the first occurrence of a character within a NULL-terminated string. It's perfectly OK to use the compiler-supplied versions of these functions. Why reinvent the wheel? Large routines with no equivalent Win32 functionality. If your code needs to use functions like scanf, you'll have to bite the bullet and absorb whatever size hit they impose. My point in breaking apart the RTL is to show that, for moderately simple programs, there are various types of RTL code that you can replace to save space in your executable. The key is to do as little work as possible. If it looks like a major hassle to replace something, or if required functionality won't be there, don't do it-go ahead and use the standard RTL code. I've found that many of my programs make few demands on the RTL, and are surprisingly easy to shrink using a library I've written. To show you how easy it is to shrink executables, I've created what I call TINYCRT. With just a little effort, I was able to make both Borland and Microsoft versions. A friend of mine has even used the Microsoft version of TINYCRT on the DEC Alpha. I want to point out up front that TINYCRT isn't something that you'll always want to use. It has significant restrictions and limitations. If your program can live within these confines, though, you can achieve serious size reductions. As currently implemented, TINYCRT doesn't initialize your static constructors or destructors. If you use threads and runtime library routines that need per-thread synchronization (such as strtok), these routines won't work properly. Based on reading the runtime library code, it's possible that some types of exception handling may not work, although I didn't have any trouble with exception handling in my testing. In simple terms, TINYCRT is good primarily for small, single-threaded applications that don't use static constructors and destructors. If your program doesn't work with TINYCRT, don't use it! On the other hand, quite a few programs work well with TINYCRT. I went back and switched several of my sample programs from prior MSJ columns and articles to use TINYCRT and found a 12KB savings to be the norm. Put another way, I was routinely creating 2KB or 3KB executables out of my old utilities. The heart of TINYCRT lies in replacing the standard RTL startup code that comes with your compiler. The compiler-supplied startup code goes through all sort of gyrations to initialize various things, which in turn causes the linker to bring in other RTL routines. This is what causes the 11KB Microsoft executables or the 32KB Borland executables. The TINYCRT startup code does the bare minimum: just enough to call your main or WinMain routine. Figure 3 shows CRT0TWIN.C, which is the startup code for GUI applications. Note that the only routine in the file is called WinMainCRTStartup. This is the name that the Microsoft linker uses when deciding where a GUI executable's entry point will be. The first byte of WinMainCRTStartup is where the operating system transfers control to begin execution of the program. In WinMainCRTStartup, the code first retrieves the process's command line and scans past the filename portion at the beginning. This pointer becomes the lpszCommandLine argument to your WinMain function. Next, the code calls GetStartupInfo to get the information needed for the nCmdShow argument to WinMain. A call to GetModuleHandle retrieves the executable's HMODULE. With all this information in hand, WinMainCRTStartup calls whatever WinMain function you've defined. After WinMain returns, WinMainCRTStartup code calls ExitProcess, passing whatever value WinMain returned as the exit code. Pretty easy! If you've ever looked at the startup code for 16-bit Windows¨ apps, you might remember that they had to call quasi-documented system functions like InitTask, InitApp, and WaitEvent in the right sequence before you could do anything. In comparison, the bare startup code for a Win32 app is wonderfully simple. For console applications, the startup code is a little different. Figure 4 shows CRT0TCON.C, which is my minimal version. For C and C++ console applications, the Microsoft linker looks for a function called mainCRTStartup and uses that as the entry point. The only real job for mainCRTStartup is to call your main function. However, main takes three arguments: argc, argv, and env. Strange as it may seem, the operating system doesn't hand you argc, argv, and env on a silver platter. It's the job of the runtime library code to parse the command line and environment into nice tokens. Since I rarely see the env ("environment") argument used, I punted and didn't implement any support for it. For argc and argv processing, mainCRTStartup calls a routine I wrote called _ConvertCommandLineToArgcArgv. I'll describe this function momentarily. To finish describing my minimal startup code, I had to cheat a tiny bit for Borland C++. Borland's TLINK32 uses a different method to locate the address of the executable's entry point, meaning that it won't automatically use mainCRTStartup or WinMainCRTStartup as the entry point. It would be messy to explain exactly what TLINK32 does and how I dealt with it. Suffice it to say that I wrote two very minimal assembler files, one for GUI apps and another for console applications. C032CON.ASM contains just a JMP to mainCRTStartup, while C032WIN.ASM is just a JMP to WinMainCRTStartup. For the benefit of Borland C++ users without an assembler, I've included the corresponding OBJ files in the download file. Turning to the argc and argv processing, you'll find my minimal version in ARGCARGV.C (see Figure 5). I could have used the argc and argv processing that comes with the compiler's RTL. However, these versions call several other RTL routines that cause the linker to bring in quite a bit of additional code. I was able to write my own version in a little over 256 bytes of code. It won't always give you the identical results to what you'll get from your compiler's RTL, but it's pretty good for your typical command line requirements. As a side benefit to writing my own argc/argv processing, you can use the _ConvertCommandLineToArgcArgv in your GUI applications. That's right! If you've ever wanted argc/argv-style arguments in a GUI program, include ARGCARGV.H in your source and call the _ConvertCommandLineToArgcArgv function. The return value is the number of arguments (that is, argc). The _ppszArgv global variable is an array of string pointers to each argument. The remainder of the TINYCRT files are replacements for some common RTL routines that come with your compiler. Why replace your compiler's RTL routines? In many cases, you can take advantage of Win32 system functions to write smaller versions of the routine in question. In other cases, the compiler-supplied version of a routine may be overkill, and you can replace it with something that's good enough for your requirements. For example, many functions that work with characters and strings (such as strcmpi) are actually multibyte-enabled (MBCS). The code to implement true multibyte support in the RTL is relatively large. If your code will always use single-byte (ANSI) characters, it's worthwhile to prevent the multibyte code from being linked in. If you use TINYCRT, you may want to replace additional RTL functions beyond what I've implemented here. To replace a function, simply write your own version, add it to the list of OBJ files in the appropriate makefile (TINYCRT.MS or TINYCRT.BOR) and rebuild. The TINYCRT sources provide numerous examples. I won't describe every routine that I've replaced, but I will mention a few of the more interesting replacements here. The first routines I replaced when writing TINYCRT were printf and sprintf, since this family of functions takes up several KB of memory. The Win32 API has the sprintf and wvsprintf functions, which in most cases are perfectly adequate. The Win32 version of these functions doesn't handle floating point values or some of the more esoteric formatting that the ANSI specification allows. But, the Win32 routines have almost always been good enough for the programs I write. Implementing printf is a straightforward affair. The code in PRINTF.C (see Figure 6) passes its input arguments to wvsprintf, which formats them into a buffer. Next, I call WriteFile, passing it the string that wvsprintf has formatted. Since printf writes its output to stdout, I need the file handle for stdout. A call to the function GetStdHandle returns exactly the handle I need. The printf replacement is an excellent example of how a standard RTL can be replaced with a series of Win32 system calls and a little glue code. The next set of routines that TINYCRT replaces are the standard memory allocation functions (malloc, free, new, delete, and so on). Like printf and friends, the memory- management functions typically take up several KB of memory. The Win32 API's HeapAlloc family of functions substitutes quite nicely for the C/C++ heap routines. For instance, my implementation of malloc is dead simple: It's important to point out here that using HeapAlloc will almost always be slower than using the compiler's built-in memory allocator. If performance is an issue, by all means use your compiler's built-in heap routines. The purpose of TINYCRT is to produce the smallest executables possible, not the fastest. Incidentally, Visual C++ 4.1 uses HeapAlloc to implement malloc and new, but Visual C++ 4.2 went to a different scheme to improve allocation performance. I replaced the remaining functions (strupr, strlwr, By implementing these locale-enabled functions with either Win32 equivalents or by rewriting them from scratch, you prevent the locale code from being linked in. In the case of Visual C++, an internal routine ( _isctype ) is used by quite a few RTL functions and drags in the locale code. The TINYCRT version of _isctype (ISCTYPE.C, in Figure 7) doesn't use locales so it can make the executable much smaller. The selection of routines that I replaced is entirely arbitrary. I came up with the list by linking many of my programs with TINYCRT and determining which RTL routines took up the most space in my executables. The linker MAP file is your friend in determining how much space each function uses. As I identified routines that could be replaced (for instance, printf, malloc, and strcmpi), I wrote replacements and added them to TINYCRT. No scientific method here. Without a doubt, there are additional functions that could be added to TINYCRT. For instance, the entire set of stdio FILE * functions (fread, fwrite, fprintf, and so on) would be a nice addition. Bear in mind that the compiler-supplied versions of these functions implement sophisticated caching to improve performance. Anything you write in a couple of lines of code most likely won't be as fast. Again, this comes back to my point that TINYCRT is supposed to create small executables. If it's not fast enough for your needs, or if it can't handle all of your program's requirements, don't use it. You're no worse off than you were in the first place. If you use Visual C++ (or a compatible compiler), TINYCRT is extremely easy to use. I deliberately wrote it so that you don't have to change any of your source code. It's very easy to switch between using and not using TINYCRT-this way, you can make sure everything is working correctly, and also see if you succeeded in decreasing the size of your EXE. The TINYCRT library for Visual C++ is called LIBCTINY.LIB and is built from TINYCRT.MS. If you want a debug version of the code, just define DEBUG=1 on the NMAKE command line and rebuild. Using LIBCTINY.LIB couldn't be simpler. Just add it to the list of library files in your project or makefile. If you explicitly link to LIBC.LIB or LIBCMT.LIB, make sure LIBCTINY.LIB appears before those files in the library list. For ease of use on my system, I've put LIBCTINY.LIB in the LIB directory where the other Visual C++ libraries live. If you're a dinosaur like me and still build programs entirely from the command line, LIBCTINY.LIB can be specified on the compiler command line: If you don't see the size improvements that you'd expect from TINYCRT with Visual C++, I'll let you in on something I learned the hard way. Both LIBCTINY.LIB and the compiler-supplied LIBCxx.LIB implement versions of printf, malloc, and so forth. The linker has to pick just one version of each referenced function and include it in the executable. When resolving references, LINK searches the libraries in the order they were presented to the linker. Default libraries like LIBC.LIB are searched last. Once LINK selects a routine from a particular library, that library effectively moves to the head of the library search order. (See KnowledgeBase article Q31998 for a more detailed description.) Thus, if your program uses routines from the default compiler RTL, the default RTL version of other functions may also be used even though LIBCTINY.LIB has a version of the function. If you don't live and breathe linkers, the above paragraph probably sounded like Greek. If it made sense to you, you know your linker rules and there's some additional help that LINK can offer when you're trying to figure out which version of a function was linked in, and why. Check out the /VERBOSE linker switch. It may take a little while to figure out exactly what the output tells you, but I found it very helpful in understanding why the standard LIBC version of functions were being used even though my LIBCTINY had its own version of the function. When using LIBCTINY.LIB, you may encounter a linker error like this: This is to be expected, given the Byzantine library search process described above. You can work around this by using the /FORCE:MULTIPLE switch. The things you gotta do sometimes. To show a small example of linking with LIBCTINY.LIB I wrote the TEST program, which exercises some of the functionality that TINYCRT implements. The source file is TEST.CPP, and the TEST.MS file is a small makefile that builds TEST using LIBCTINY.LIB. These files aren't listed here, but can be downloaded from http://www.msj.com. The Borland C++ version of TINYCRT is called CW32TINY.LIB and is built from TINYCRT.BOR. A debug version can be built by defining DEBUG=1 on the MAKE command line. The CW32TINY.LIB file should go ahead of CW32.LIB or CW32MT.LIB in TLINK32's library list. In addition to including CW32TINY.LIB in the library list, you'll also need to replace C0W32.OBJ or C0X32.OBJ. If you have a GUI application, replace C0W32.OBJ with C032WIN.OBJ. Otherwise, for console mode programs, replace C0X32.OBJ with C032CON.OBJ. To demonstrate a minimal example of a Borland C++ program using TINYCRT, I made the TEST program (mentioned above) buildable using Borland C++ 5.0. To build it, run MAKE on the TEST.BOR makefile. To see how CW32TINY.LIB and the C032xxx.OBJ files are used, check out the contents of TEST.BOR. TINYCRT isn't a panacea. It has a very specific goal: to make those little programs you create as small as they possibly can be. In writing TINYCRT, size and ease of use always took precedence over speed and 100 percent ANSI compatibility. However, I found that many of my everyday programs are able to use TINYCRT without a hitch. The beauty of TINYCRT is that you're free to implement and add whatever you want to it! I'd like to thank John "Senior Boy" Robbins for his help with TINYCRT. John wrote a couple of the RTL replacement routines and was my beta tester. He also was my sounding board and offered numerous suggestions. Have a question about programming in Windows? Send it to Matt at 71774.362@compuserve.com Size Does Matter
So Why Are Executables Fat?
// TEST.C
int main()
{
return 0;
}
CL /O1 /Fm TEST.C
BCC32 -O1 -M TEST.C
Breaking It Down
The TINYCRT Library
RTL Routines
return HeapAlloc( GetProcessHeap(), 0, size );
atol, atoi, strcmpi, stricmp, and so on) to eliminate what I call the "locale" hit. All of these functions (as well as many others that I didn't replace) are locale-enabled, which means they'll work with double-byte character sets. The code to implement locale support takes up a good chunk of memory. If you only use single-byte characters (that is, the ANSI character set), this code is never executed and is wasted space. Using TINYCRT with Visual C++
CL /O1 TEST.CPP LIBCTINY.LIB
<< start output >>
LIBC.lib(crt0.obj) : error LNK2005: _mainCRTStartup
already defined in libctiny.lib(CRT0TCON.OBJ)
test.exe : fatal error LNK1169: one or more multiply
defined symbols found
<<end output >>
Using TINYCRT with Borland C++
Wrap Up