Compound File Defragmentation

Because compound files inherently support incremental saves, the physical size of a compound file on disk will typically be larger than necessary. The size of the file is determined by the span between the first and last sectors that the file uses, not by the number of sectors that hold live data. This is like calculating free space on your hard disk by using the location of the first and last files stored on it instead of by the number of actual unused sectors: with this method, you could have two 1-KB files on a 1-GB disk, but because they are located at opposite ends of the drive, the disk is considered full.

Although this does not actually happen on hard disks, it can happen within the confines of a storage hierarchy. There might be plenty of unused space inside the storage medium itself, but the size of that medium, as reported by the operating system for something like a file, is defined by the first and last sectors used, regardless of the amount of internal free space. This means that the possibility of internal fragmentation and larger-than-necessary files (or other media) always exists, as shown in Figure 7-9.

Figure 7-9. A fragmented storage that takes up more room than necessary.
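
To make the disk analogy concrete, here is a minimal sketch in portable C++. It is only a model, not how OLE stores anything: the names kSectorSize, reportedSize, liveSize, and twoEnds are all hypothetical, and a storage medium is reduced to a list of flags saying which fixed-size sectors are in use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model only: one flag per fixed-size sector.
const std::size_t kSectorSize = 512;  // granularity assumed for this sketch

// Size as the text describes it: the span from the first used sector
// to the last used sector, whatever lies in between.
std::size_t reportedSize(const std::vector<bool>& used) {
    std::size_t first = used.size(), last = 0;
    for (std::size_t i = 0; i < used.size(); ++i) {
        if (!used[i]) continue;
        if (first == used.size()) first = i;  // remember the first hit
        last = i;
    }
    if (first == used.size()) return 0;  // nothing in use at all
    return (last - first + 1) * kSectorSize;
}

// Bytes in sectors that actually hold live data.
std::size_t liveSize(const std::vector<bool>& used) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < used.size(); ++i)
        if (used[i]) ++count;
    return count * kSectorSize;
}

// Hypothetical helper: a medium with only its first and last sectors used,
// like the two small files at opposite ends of the drive.
std::vector<bool> twoEnds(std::size_t sectors) {
    assert(sectors >= 2);
    std::vector<bool> used(sectors, false);
    used[0] = true;
    used[sectors - 1] = true;
    return used;
}
```

Calling reportedSize(twoEnds(100)) yields the full 100 sectors, while liveSize on the same map counts only the two sectors that hold data. The gap between the two numbers is the internal free space that the operating system never sees.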

The IStorage::CopyTo function will remove all the dead space in the process of copying the contents of one storage to another and will order the contents of streams sequentially, as shown in Figure 7-10.

Figure 7-10. IStorage::CopyTo removes dead space and orders stream contents within the destination storage.
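
The effect of the copy can be sketched with a small model. This is not the implementation of IStorage::CopyTo and it calls no OLE functions. The Sector type and the compact function are hypothetical, and grouping streams in letter order is an assumption made only so that the result is easy to inspect.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical model of one sector in a storage.
struct Sector {
    char stream;  // owning stream's letter, or 0 when the sector is free
};

// Builds a new layout that holds only live sectors, with each stream's
// sectors placed next to one another. Free sectors are simply not copied.
std::vector<Sector> compact(const std::vector<Sector>& src) {
    // std::map keeps its keys sorted, so streams come out in letter order.
    std::map<char, std::vector<Sector> > byStream;
    for (std::size_t i = 0; i < src.size(); ++i)
        if (src[i].stream != 0)
            byStream[src[i].stream].push_back(src[i]);

    std::vector<Sector> dst;
    for (std::map<char, std::vector<Sector> >::const_iterator it = byStream.begin();
         it != byStream.end(); ++it)
        dst.insert(dst.end(), it->second.begin(), it->second.end());

    assert(dst.size() <= src.size());  // a copy can only shrink the layout
    return dst;
}
```

Given a source laid out as A, free, B, A, free, B, the model returns A, A, B, B: the dead space is gone and each stream is contiguous, which is the outcome that Figure 7-10 depicts.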

The Fragmenter sample (CHAP07\FRAGMENT) illustrates this process. Compile and run this program. You won't see a main window. Instead, after a little while you'll see a message that says "FRAGMENT.BIN created." This means that Fragmenter has finished creating a compound file with 26 streams, each of which contains 5120 characters. The first stream, called Stream A, contains A characters, Stream B contains B characters, and so on. These streams are not written sequentially; rather, they are written 256 characters at a time through 20 iterations of the alphabet. When the first message appears, you can look at the contents of the file to see that each stream occupies essentially 10 separate sections of 512 characters each, because stream space is allocated at a 512-byte granularity. At this point, the file itself is 219,136 bytes.
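
The write pattern just described can be simulated to see why the streams end up interleaved. The sketch below is not the Fragmenter source. It assumes a deliberately simple allocator that appends a new 512-byte sector to the end of the file whenever a stream fills its current one, and it counts data sectors only, so it makes no attempt to reproduce the 219,136-byte file size. All names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kStreams = 26;   // Stream A through Stream Z
const int kChunk   = 256;  // characters written per stream per pass
const int kPasses  = 20;   // iterations of the alphabet
const int kSector  = 512;  // allocation granularity from the text

// Returns the order in which data sectors are claimed, one letter per sector.
std::vector<char> simulateWrites() {
    std::vector<char> sectors;
    std::vector<int> written(kStreams, 0);  // bytes written so far per stream
    for (int pass = 0; pass < kPasses; ++pass) {
        for (int s = 0; s < kStreams; ++s) {
            // A stream needs a fresh sector when it has none yet or when
            // its current sector is exactly full.
            if (written[s] % kSector == 0)
                sectors.push_back(static_cast<char>('A' + s));
            written[s] += kChunk;
        }
    }
    assert(written[0] == kChunk * kPasses);  // 5120 characters per stream
    return sectors;
}
```

Under these assumptions the model claims 260 data sectors in the order A, B, ..., Z, A, B, ..., Z, repeated 10 times. Each stream therefore owns 10 sectors, and any two consecutive sectors of the same stream are separated by one sector from each of the other 25 streams.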

Now close the message box, and after a short time Fragmenter will display the message "Space has been freed in FRAGMENT.BIN." After you closed the first message, Fragmenter deleted the streams C, E, G, H, J, M, N, T, and X, freeing a significant portion of the space in the file before closing it again. Now look at the binary contents of the file once more. You'll see that all the original information is there. What gives? OLE only marks the space occupied by those streams as unused; it doesn't bother overwriting their contents. (If you want deleted information to be secure, overwrite the stream before deleting it.) All the original information still exists, the file is the same size, and all that has changed is a few bytes marking blocks of data as used or unused.
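
The difference between marking space as unused and erasing it can be shown with another small model. Nothing here is OLE code: RawSector, makeFile, deleteStream, and scrubStream are hypothetical names, and the single inUse flag stands in for whatever bookkeeping a real compound file keeps.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one sector as it sits in the file.
struct RawSector {
    char owner;         // letter of the stream that wrote the sector
    bool inUse;         // allocation flag; deletion changes only this
    std::string bytes;  // raw contents
};

// Builds a file with one full sector per letter, for example "ABC".
std::vector<RawSector> makeFile(const std::string& letters, std::size_t sectorSize) {
    assert(sectorSize > 0);
    std::vector<RawSector> file;
    for (std::size_t i = 0; i < letters.size(); ++i)
        file.push_back(RawSector{letters[i], true, std::string(sectorSize, letters[i])});
    return file;
}

// Deleting a stream in this model clears the flag and touches nothing else.
void deleteStream(std::vector<RawSector>& file, char stream) {
    for (std::size_t i = 0; i < file.size(); ++i)
        if (file[i].owner == stream) file[i].inUse = false;
}

// Overwrites a stream's bytes; call this before deleteStream when the old
// contents must not remain readable in the file.
void scrubStream(std::vector<RawSector>& file, char stream) {
    for (std::size_t i = 0; i < file.size(); ++i)
        if (file[i].owner == stream)
            file[i].bytes.assign(file[i].bytes.size(), '\0');
}
```

After deleteStream, the file in this model has the same number of sectors and the deleted stream's characters are still in place. Only a scrubStream call beforehand actually removes them, which mirrors the advice to overwrite a stream before deleting it.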

Now close this second message box. After another pause, you'll see the message "Defragmentation complete on FRAGMENT.BIN." At this point Fragmenter created a new file, called the IStorage::CopyTo function to copy the storage contents to that new file, deleted the old file, and renamed the new file to FRAGMENT.BIN. If you look at the file again, you'll now see that not only are all the unused blocks (all the deleted character streams) gone, but also all the characters (all 5120 of each type) are sequential in the file. The file itself is now only 91,136 bytes.
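
The sizes quoted in this walkthrough can be checked with a little arithmetic. The sketch below uses only figures from the text (5120 characters per stream, 26 streams before deletion, 17 after, and the two file sizes). The function names are hypothetical, and the reading of the non-data bytes as header and bookkeeping space is an assumption rather than something the sample reports.

```cpp
#include <cassert>

const long kBytesPerStream = 5120;  // characters in each stream
const long kSectorBytes    = 512;   // allocation granularity

// Total stream data for a given number of streams.
long dataBytes(int streams) {
    return streams * kBytesPerStream;
}

// Bytes in the file that are not live stream data (presumably the header
// and bookkeeping structures, plus any unused space).
long nonDataBytes(long fileBytes, int streams) {
    assert(fileBytes >= dataBytes(streams));
    return fileBytes - dataBytes(streams);
}
```

With 17 streams remaining, dataBytes(17) is 87,040, so nonDataBytes(91136, 17) is 4096 bytes, or 8 sectors of 512 bytes. For the original file, dataBytes(26) is 133,120 and nonDataBytes(219136, 26) is 86,016 bytes, which gives a sense of how much of the fragmented file was not stream data.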

This process illustrates how to defragment any compound file. You can use this technique to compact files from your own application or perhaps build an end-user tool that will do the same.
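
The file handling around the copy (create a new file, copy into it, delete the old file, rename the new one) might be sketched as follows in portable C++. This is not the Fragmenter source. The compactInto callback is a hypothetical stand-in for creating the destination compound file and calling IStorage::CopyTo on it, the ".tmp" scratch name is an assumption, and dropDeadBytes is a toy callback that treats '.' characters as dead space so the sketch can be tried without OLE.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>

typedef std::function<bool(const std::string& src, const std::string& dst)> CompactFn;

// Compacts `path` in place: write a compacted copy, delete the original,
// then rename the copy to the original name.
bool defragmentFile(const std::string& path, const CompactFn& compactInto) {
    assert(!path.empty());
    const std::string temp = path + ".tmp";  // assumed scratch file name
    // The callback must close both files before it returns, just as real
    // code would have to release both storages before deleting or renaming.
    if (!compactInto(path, temp)) {
        std::remove(temp.c_str());  // discard a partial copy
        return false;
    }
    if (std::remove(path.c_str()) != 0) {
        std::remove(temp.c_str());
        return false;
    }
    return std::rename(temp.c_str(), path.c_str()) == 0;
}

// Toy stand-in for the copy step: copies every byte except '.'.
bool dropDeadBytes(const std::string& src, const std::string& dst) {
    std::ifstream in(src.c_str(), std::ios::binary);
    std::ofstream out(dst.c_str(), std::ios::binary);
    if (!in || !out) return false;
    char c;
    while (in.get(c))
        if (c != '.') out.put(c);
    out.close();  // flush now so that a write failure is reported
    return !out.fail();
}
```

Copying to a scratch file first means the original stays intact until the copy has succeeded. There is still a short window between deleting the original and renaming the copy, so a tool meant for end users might want to keep a backup until the rename completes.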