Patron Files with the Jitters

Patron is designed to manage a document composed of pages, with each page eventually managing any number of tenants: bitmaps, metafiles, compound document content objects, and controls. If we were to implement Patron using traditional file I/O, we would create a file with three primary structures: a file header, a list of page headers, and the tenant data itself, as illustrated in Figure 7-4.

Figure 7-4. A possible Patron file with traditional file I/O.
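To make that layout concrete, here is a minimal sketch in C++ of what the three on-disk structures might look like. All names and fields are hypothetical illustrations, not Patron's actual definitions.

    // Hypothetical on-disk structures for a traditional-I/O Patron file.
    // Names and fields are illustrative, not Patron's actual definitions.

    struct FILEHEADER
    {
        unsigned long cPages;       // Number of pages in the document
        unsigned long offFirstPage; // File offset of the first PAGEHEADER
    };

    struct PAGEHEADER
    {
        unsigned long fUnused;      // Nonzero if this page has been deleted
        unsigned long cTenants;     // Number of tenants on this page
        unsigned long offTenants;   // File offset of this page's tenant data
    };

    struct TENANTHEADER
    {
        unsigned long fUnused;      // Nonzero if this tenant has been deleted
        unsigned long dwType;       // Bitmap, metafile, object, control, ...
        unsigned long cbData;       // Byte count of the tenant data that follows
    };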

Certainly this sort of layout is manageable, albeit tedious. To write such a file, we would first write the file header, then all the page headers, and then all the tenants. Writing the file header is simple because we know its size and can easily calculate the offset of the first page to store in this header. Before writing the page headers, however, we need to build the entire list in memory to determine the total size of the page list and store the appropriate tenant offsets in each page structure. When we have this list, we write it all out to disk at once and then write each tenant in turn. Besides a little tedium in calculating the offsets, this code would be simple enough and performance would be good—all the pages are at the front of the file, and large seeks occur only when a particular tenant is accessed.
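In code, that write sequence might look like the following sketch, which assumes the hypothetical structures above and elides the tenant data itself; the point is the offset bookkeeping.

    #include <cstdio>

    // Sketch of a full sequential save using the structures above.
    // Order matters: header first, then the whole page list at once,
    // then the tenant data for each page in turn.
    bool FullSave(FILE *pf, FILEHEADER &fh, PAGEHEADER *rgph,
                  unsigned long cPages)
    {
        fh.cPages = cPages;
        fh.offFirstPage = sizeof(FILEHEADER);

        // Tenant data begins immediately after the complete page list,
        // so the whole list must be built in memory before writing.
        unsigned long off = sizeof(FILEHEADER)
                          + cPages * (unsigned long)sizeof(PAGEHEADER);

        for (unsigned long i = 0; i < cPages; i++)
        {
            rgph[i].offTenants = off;
            off += 0; // + total bytes of page i's tenant headers and data
        }

        if (1 != fwrite(&fh, sizeof(fh), 1, pf))
            return false;
        if (cPages != fwrite(rgph, sizeof(PAGEHEADER), cPages, pf))
            return false;
        // ...then write each page's tenants at rgph[i].offTenants.
        return true;
    }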

Let's say now that an end user deletes a page and then saves the file again. We have two alternatives: we can rewrite the entire file (typically the easiest option), or we can mark the structure for the deleted page as "unused" and mark its tenant structures as "unused" as well. The second option would not reduce the file size, but it would allow a fast save, a highly demanded feature in today's applications, so we opt for the second choice.
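A fast save of a deletion then touches only the records involved. A sketch, again using the hypothetical structures above:

    // Sketch of a fast-save deletion: tombstone the page header in
    // place instead of rewriting the file. Assumes pf was opened in
    // update mode ("r+b") and the structures sketched earlier.
    bool FastDeletePage(FILE *pf, const FILEHEADER &fh, unsigned long iPage)
    {
        long off = (long)(fh.offFirstPage + iPage * sizeof(PAGEHEADER));
        PAGEHEADER ph;

        if (0 != fseek(pf, off, SEEK_SET)
            || 1 != fread(&ph, sizeof(ph), 1, pf))
            return false;

        ph.fUnused = 1; // Mark the page "unused"...

        if (0 != fseek(pf, off, SEEK_SET)) // ...reposition, and rewrite
            return false;                  // just these few bytes.
        return 1 == fwrite(&ph, sizeof(ph), 1, pf);
        // A real implementation would tombstone each of the page's
        // tenant headers the same way.
    }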

Now the end user adds a page in the middle of the document and adds a few tenants to that page. The end user saves again, giving us the following three choices for where to put this page:

- Write the new page structure and its tenants at the end of the file, regardless of the page's position in the document.
- Write the new structures into "unused" blocks left behind by previously deleted pages, wherever they fit.
- Insert the new structures in their proper sequential positions and rewrite the entire file.

The first two options would allow an incremental fast save again, but they negate the idea that all the page structures would be stored sequentially in the file, introducing the complexities of a fragmented file. The third option has the potential to be horrendously slow, especially if some tenants (a true-color bitmap, for instance) have large amounts of data.
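The first option, for instance, might look like this sketch, still assuming the hypothetical structures above: the new page header lands wherever the file happens to end, far from its siblings.

    // Sketch of the "append at the end" fast save for a new page.
    // Returns the offset of the new page header, or -1 on failure.
    long FastAppendPage(FILE *pf, FILEHEADER &fh, const PAGEHEADER &ph)
    {
        if (0 != fseek(pf, 0, SEEK_END))
            return -1;

        long offNewPage = ftell(pf); // The new page lands wherever the
                                     // file currently ends.
        if (1 != fwrite(&ph, sizeof(ph), 1, pf))
            return -1;
        // ...followed by the new page's tenant headers and data.

        fh.cPages++; // Patch the file header in place.
        if (0 != fseek(pf, 0, SEEK_SET)
            || 1 != fwrite(&fh, sizeof(fh), 1, pf))
            return -1;

        // The caller must now track this offset somewhere, because the
        // page list is no longer contiguous.
        return offNewPage;
    }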

At this point, we perform the engineering exercise called "compromise" and weigh the possible options: we can have fast saves with file fragmentation, or we can have files efficient to read but slower than molasses to save. In a performance-driven market, we choose fast saves to get good timing reviews in trade magazines. Get a few developers to put in some overtime, and you'll have a wonderfully elaborate garbage-collection scheme for managing free space in your files as best you can, just as you would handle free space in any memory manager. We allow fast saves for performance and give the user the option of a full save, which rewrites the file completely, effectively defragmenting the whole thing. Cool. We just turned a problem into a couple of features by writing a great deal of file management code. Yes! Gives us the satisfaction of a Double Tall Macho Grande Latte.
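The free-space management alluded to above is essentially a first-fit allocator over the file. A bare-bones sketch under the same hypothetical layout; the full save that defragments everything is just the sequential rewrite sketched earlier.

    // Bare-bones first-fit scan for a reusable "unused" page slot,
    // sketching the free-space management such a scheme requires.
    // Returns the slot's file offset, or -1 if there are no holes.
    long FindFreePageSlot(FILE *pf, const FILEHEADER &fh)
    {
        for (unsigned long i = 0; i < fh.cPages; i++)
        {
            long off = (long)(fh.offFirstPage + i * sizeof(PAGEHEADER));
            PAGEHEADER ph;

            if (0 != fseek(pf, off, SEEK_SET)
                || 1 != fread(&ph, sizeof(ph), 1, pf))
                return -1;

            if (ph.fUnused)
                return off; // First fit: reuse this slot.
        }
        return -1; // No holes; append or fall back to a full save.
    }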

On an actual system, all of our garbage collection and defragmentation work buys us very little unless the end user defragments the hard disk. A defragmentation scheme managed by an application works on what the application sees as the contents of the file, faithfully reproduced by the file system. But that same file system probably has the actual file contents spread all over the physical disk itself. All the work we did to make the file look good has little to do with real performance on a fragmented disk: a 128-byte seek in the file might equate to a 128-MB seek on disk, and a 10-MB seek in the file might actually seek only 10 KB on disk.

After we've put in all the work, and compromised a number of other features we really wanted in the application, we find that our cool file management code does little more than keep the file size to a minimum. It buys us little in the way of real performance. In other words, we missed the design goal completely and the jitters set in, making us truly pay for that latte.