Structured Storage and Compound Files

A long time ago in the computer industry, there was a one-to-one relationship between the computer and the single application that ran on it. That application had total control over all system resources, including all storage devices. It was the innovation of the operating system along with a "file system" that enabled multiple applications to share system resources. The file system was specifically responsible for allowing those applications to share the same storage device, and it did so by partitioning the disk into directories and files. Whereas the application saw a file as a flat byte array, the file system stored the information in noncontiguous sectors around the device.

A component integration environment requires something more than what a file system can offer. In such an environment, different components need to share a single disk file, just as applications once needed to share a single disk drive. OLE's Structured Storage is a specification that defines a number of storage-related interfaces to achieve exactly this, defining a "file system within a file." Instead of requiring that a single file handle with a single seek pointer manipulate a large contiguous sequence of bytes on the disk, Structured Storage describes how to treat a single file-system entity as a structured collection of two types of objects—storages and streams—that act like directories and files. Together, storages and streams provide powerful features such as transactioning and incremental access, as we'll see in more detail in Chapter 7.

A stream object, which implements the interface IStream, is the conceptual equivalent of a single disk file as we understand disk files today. Streams are the basic file-system component in which data lives, and each stream in itself has access rights and a single seek pointer. Streams are named by using a text string (up to 31 characters) and can contain any internal structure you want.

A storage object, which implements the interface IStorage, is the conceptual equivalent of a directory. Each storage, like a directory, can contain any number of storages (subdirectories) and any number of streams (files), as shown in Figure 1-11. In turn, each substorage can contain any number of storages and streams until your disk is full. Unlike streams, storages do not themselves contain any user-defined data; rather, they manage the names and locations of the elements within them. Like a stream, a storage object is named with a text string and has access rights (in contrast to typical file-system directories, which commonly do not have access rights). Given a storage, you can ask it to enumerate, copy, move, rename, or delete the elements within it, or to change their dates and times, much as you can with directory commands at a command prompt.

Figure 1-11. Conceptual structure of storage and stream objects.
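The directory-and-file analogy can be made concrete with a small in-memory model. The following is only a sketch in Python, not the IStorage or IStream interfaces themselves: the class names Stream and Storage, every method name, and the decision to apply the 31-character limit to storage names as well as stream names are assumptions made for illustration.

```python
import io

MAX_NAME_LEN = 31  # the text gives 31 characters for stream names


def check_name(name):
    """Reject empty or overlong element names."""
    if not 0 < len(name) <= MAX_NAME_LEN:
        raise ValueError(f"name must be 1 to {MAX_NAME_LEN} characters")
    return name


class Stream:
    """Toy stream: a named run of bytes with a single seek pointer."""

    def __init__(self, name):
        self.name = check_name(name)
        self._bytes = io.BytesIO()  # BytesIO keeps one current position

    def write(self, data):
        return self._bytes.write(data)

    def seek(self, offset):
        return self._bytes.seek(offset)

    def read(self, size=-1):
        return self._bytes.read(size)


class Storage:
    """Toy storage: holds named child elements, never user data itself."""

    def __init__(self, name):
        self.name = check_name(name)  # assumption: same limit as for streams
        self._elements = {}

    def _add(self, element):
        if element.name in self._elements:
            raise KeyError(f"{element.name!r} already exists")
        self._elements[element.name] = element
        return element

    def create_stream(self, name):
        return self._add(Stream(name))

    def create_storage(self, name):
        return self._add(Storage(name))

    def open(self, name):
        return self._elements[name]

    def rename(self, old, new):
        check_name(new)
        if new in self._elements:
            raise KeyError(f"{new!r} already exists")
        element = self._elements.pop(old)
        element.name = new
        self._elements[new] = element

    def delete(self, name):
        del self._elements[name]

    def enumerate(self):
        return sorted(self._elements)


root = Storage("Root")
charts = root.create_storage("Charts")    # behaves like a subdirectory
contents = root.create_stream("Contents")  # behaves like a file
contents.write(b"hello, world")
contents.seek(7)        # move the stream's one seek pointer
tail = contents.read()  # read from the pointer to the end
root.rename("Charts", "Figures")
```

Note that the storage never touches the bytes inside a stream. It manages only names and membership, which is the division of labor the text describes.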

OLE provides as a service an implementation of structured storage that is called Compound Files.[10] You can use this technology to replace traditional file handle–based API functions such as _lread and _lwrite, as we'll see in Chapter 7. To be perfectly accurate, Compound Files is an implementation of Structured Storage that is specifically directed to a disk file. Through a small customization called a lockbytes object (which implements ILockBytes), you can direct all the information to another location, such as memory, a database record, or even a portion of another file. In fact, Compound Files is really just a disk-based lockbytes object plugged into OLE's otherwise independent storage implementation.
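The role of a lockbytes object, a pluggable byte array beneath an otherwise unchanged storage layer, can be sketched as follows. This is a simplified Python model, not ILockBytes: the classes MemoryBytes and FileBytes, the methods read_at and write_at, and the helpers store_record and load_record are hypothetical names, and the real interface has more members than this sketch shows.

```python
import os
import tempfile


class MemoryBytes:
    """Byte array kept in memory."""

    def __init__(self):
        self._buf = bytearray()

    def write_at(self, offset, data):
        end = offset + len(data)
        if end > len(self._buf):
            # Grow with zero bytes so the write fits.
            self._buf.extend(b"\0" * (end - len(self._buf)))
        self._buf[offset:end] = data
        return len(data)

    def read_at(self, offset, size):
        return bytes(self._buf[offset:offset + size])


class FileBytes:
    """Byte array kept in a disk file."""

    def __init__(self, path):
        self._file = open(path, "w+b")  # creates or truncates the file

    def write_at(self, offset, data):
        self._file.seek(offset)
        written = self._file.write(data)
        self._file.flush()
        return written

    def read_at(self, offset, size):
        self._file.seek(offset)
        return self._file.read(size)

    def close(self):
        self._file.close()


def store_record(backing, offset, payload):
    """Write a length-prefixed record; knows nothing about the medium."""
    header = len(payload).to_bytes(4, "little")
    backing.write_at(offset, header + payload)


def load_record(backing, offset):
    """Read back a record written by store_record."""
    size = int.from_bytes(backing.read_at(offset, 4), "little")
    return backing.read_at(offset + 4, size)


memory = MemoryBytes()
store_record(memory, 16, b"summary")
from_memory = load_record(memory, 16)

with tempfile.TemporaryDirectory() as folder:
    disk = FileBytes(os.path.join(folder, "demo.bin"))
    store_record(disk, 16, b"summary")
    from_disk = load_record(disk, 16)
    disk.close()
```

The two helper functions stand in for the storage implementation: they run unchanged over either backing object, which is the sense in which the medium is a small customization.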

Because the lockbytes object controls only the ultimate storage medium of bytes in a compound file, OLE itself controls the actual data structures and the underlying file format. Just as a file system makes disparate sectors on a disk appear as a contiguous byte array, OLE makes disparate blocks of data in a file appear as contiguous streams, and it also provides automatic garbage collection and defragmentation features. Microsoft itself provides the Compound Files implementation for Windows and the Macintosh but recognizes that many vendors ship applications for other platforms as well. For that reason, Microsoft licenses the straight ANSI C++ source code for Compound Files, which you can recompile for other platforms as needed.
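One way to picture how scattered blocks can appear as a contiguous stream is a chain of fixed-size sectors linked through a table, much as a file system links disk sectors. The sketch below is an illustration under that assumption only. It does not reproduce the actual Compound Files format, and the names SectorFile, append, chain, and read, as well as the tiny sector size, are hypothetical.

```python
SECTOR_SIZE = 8     # tiny on purpose so the chains are easy to follow
END_OF_CHAIN = -1


class SectorFile:
    """Toy file whose streams live in linked, noncontiguous sectors."""

    def __init__(self):
        self.sectors = []      # sector contents, each up to SECTOR_SIZE bytes
        self.next_sector = []  # next_sector[i] is the sector that follows i
        self.streams = {}      # name -> {"first", "last", "size"}

    def append(self, name, data):
        entry = self.streams.setdefault(
            name, {"first": END_OF_CHAIN, "last": END_OF_CHAIN, "size": 0})
        data = bytes(data)
        # Top up the partially filled last sector, if there is one.
        used = entry["size"] % SECTOR_SIZE
        if entry["last"] != END_OF_CHAIN and used:
            head = data[:SECTOR_SIZE - used]
            self.sectors[entry["last"]] += head
            entry["size"] += len(head)
            data = data[len(head):]
        # Allocate fresh sectors at the end of the file for the rest.
        for start in range(0, len(data), SECTOR_SIZE):
            chunk = data[start:start + SECTOR_SIZE]
            index = len(self.sectors)
            self.sectors.append(chunk)
            self.next_sector.append(END_OF_CHAIN)
            if entry["first"] == END_OF_CHAIN:
                entry["first"] = index
            else:
                self.next_sector[entry["last"]] = index
            entry["last"] = index
            entry["size"] += len(chunk)

    def chain(self, name):
        """Return the sector numbers that make up a stream, in order."""
        indexes = []
        index = self.streams[name]["first"]
        while index != END_OF_CHAIN:
            indexes.append(index)
            index = self.next_sector[index]
        return indexes

    def read(self, name):
        """Return the stream as one contiguous byte string."""
        return b"".join(self.sectors[i] for i in self.chain(name))


f = SectorFile()
f.append("A", b"0123456789")
f.append("B", b"abcdefghij")
f.append("A", b"ABCDEFGHIJ")  # lands after B's sectors in the file
a_chain = f.chain("A")
a_bytes = f.read("A")
```

Stream A ends up in sectors that are separated by stream B's data, yet a reader of A sees a single unbroken byte sequence. Reclaiming freed sectors and defragmenting, which the text also mentions, are left out of this sketch.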

There is one tremendous advantage to using structured storage: because the hierarchy of storage and stream elements is stored in a standard format and is accessed through a standard OLE service, anything can browse through a hierarchy in a file without having to run the code that created the file. Although the format of information in streams is still proprietary, the names and locations of those streams within the file are not. Therefore, the system shell can include browsing tools to examine the structure of the file.
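The browsing idea can be illustrated with a generic walker that lists names and sizes while treating stream contents as opaque. In this Python sketch a nested dict stands in for a storage and a bytes value stands in for a stream. That representation, the function name walk, and the sample element names are all assumptions for illustration rather than anything defined by OLE.

```python
def walk(storage, path=""):
    """Yield (path, kind, size) for every element, depth first."""
    for name in sorted(storage):
        element = storage[name]
        full_path = f"{path}/{name}"
        if isinstance(element, dict):
            # A storage: report it, then descend into its elements.
            yield full_path, "storage", None
            yield from walk(element, full_path)
        else:
            # A stream: report its size without interpreting its bytes.
            yield full_path, "stream", len(element)


# Hypothetical document; the stream contents are meaningless to the walker.
document = {
    "Contents": b"\x00\x01\x02\x03",
    "Embedded": {
        "Chart": b"\xff" * 10,
        "Notes": {},
    },
}

listing = list(walk(document))
```

The walker needs no knowledge of the application that produced the streams, which is what would let a shell-level tool display the structure of any such file.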

This advantage is further enhanced when a standard does exist for specific types of information. The only standard that currently exists is for a stream (located off the root storage) called "\005SummaryInformation" (ASCII 5 is the first character in the name), which contains document information such as author, title, subject, keywords, comments, creation/save/print date, word count, and so on. This information is stored in a format known as a property set and is covered in Chapter 16 (with other property-related topics). Because the stream has a standard format, a standard location, and a standard name, anyone and anything can retrieve this information and do interesting things with it. For example, the Windows 95 Explorer allows you to search for documents and then view their summary information.
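The leading control character in the stream name is easy to get wrong, so it is worth spelling out. The Python sketch below shows only how the name is formed and how a tool might look it up in a root storage, modeled here as a plain dict. The function name find_summary_stream and the placeholder bytes are hypothetical, and the property set format itself is not decoded.

```python
# "\005" is an octal escape for the character with code 5; "\x05" is the
# same character written in hexadecimal.
SUMMARY_STREAM_NAME = "\005SummaryInformation"


def find_summary_stream(root_elements):
    """Return the raw bytes of the summary stream, or None if it is absent."""
    return root_elements.get(SUMMARY_STREAM_NAME)


# Hypothetical root storage; the stream bytes here are placeholders only.
root_elements = {
    "Contents": b"application data",
    SUMMARY_STREAM_NAME: b"placeholder property set bytes",
}

raw_summary = find_summary_stream(root_elements)
missing = find_summary_stream({"Contents": b"application data"})
```

Because the name and location are fixed, the lookup needs nothing from the application that wrote the file. Interpreting the returned bytes would additionally require the property set format.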

The long-term plans for structured storage include full content indexing of a file to enable shell-level searches based not only on summary information but also on content. This capability is far more powerful, yet easier to use, than requiring the end user to first find a file, then find the application that can load that file, and then use the application to open and browse files to eventually find the data. Content indexing will generally work through vendor-supplied content filter objects that crack the proprietary stream formats within a file and return indexing information to the system. The first manifestation of these types of objects (not covered in this book) is the set of file viewers in Windows 95, which provide a quick way to view the contents of a file without having to load the entire application that created it. It is a powerful addition, done completely through OLE, and is all part of Microsoft's Information at Your Fingertips philosophy.

[10] Formerly called "docfiles," a name that is considered archaic but is still in use because it rolls off the tongue so nicely. Note that "compound files" bears no relation to "compound documents" except that a compound file is an excellent medium in which to store such a document.