Text

Each word processor has its own way of representing formatting and special characters. We do not want to create a separate process for each word-processor format we receive, so we first convert every incoming file into an intermediate format. The intermediate format has no use other than to provide a standard starting point for the rest of the transformation.

The intermediate format is a simple Microsoft Word for Windows® document with character and paragraph formatting. The character formatting is the same as in the original text (i.e., italics, bold, and other character attributes are simply transformed into Word for Windows character formats). The paragraph formatting is done with Word for Windows styles. Paragraphs with the same purpose (e.g., first-level headings, figure captions, or bulleted text) are given the same style name.
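To make the idea of purpose-based style names concrete, here is a minimal sketch in Python rather than WordBasic. The purpose labels, the style names, and the function assign_styles are hypothetical and chosen only for illustration; they are not the names used in our conversion routines.

```python
# Hypothetical mapping from a paragraph's purpose to a shared style name.
# Neither the purpose labels nor the style names come from the actual routines.
PURPOSE_TO_STYLE = {
    "chapter title": "heading 1",
    "section title": "heading 2",
    "figure caption": "caption",
    "bullet item": "list bullet",
}


def assign_styles(paragraphs, default_style="normal"):
    """Return (style, text) pairs so that same-purpose paragraphs share a style.

    `paragraphs` is a list of (purpose, text) pairs. Any purpose missing
    from the mapping falls back to `default_style`.
    """
    return [
        (PURPOSE_TO_STYLE.get(purpose, default_style), text)
        for purpose, text in paragraphs
    ]
```

Whatever the source format called its caption paragraphs, every caption ends up under one style name, which is what lets later steps treat all captions alike.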

The style names provide the basis for the automated processing that follows. Most importantly, the hierarchical structure of the information is represented with Word for Windows heading styles. With heading styles applied, we can later view the content in the Word for Windows outline view and reconstruct the table of contents automatically.
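The following sketch suggests how a table of contents might be rebuilt from heading styles alone. It is written in Python for illustration and is not our WordBasic code. It assumes each paragraph is available as a style name plus its text, and that heading styles are named "heading 1", "heading 2", and so on; the Paragraph type and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Paragraph:
    style: str  # style name, assumed here to look like "heading 1" or "normal"
    text: str


def heading_level(style: str) -> Optional[int]:
    """Return the level for a style such as 'heading 2', or None otherwise."""
    parts = style.lower().split()
    if len(parts) == 2 and parts[0] == "heading" and parts[1].isdigit():
        return int(parts[1])
    return None


def build_toc(paragraphs: List[Paragraph]) -> List[str]:
    """Collect heading paragraphs in document order, indented by level."""
    toc = []
    for paragraph in paragraphs:
        level = heading_level(paragraph.style)
        if level is not None:
            # Two spaces of indentation per level below the first.
            toc.append("  " * (level - 1) + paragraph.text)
    return toc
```

Because the hierarchy lives entirely in the style names, the same pass works on any file that has been through preprocessing, whatever word processor it came from.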

Approximately 90% of the preprocessing is done by conversion routines written in Word's programming language, WordBasic. The remaining 10% is done manually (which, of course, takes hundreds of times longer than the first 90%).

At the end of preprocessing, we have a set of content files that are consistently formatted regardless of the format in which they arrived.