Definitions
OLE 2.0:
Object Linking and Embedding 2.0
API (Application Programming Interface):
A set of libraries, functions, definitions, etc. which describe an interface to a programming environment or model.
docfile:
An OLE 2.0 compatible multi-stream file. Word files are docfiles.
page (or sector):
512 byte segment of the main stream within a Word docfile that begins on a 512-byte boundary. (bytes 0-511 are in page 0, bytes 512-1023 are in page 1, etc.). In Word data structures, an unsigned two-byte integer page number is given the acronym PN (for Page Number).
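The page arithmetic above can be sketched in a few lines of Python. The function names are hypothetical and are not part of the file format.

```python
PAGE_SIZE = 512  # bytes per page (sector) in the main stream

def pn_to_offset(pn: int) -> int:
    """Return the byte offset in the main stream at which page `pn` begins."""
    if not 0 <= pn <= 0xFFFF:
        raise ValueError("a PN is an unsigned two-byte integer")
    return pn * PAGE_SIZE

def offset_to_pn(offset: int) -> int:
    """Return the number of the page that contains the given byte offset."""
    return offset // PAGE_SIZE
```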
document:
A named, multi-linked list of data structures, representing an ordered stream of text with properties that was produced by a user of Microsoft Word.
stream:
The physical encoding of a Word document's text and sub data structures in a random access stream within a docfile.
main stream:
The stream within a Word docfile containing the bulk of Word's binary data.
table stream:
The stream within a Word docfile containing the various PLCFs and tables that describe a document's structures.
data stream:
The stream within a Word docfile containing various data that hang off of characters in the main stream. For example, binary data describing in-line pictures and/or formfields.
summary information stream:
The stream within a Word docfile containing the document summary information.
object storage:
A storage containing binary data for an embedded OLE 2.0 object.
CP (Character Position):
A four-byte integer which is the position coordinate of a character of text within the logical text stream of a document.
FC (File Character position):
A four-byte integer which is the byte offset of a character (or other object) from the beginning of a stream of the docfile. Before a file has been edited (i.e. in a full-saved Word document), CPs can be transformed into FCs by adding the FC coordinate of the beginning of a document's text stream to the CP. After a file has been edited (i.e. in a fast-saved Word document), the mapping from CP to FC is recorded in the piece table (see below).
XCHAR (eXtended CHARacter set):
A data type which defines a "character". Each XCHAR corresponds to a character in the document, where "character" is defined as a glyph, regardless of whether it is a single-byte or double-byte character. With Word6/FE, Word95/FE, Word97/all and future versions of Word, this is defined as a 16-bit integer corresponding to the Unicode character code of the glyph.
PLF (PLex stored in File):
A data structure consisting of an array of structures preceded by a long count of structures.
PLCF (PLex of CPs (or FCs) stored in File):
A data structure consisting of two parallel arrays that allows a relation to be established between a certain CP position in the document text stream (or FC position in a file) and an arbitrary data structure. It consists of an array of n+1 CPs or FCs followed by an array of n instances of a particular arbitrary data structure. In typical usage, the nth CP or FC of the PLCF is in one-to-one correspondence with the nth instance of the arbitrary data structure, with the n+1st CP or FC marking the limit of the nth instance's influence. When a PLCF is used to record a partitioning of the document's text stream or a partitioning of the bytes stored in a file, the 0th CP/FC stored in the PLCF will be 0. When a PLCF is used to record the location of certain marks or links within the document text stream, the 0th CP/FC stored in the PLCF will record the position of the 0th mark or link. To properly interpret a PLCF stored in a Word file, the length of the stored PLCF and the length of the arbitrary data structure stored in the PLCF must be known. The length of the stored PLCF is recorded in the FIB. The lengths of the data structures stored in PLCFs within Word files are listed later in this document.
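Because a stored PLCF holds n+1 four-byte positions followed by n structures of a known size, n can be recovered from the total length. The following Python sketch shows that calculation. It assumes little-endian, signed four-byte positions; `parse_plcf` and `cb_struct` are hypothetical names, and the structures are returned as raw bytes for the caller to decode.

```python
import struct

def parse_plcf(data: bytes, cb_struct: int):
    """Split a stored PLCF into its n+1 CPs/FCs and its n fixed-size structures."""
    # Total length = 4 * (n + 1) + cb_struct * n, so solve for n.
    n, remainder = divmod(len(data) - 4, 4 + cb_struct)
    if len(data) < 4 or remainder != 0:
        raise ValueError("PLCF length does not match the structure size")
    positions = list(struct.unpack_from(f"<{n + 1}i", data, 0))
    base = 4 * (n + 1)  # the structure array follows the position array
    items = [data[base + i * cb_struct: base + (i + 1) * cb_struct]
             for i in range(n)]
    return positions, items
```

The nth structure then governs the range from `positions[n]` up to, but not including, `positions[n + 1]`.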
piece table:
The piece table is a data structure that describes the logical sequence of characters in a Word document and records recent changes to the formatting of a Word document. It is stored in a Word file as a PLCF named the plcfpcd (PLex of CPs containing Piece Descriptors). The piece table relates a logical character number, called a CP (Character Position), to a physical location within a Word file (an FC). The array of CPs in the plcfpcd defines a partitioning of the Word document into disjoint pieces. The second array is an array of PCDs (Piece Descriptors), in 1-to-1 correspondence with the array of CPs, each of which records the physical location in the Word file where the corresponding piece begins.
To find the physical location of a particular logical character in a Word document, take the CP coordinate of that character within the document and find the piece that contains that character. This is done by finding the index of the largest CP in the array of CPs that is less than or equal to the character's CP. Then reference the PCD with that index in the array of PCDs. The FC stored in the PCD gives the position of the beginning of the piece in the file. Finally, add the offset of the desired character from the beginning of its piece to the FC of the beginning of the piece. This gives a "virtual" file offset of the character. If the second most significant bit of this value is clear, the value is the actual file offset of the character, which is stored as a two-byte Unicode character. If the second most significant bit is set, the character is stored as a one-byte code page 1252 compressed version of the Unicode character, and its actual offset is found by clearing this bit and dividing by two.
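The lookup can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the CP array and the FC of each PCD are assumed to have been extracted already, the function name is hypothetical, and the character offset within the piece is assumed to be scaled by the stored character width (two bytes per character before the virtual offset is halved, which amounts to one byte per character in a compressed piece).

```python
from bisect import bisect_right

COMPRESSED_BIT = 0x40000000  # second most significant bit of a four-byte FC

def cp_to_file_offset(cp, piece_cps, piece_fcs):
    """Map a CP to (byte offset, bytes per character, encoding name)."""
    if not piece_cps[0] <= cp < piece_cps[-1]:
        raise ValueError("CP lies outside the piece table")
    i = bisect_right(piece_cps, cp) - 1   # largest CP entry not exceeding cp
    char_offset = cp - piece_cps[i]       # characters from the start of the piece
    fc = piece_fcs[i]
    if fc & COMPRESSED_BIT:
        # Compressed piece: clear the bit and halve to get the real byte offset.
        start = (fc & ~COMPRESSED_BIT) // 2
        return start + char_offset, 1, "cp1252"
    # Uncompressed piece: each character occupies two bytes.
    return fc + 2 * char_offset, 2, "utf-16-le"
```

With `piece_cps` holding n+1 boundaries and `piece_fcs` holding n entries, the returned offset and width tell the caller where to read and how to decode the character.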
sprm (Single PRoperty Modifier):
An instruction to modify one or more properties within one of the property defining data structures (CHP, PAP, TAP, SEP, or PIC). It consists of an operation code which identifies the field(s) to be changed, and an operand which gives the value that a particular field is changed to or else which is a parameter to a procedure which will change the field or fields. A prl (property modifiers stored in a list) is a sprm plus its operand.
grpprl (group of prls):
A grpprl is a data structure that records a set of sprms. The 0th sprm is recorded at offset 0 of the structure. Any succeeding sprms are recorded immediately after the end of the preceding sprm. To traverse a grpprl and locate the sprms recorded within it, it is necessary to fetch the opcode of the first sprm, look up the length of the sprm with that opcode, use that length to skip past the first sprm, fetch the opcode of the second sprm, look up the length of that sprm, use the length to skip the second sprm, and so on. See the table in the "SPRM Definition" topic to determine the length of a sprm.
The phrase "apply the sprms of a grpprl (or papx or sepx)" used later in this document means to fetch the 0th sprm recorded in the grpprl and perform the action for that sprm, fetch the first sprm and perform its action, and continue this procedure until all sprms in the grpprl (or papx or sepx) have been processed.
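The traversal and the "apply" procedure can be sketched in Python as below. Everything specific here is an assumption made for illustration: the opcode is taken to be a two-byte little-endian value, `sprm_length` is a hypothetical table giving the total length (opcode plus operand) of each sprm, and `actions` is a hypothetical table of procedures. The real lengths and actions come from the "SPRM Definition" topic.

```python
import struct

def iter_sprms(grpprl: bytes, sprm_length: dict):
    """Yield (opcode, operand bytes) for each sprm stored in a grpprl."""
    offset = 0
    while offset < len(grpprl):
        (opcode,) = struct.unpack_from("<H", grpprl, offset)
        total = sprm_length[opcode]   # hypothetical table: opcode plus operand
        yield opcode, grpprl[offset + 2: offset + total]
        offset += total               # skip to the next sprm

def apply_grpprl(props: dict, grpprl: bytes, sprm_length: dict, actions: dict) -> dict:
    """Apply each sprm in order, 0th first, to a copy of `props`."""
    result = dict(props)
    for opcode, operand in iter_sprms(grpprl, sprm_length):
        actions[opcode](result, operand)
    return result
```

Working on a copy keeps the reference properties (for example, those inherited from a style) unchanged, so the same reference can be reused for other exceptions.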
prm (PRoperty Modifier):
A field in piece table entries that records how the properties of text within a piece were changed to reflect user formatting operations. The prm usually contains an index to a grpprl which records the user's formatting changes as a group of sprms. If the user has made only a small change to formatting that can be expressed as a single 2 or 1-byte sprm, that sprm is stored within the prm.
STTBF (STring TaBle stored in File)
Word has many tables of strings that are stored as Pascal type strings. STTBFs consist of an optional short containing 0xFFFF, indicating that the strings are extended character strings, a short indicating how many strings are included in the string table, another short indicating the size in bytes of the extra data stored with each string, and then each string followed by its extra data. Non-extended character Pascal strings begin with a single-byte length count which describes how many characters follow the length byte in the string. If pst is a pointer to an array of characters storing a Pascal style string, then the length of the string, counting the length byte itself, is *pst+1. In an STTBF, Pascal style strings are concatenated one after another until the length of the STTBF recorded in the FIB is exhausted. Extra data associated with a string may also be stored in an STTBF. When extra data is stored for an STTBF, it is written at the end of each string. For example, suppose the extra data for an STTBF consists of a short. If the string "Cat" were stored, the actual entry in the string table would consist of a length byte containing 3 (3 for "Cat") followed by the bytes 'C' 'a' 't', followed by the 2 bytes containing the short. Extended character strings are stored in just the same way, except that they have a double-byte length count and each extended character occupies two bytes.
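The layout described above can be read with a short Python sketch. It is a sketch under assumptions: `parse_sttbf` is a hypothetical name, non-extended strings are decoded as code page 1252 and extended strings as little-endian 16-bit characters, and the string count is trusted rather than the STTBF length recorded in the FIB.

```python
import struct

def parse_sttbf(data: bytes):
    """Return a list of (string, extra data bytes) pairs from a stored STTBF."""
    offset = 0
    extended = struct.unpack_from("<H", data, 0)[0] == 0xFFFF
    if extended:
        offset = 2                          # skip the optional 0xFFFF marker
    count, cb_extra = struct.unpack_from("<HH", data, offset)
    offset += 4
    entries = []
    for _ in range(count):
        if extended:
            (cch,) = struct.unpack_from("<H", data, offset)  # double-byte count
            offset += 2
            text = data[offset: offset + 2 * cch].decode("utf-16-le")
            offset += 2 * cch
        else:
            cch = data[offset]              # single-byte count
            offset += 1
            text = data[offset: offset + cch].decode("cp1252")
            offset += cch
        entries.append((text, data[offset: offset + cb_extra]))
        offset += cb_extra                  # extra data follows each string
    return entries
```

For the "Cat" example, an STTBF holding one string with two bytes of extra data yields a single pair whose first element is "Cat" and whose second element is the two bytes of the short.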
full-saved (or non-complex) file:
A Word file in which the physical order of characters stored in the file is identical to the logical order of characters in the document that the file represents. The text stream of a non-complex file can be described by an fc (an offset from the beginning of the file) to mark where the text begins and a ccp (count of CPs) to record how many characters are stored in the text stream. Because of Unicode compression to code page 1252, all files (simple and complex) now contain a piece table. However, a full-saved piece table will not have property modifiers (prms), and all text in the file will be referenced by the piece table.
fast-saved (or complex) file:
A Word file in which the physical order of characters stored in the file does not match the logical order of characters in the document that the file represents. A piece table must be stored in the file to describe the text stream of the document. Because of Unicode compression to code page 1252, all files (simple and complex) now contain a piece table.
FIB (File Information Block):
The header of a Word file. Begins at offset 0 in file. Gives the beginning offset and lengths of the document's text stream and subsidiary data structures within the file. Also stores other file status information.
paragraph
A contiguous sequence of characters within the text stream of a document that is delimited by a paragraph mark, cell mark, row mark, or a section mark (These are special characters described later in this document).
run of text
A contiguous sequence of characters within the text stream of a document that have the same character formatting properties. A single run may cross paragraph boundaries and may encompass the entire document.
section
A contiguous sequence of paragraphs within the text stream of a document that is delimited by a section mark or by the final paragraph mark at the end of a document. Users frequently treat sections as the equivalent of a chapter in a book. The boundaries of sections mark locations where the layout rules for a document (number of columns, text of headers and footers to use, whether page numbers should be displayed, etc.) are changed.
paragraph style
A named set of character and paragraph properties that can be associated with any number of paragraphs in a Word document's text stream. A paragraph style provides a set of character and paragraph property defaults for the text of any paragraph tagged with that style. When a new paragraph is created and given a particular style, newly typed text is given the character and paragraph properties of that style unless the user makes an exception to the paragraph style definition by performing other editing operations.
CHP (CHaracter Properties)
The data structure describing the character properties of a run of text.
CHPX (Character Property EXception)
A data structure which describes how a particular CHP differs from a reference CHP. In Win Word 6.0, the CHPX simply consists of a grpprl which is applied to the reference CHP to produce the originally encoded CHP. By applying a CHPX to the character properties (CHP) inherited by a particular paragraph from its style, it is possible to reconstitute the CHP for the portion of the character run that intersects that paragraph.
character style
A named character property exception that can be associated with any number of runs of text in a Word document's text stream. When a run of text is tagged with a particular character style, a chpx recorded for the character style is applied to the character properties that are defined for the paragraph style of the paragraph that contains the text. This means that the character style can change one or more of the character property field settings specified by the paragraph style of a paragraph to a particular setting without changing the value of any other field.
PAP (PAragraph Properties)
The data structure which describes the properties of a particular paragraph.
PAPX (PAragraph Property EXception)
A data structure describing how a particular paragraph's properties differ from the paragraph properties of the style assigned to the paragraph. By applying a PAPX to the paragraph properties (PAP) inherited by a particular paragraph from its style, it is possible to reconstitute the PAP for that paragraph. The PAPX contains an ISTD (a style code to identify the style in control of the paragraph) and a grpprl which specifies how the style's paragraph properties must be changed to produce the paragraph properties of the paragraph.
table row:
A contiguous sequence of paragraphs within the text stream of a document that is partitioned into subsequences of paragraphs called cells. The last paragraph of each cell is terminated by a special paragraph mark called a cell mark. Following the cell mark that ends the last cell of a table row, the table row is terminated by a special paragraph mark called a row mark. When Word displays a table row, it assigns a rectangular display area to each cell in the row. The tops of all the cell display areas are aligned at the same vertical position on a page. The leftmost display area in a table row is assigned to the 0th cell of the row; the next display area to the right is assigned to the 1st cell of the row, etc. The text of the cell is wrapped to fit its display area. As more text is added to the cell, the cell display area extends downward. A set of table properties that determine how many cells are in a row, where the horizontal boundaries of cell display areas are, and what borders are drawn around each cell in the table is stored for the row mark that marks the end of the table row.
TAP (TAble Properties):
The data structure which describes the properties of a single table row. The information in the TAP for a table row is stored in a Word file as a list of sprms that modify a TAP which has been cleared to zeros. This list of table sprms is appended to the grpprl of paragraph sprms that is recorded in the PAPX for the row mark that delimits the end of a table row.
STSH (STyle SHeet)
A data structure which represents every style defined within the Word document. The STSH records a unique name string for every style and associates each name with a particular CHP and/or a PAP. The indexes used to refer to individual styles are called ISTDs (Indexes to STyle Descriptors). Every PAPX for every paragraph recorded in a document contains an ISTD which identifies the style from which a paragraph inherited its default character and paragraph properties. CHPXs recorded for the text within the paragraph and PAPXs recorded for the paragraph itself encode changes that the user has made with respect to the style's default properties.
FKP (Formatted disK Page):
A data structure that fits in one 512-byte page that encodes either the character properties or the paragraph properties of a certain portion of a Microsoft Word file. An FKP consists of four components:
1) a count of the number of runs or paragraphs described by the page.
2) an array of FCs recorded in ascending order demarcating the boundaries between runs or paragraphs that are recorded adjacent to one another in the Word file.
3) In character FKPs an array of offsets within the FKP in one to one correspondence with the array of FCs that locate the properties of the run that begins at a particular FC.
In LVC FKPs an array of offsets within the FKP in one to one correspondence with the array of FCs that locate the LVCXs that describe the run that begins at a particular FC.
In paragraph FKPs an array of BX structures follows the array of FCs in one to one correspondence with the array of FCs. Each BX begins with an offset that locates the properties of the paragraph that begins at a particular FC. The remainder of the BX contains a PHE structure that encodes information about the height of the paragraph that begins at that FC.
4) a group of CHPXs if the FKP stores character properties, a group of PAPXs if the FKP stores paragraph and table properties, or a group of LVCXs if the FKP stores paragraph level and numbering cache information
To find the CHPX/PAPX corresponding to a particular character in a document, calculate the FC coordinate for that character. Then search through the bin table (see next entry) for the type of property you want to produce, to find the FKP in the document stream whose array of FCs encompasses the FC of the document character.
Then search within the FKP to find the index of the largest FC entry that is less than or equal to the FC of the document character. Use this index to look up an offset in the array of offsets (for character FKPs) or look up an offset in the array of Bxs (for paragraph FKPs) within the FKP. Add this offset to the beginning address of the FKP in memory. This will be the first byte of the desired CHPX/PAPX.
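The two-level search (bin table first, then the FKP) can be sketched in Python as follows. The sketch assumes that the bin table and the FKPs have already been decoded into lists, that each FKP's FC array has one more entry than it has runs or paragraphs, and that the offsets are byte offsets relative to the start of the FKP; the stored encoding of each offset is given in the FKP structure definition. It returns a main-stream offset rather than a memory address, and all names are hypothetical.

```python
from bisect import bisect_right

PAGE_SIZE = 512  # an FKP occupies exactly one page

def locate_exception(fc, bte_fcs, bte_pns, fkps):
    """Return the main-stream byte offset of the CHPX/PAPX covering `fc`.

    bte_fcs, bte_pns: the bin table's FC boundaries and FKP page numbers.
    fkps: hypothetical mapping of PN to (fcs, offsets) for decoded FKPs.
    """
    if not bte_fcs[0] <= fc < bte_fcs[-1]:
        raise ValueError("FC is not covered by the bin table")
    pn = bte_pns[bisect_right(bte_fcs, fc) - 1]   # FKP whose range contains fc
    fcs, offsets = fkps[pn]
    i = bisect_right(fcs, fc) - 1                 # largest FC entry <= fc
    return pn * PAGE_SIZE + offsets[i]
```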
bin table
Each FKP can be viewed as a bucket or bin that contains the properties of a certain range of FCs in the Word file. Word files maintain a PLC, the plcfbte (PLex of FCs containing Bin Table Entries). It records the association between a particular range of FCs and the PN (Page Number) of the FKP that contains the properties for that FC range in the file. In a complex (fast-saved) Word document, FKP pages are intermingled with pages of text in a random pattern which reflects the history of past fast saves. In a complex document, a plcfbteChpx which records the location of every CHPX FKP must be stored and a plcfbtePapx which records the location of every PAPX FKP must be stored. In a non-complex, full-saved document, all of the CHPX FKPs are recorded in consecutive 512-byte pages with the FKPs recorded in ascending FC order, as are all of the PAPX FKPs. A plcfbteLvcx serves the same purpose for LVCX FKPs.
In a full-saved document, it may not have been possible to expand the plcfbtes during the save process because of a lack of RAM. In that situation, the plcfbtes will be interspersed with the property pages in a linked list of FBD pages.
SEP (SEction Properties)
The data structure describing the properties of a particular section.
SEPX (SEction Property EXceptions)
A data structure describing how the properties of a particular section differ from a Word-defined standard SEP. As in the PAPX, the differences between the SEP for a section and the standard SEP are encoded as a list of sprms that describe how the standard SEP can be transformed into the section's SEP. By applying a SEPX's sprms to the standard SEP, it is possible to reconstitute the SEP for that section.
The PLCFSED, a data structure stored in a Word file, records the locations of all SEPXs stored in a Word file. The array of CPs in the plcfsed records the boundaries of sections in the Word document. The second array in the plcf, an array of SEDs (SEction Descriptors), is in 1-to-1 correspondence to the array of CPs. Each SED stores the beginning FC of the SEPX that records the properties for a section. If the FC stored in a SED is -1, the section properties of the section are exactly equal to the standard section properties.
The SEP for a particular section may be constructed if a CP of a character in that section is known. First search the array of CPs in the PLCFSED for the index of the largest CP that is less than or equal to the CP of the character. Use this index to locate the SED in the plcfsed which describes the section. The FC stored in the SED is the offset from the beginning of the Word file at which the SEPX is stored. If the stored FC is equal to 0xFFFFFFFF (that is, -1 when read as a signed integer), then the SEP for the section is exactly equal to the standard SEP (see SEP structure definition). Otherwise, read the SEPX into memory and create a copy of the standard SEP. Finally, apply the sprms stored in the SEPX to the standard SEP to produce the SEP for a section.
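The section lookup, including the sentinel FC, can be sketched in Python. The values -1 and 0xFFFFFFFF are the same four bytes read as a signed or an unsigned integer, so the sketch normalises to the unsigned form before comparing. The function name is hypothetical, and the SED FCs are assumed to have been extracted into a list already.

```python
from bisect import bisect_right

FC_NONE = 0xFFFFFFFF  # the same four bytes read as a signed integer give -1

def sepx_fc_for_cp(cp, sed_cps, sed_fcs):
    """Return the FC of the SEPX for the section containing `cp`.

    Returns None when the section uses the standard SEP unchanged.
    """
    i = bisect_right(sed_cps, cp) - 1     # largest CP entry <= cp
    if i < 0 or i >= len(sed_fcs):
        raise ValueError("CP lies outside every section")
    fc = sed_fcs[i] & 0xFFFFFFFF          # normalise signed -1 to 0xFFFFFFFF
    return None if fc == FC_NONE else fc
```

When the result is None, a copy of the standard SEP is the section's SEP; otherwise the SEPX at the returned FC is read and its sprms are applied to that copy.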
DOP (DOcument Properties)
The data structure describing properties that apply to the document as a whole.
sub-document
A separate logical stream of text with properties for which correspondences with the main document text are maintained. Word's headers/footers, footnotes, endnotes, macro procedure text, annotation text, and text within textboxes are kept in separate subdocuments. Each subdocument has its own CP coordinate space. In other words, data structures are stored in Word files that are components of these subdocuments. These data structures contain CP coordinates whose 0 point is the beginning of the subdocument text stream instead of the beginning of the main document text stream.
In full-saved documents, a simple calculation with values stored in the FIB produces the file offset of the beginning of the subdocument text streams (if they exist). The length of these streams is also stored.
In fast-saved documents, the piece tables of subdocuments are concatenated to the end of the main document piece table. In this case, to identify the beginning of subdocument text, you must sum the length of the main document text stream with the lengths of any subdocument text streams stored ahead of the subdocument (information stored in the FIB) and treat this sum as a CP coordinate. To retrieve the text of the subdocument, you must do lookups in the piece table, starting with the piece that contains the beginning CP coordinate, to find the physical location of each piece of the subdocument text stream.
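The summation for fast-saved documents can be sketched in Python. The stream names and lengths below are illustrative placeholders; the actual lengths and their order are taken from the FIB, and `subdocument_cp_ranges` is a hypothetical helper.

```python
def subdocument_cp_ranges(lengths):
    """Map each text stream name to its (start, limit) range in piece-table CPs.

    lengths: ordered (name, character count) pairs, main document first,
    in the order in which the text streams are stored.
    """
    ranges, start = {}, 0
    for name, count in lengths:
        ranges[name] = (start, start + count)
        start += count   # the next stream begins where this one ends
    return ranges

# Illustrative values only: a 100-character main document followed by
# 20 characters of footnote text and 30 characters of header text.
ranges = subdocument_cp_ranges([("main", 100), ("footnotes", 20), ("headers", 30)])
```

A CP local to a subdocument is converted to a piece-table CP by adding the start of that subdocument's range.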
field
A field is a two-part structure that may be recorded in the CP stream of a document. The first part of the structure contains field codes which instruct Windows Word to insert text into the second part of the structure, the field result. Fields in Windows Word are used to insert text from an external file or to quote another part of a document, to mark index and table of contents entries and produce indexes and tables of contents, to maintain DDE links to other programs, and to produce dates, times, page numbers, sequence numbers, etc. There are 91 different field types.
A field begin mark delimits the beginning of a field and precedes any of the field codes stored in the field. The end of the field codes and the beginning of the field result is marked with the field separator and the field result and the field itself are terminated by a field end mark.
The CP locations of the field begin mark, field separator, and field end mark are recorded in plcfld data structures that are maintained for the main document and all of the subdocuments of the main document whenever a field is inserted or edited. A field can be dead, in which case it has no field separator, no field result, and no entry in the plcfld. (See the definition of the FLD structure for a list of possible dead field code strings.) An array of two-byte FLD structures is stored in the plcfld in one-to-one correspondence with the CP entries recorded. An FLD associated with a field begin mark records the type of the field. An FLD associated with the field end mark records the current status of the field (i.e. whether the result is dirty or has been edited, whether the result has been locked, etc.)
Fields may be nested. 20 levels of nesting are permitted.
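The nesting of begin, separator, and end marks can be sketched in Python with a stack. This is only an illustration: it assumes the three marks are the characters with values 19, 20, and 21 (taken from the list of special characters given later in the format description), it ignores the fSpec property that distinguishes a real mark from an ordinary character, and it scans the text directly rather than reading the plcfld.

```python
FIELD_BEGIN, FIELD_SEP, FIELD_END = "\x13", "\x14", "\x15"
MAX_NESTING = 20  # nesting limit stated in the text

def parse_fields(text):
    """Return (begin CP, separator CP or None, end CP, depth) per field.

    Fields are listed in the order in which their end marks occur.
    """
    stack, fields = [], []
    for cp, ch in enumerate(text):
        if ch == FIELD_BEGIN:
            if len(stack) == MAX_NESTING:
                raise ValueError("fields nested more than 20 levels deep")
            stack.append([cp, None])
        elif ch == FIELD_SEP:
            if not stack:
                raise ValueError("field separator outside any field")
            stack[-1][1] = cp             # separator of the innermost open field
        elif ch == FIELD_END:
            if not stack:
                raise ValueError("field end mark without a begin mark")
            begin, sep = stack.pop()
            fields.append((begin, sep, cp, len(stack) + 1))
    if stack:
        raise ValueError("unterminated field")
    return fields
```

The field codes lie between the begin mark and the separator, and the field result lies between the separator and the end mark.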
bookmark
A bookmark associates a user definable name with a range of text within a document. A bookmark is frequently used as an operand in field code instructions within a field. In Windows Word a bookmark is represented by three parallel data structures, the sttbBkmk, the plcbkf and the plcbkl. The sttbBkmk is a string table which contains the name of each bookmark that is defined. The plcbkf records the beginning CP position of each bookmark. The plcbkl records the limit CP position that delimits the end of a bookmark. Since bookmarks may be nested within one another to any level, the BKF structure stored in the plcbkf consists of a single index which specifies which plcbkl entry marks the end of the bookmark. The BKL structure is not written to the file, and the plcbkl contains only CPs.
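Resolving each bookmark's range from the three structures can be sketched in Python. The inputs are assumed to be already decoded: the names from the sttbBkmk, the beginning CPs and BKF indexes from the plcbkf, and the limit CPs from the plcbkl. The function name is hypothetical.

```python
def bookmark_ranges(names, start_cps, bkf_indexes, limit_cps):
    """Return {name: (first CP, limit CP)} for every bookmark.

    bkf_indexes[i] is the index stored in the ith BKF; it selects the
    plcbkl entry that marks the end of the ith bookmark.
    """
    return {
        name: (start, limit_cps[ibkl])
        for name, start, ibkl in zip(names, start_cps, bkf_indexes)
    }
```

The indirection through the BKF index is what allows nesting: an outer bookmark that begins first can end last, so the order of its limit CP in the plcbkl need not match the order of its beginning CP in the plcbkf.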
picture
A picture is represented in the document text stream as a special character, an ASCII 1 whose CHP has the fSpec bit set to 1. The file location of the picture in the Word binary file is stored in the character's CHP in chp.fcPic. The fcPic is a byte offset into the data stream. Beginning at the position recorded in chp.fcPic, a header data structure, the PIC, will be stored. If the picture is a reference to a TIFF file, a Picture file or an Office shape file, the name of the file will be recorded immediately following the PIC in a Pascal style string. If the picture is an Office shape, a Windows metafile or a bitmap, the shape, metafile or bitmap will immediately follow the PIC. Pictures that are a reference to an Office shape file will include both the filename and the shape in that order. Pictures inserted with Word97 are in the new Office shape format (documented elsewhere). However, pictures can be copied from older files into newer ones and their old format will persist until the picture is edited or displayed.
Some files (including all files created by Word for the Macintosh) may store Macintosh PICT pictures as well. In this case, the PIC structure is immediately followed by a standard Windows metafile depicting a large "x", so that older readers expecting only a metafile after the PIC will just display this "x". If a reader detects this standard "x" metafile, it can extract the sizes of the standard "x" metafile and the Macintosh PICT picture that follows it from an early portion of this "x" metafile. Please see Appendix B for a discussion of this technique.
embedded object
The native data for Embedded objects (OBJs) is stored similarly to pictures (PICs). To locate the native data for Embedded objects, scan the plc of field codes for the mother, header, footnote and annotation, textbox and header textbox documents (fib.PlcffldMom/Hdr/Ftn/Atn/Txbx/HdrTxbx). For each separator field, get the chp.
If chp.fSpec=1 and chp.fObj=1, then this separator field has an associated embedded object. The file location of the object data is stored in chp.fcObj. At the specified location an object header is stored followed by the native data for the object. See the _OBJHEADER structure.
If chp.fOle2=1, then this separator field has an associated OLE2 object. The fcPic will be a unique integer that specifies the name of the object's sub-storage instead of an offset into the data stream.
office art object
An office art object is represented in the document stream as a special character, an ASCII 8, which has chp.fSpec set to 1 for the run of text containing the character. Only main documents and header documents contain office art objects. The native data for the office art object may be obtained by taking the CP for the special character and using this to find the corresponding entry in the plcspa. An entry in this plc consists of an FSPA structure, which is described elsewhere in this document.
Office art objects can have text attached to them. Text for the textboxes is stored separately in the textbox subdocument of the main or header document. The textbox subdocument contains a plctxbxs where the text from CP n to CP n+1 in the subdocument is the text which is contained in a textbox as specified in the TXBXS structure for this nth entry in the plctxbxs. Textboxes can be linked in chains of up to 32 textboxes. Ordering of textboxes in the subdocument is completely unrelated to the document structure due to the nature of textbox linking. To find the text for a given office art object, the TXID property (a long: high word is itxbxs+1, low word is the sequence number) must be fetched from the office art data for the shape. This contains an index (itxbxs) into plctxbxs and a sequence number in the chain of linked textboxes. The text for the entire chain of linked textboxes is stored from the CP itxbxs to CP itxbxs+1 of plctxbxs. The plctxbxBkd describes the "page table" within textbox stories (where the textboxes in each linked textbox chain are thought of as "pages"). So, for each entry in the plctxbxs there is a corresponding entry in the plctxbxBkd at the same CP, and there may be additional entries in the plctxbxBkd to describe the breaks from one textbox to the next in linked textbox chains.
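Splitting a TXID and fetching the text of a textbox chain can be sketched in Python. The function names are hypothetical, the plctxbxs CPs are assumed to have been extracted into a list, and the subdocument text is assumed to be available as a string indexed by CP.

```python
def split_txid(txid: int):
    """Split a TXID into (itxbxs, sequence number in the linked chain)."""
    high, low = (txid >> 16) & 0xFFFF, txid & 0xFFFF
    if high == 0:
        raise ValueError("the high word holds itxbxs+1, so it cannot be 0")
    return high - 1, low

def textbox_chain_text(subdoc_text: str, plctxbxs_cps, itxbxs: int) -> str:
    """Return the text of the whole chain stored for entry `itxbxs`."""
    return subdoc_text[plctxbxs_cps[itxbxs]: plctxbxs_cps[itxbxs + 1]]
```

The sequence number then selects one textbox within the chain; the breaks between the textboxes of a chain are described by the plctxbxBkd.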
Note
In this document, bit 0 is the low-order bit. Structures are described as they would be declared in C for the Intel architecture. When numbering bytes in a word from low offset towards high offset, two-byte integers will have their least significant eight bits stored in byte 0 and most significant eight bits in byte 1. If bit 31 is the most significant bit in a four-byte integer, bits 31 through 24 will be stored in byte 3 of a four-byte integer, bits 23 through 16 will be stored in byte 2, bits 15 through 8 will be stored in byte 1, and bits 7 through 0 will be stored in byte 0.
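The byte and bit numbering described in this note can be checked with a short Python sketch that uses the standard struct module. The helper names are hypothetical.

```python
import struct

def read_u16(data: bytes, offset: int = 0) -> int:
    """Read a two-byte integer whose least significant byte comes first."""
    return struct.unpack_from("<H", data, offset)[0]

def read_u32(data: bytes, offset: int = 0) -> int:
    """Read a four-byte integer whose least significant byte comes first."""
    return struct.unpack_from("<I", data, offset)[0]

def bit(value: int, n: int) -> int:
    """Return bit n of value, where bit 0 is the low-order bit."""
    return (value >> n) & 1

# Byte 0 holds bits 7 through 0, byte 3 holds bits 31 through 24.
raw = bytes([0x78, 0x56, 0x34, 0x12])
value = read_u32(raw)   # 0x12345678
```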