Text canonicalization

[This is preliminary documentation and subject to change.]

Generally, the text returned by a call to GetText should exactly match the text of the document. To maximize interoperability, however, some canonicalization of common features is desirable. These features are paragraph breaks, line breaks, hyphens, and spaces.

The four flags controlling canonicalization of the output text are defined as follows:

IFILTER_INIT_CANON_PARAGRAPHS: Paragraph breaks should be marked with the Unicode PARAGRAPH SEPARATOR (0x2029).
IFILTER_INIT_HARD_LINE_BREAKS: Soft line breaks (such as end of line in Microsoft® Word) should be replaced by hard line breaks, LINE SEPARATOR (0x2028). Existing hard line breaks may be doubled. Any of carriage return (0x000D), line feed (0x000A), or the carriage return and line feed combination should be considered a hard line break. The intent is to enable pattern-expression matchers that match against the observed line breaks.
IFILTER_INIT_CANON_HYPHENS: Various word processors have forms of hyphens that are not represented in the host character set, such as optional hyphens (appearing only at end of line) and non-breaking hyphens. This flag indicates that optional hyphens are to be nulled out, and that non-breaking hyphens are to be converted to either a plain HYPHEN (0x2010) or a HYPHEN-MINUS (0x002D).
IFILTER_INIT_CANON_SPACES: Just as the previous flag canonicalizes hyphens, this one canonicalizes spaces. All special space characters, such as non-breaking spaces, are to be converted to the standard SPACE character (0x0020).
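
The hyphen and space rules above can be illustrated with a minimal C++ sketch. It is not taken from any IFilter implementation. The CanonFlags enumeration and the CanonicalizeText function are hypothetical names, and the enumeration values are local stand-ins rather than the real IFILTER_INIT_* constants. "Nulled out" is interpreted here as replacement with 0x0000, which is an assumption, and the set of characters treated as special spaces is one plausible choice. Paragraph and line-break handling depends on the source document format and is omitted.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the IFILTER_INIT_* flags. A real filter would
// use the constants from the platform header instead of these values.
enum CanonFlags : std::uint32_t {
    kCanonParagraphs = 1u << 0,  // not handled in this sketch
    kHardLineBreaks  = 1u << 1,  // not handled in this sketch
    kCanonHyphens    = 1u << 2,
    kCanonSpaces     = 1u << 3,
};

constexpr char16_t kNull              = 0x0000;
constexpr char16_t kSpace             = 0x0020;
constexpr char16_t kHyphenMinus       = 0x002D;
constexpr char16_t kNoBreakSpace      = 0x00A0;
constexpr char16_t kSoftHyphen        = 0x00AD;  // optional hyphen
constexpr char16_t kEnQuad            = 0x2000;  // first of the fixed-width spaces
constexpr char16_t kHairSpace         = 0x200A;  // last of the fixed-width spaces
constexpr char16_t kNonBreakingHyphen = 0x2011;
constexpr char16_t kNarrowNoBreakSp   = 0x202F;
constexpr char16_t kIdeographicSpace  = 0x3000;

// Applies hyphen and space canonicalization to already-extracted text.
// Every input character maps to exactly one output character, so character
// offsets are unchanged.
std::u16string CanonicalizeText(const std::u16string& in, std::uint32_t flags) {
    std::u16string out;
    out.reserve(in.size());
    for (char16_t ch : in) {
        if (flags & kCanonHyphens) {
            if (ch == kSoftHyphen) {
                ch = kNull;          // assumption: "nulled out" means 0x0000
            } else if (ch == kNonBreakingHyphen) {
                ch = kHyphenMinus;   // 0x2010 would be equally valid per the text
            }
        }
        if (flags & kCanonSpaces) {
            const bool special_space =
                ch == kNoBreakSpace || ch == kNarrowNoBreakSp ||
                ch == kIdeographicSpace || (ch >= kEnQuad && ch <= kHairSpace);
            if (special_space) {
                ch = kSpace;
            }
        }
        out.push_back(ch);
    }
    return out;
}
```

Calling CanonicalizeText(text, kCanonHyphens | kCanonSpaces) returns a string of the same length as the input, while passing 0 for the flags returns the text unchanged.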

IFilter servers may also embed null characters in the text, which clients typically ignore. The Unicode character 0x0000 is ignored completely, and 0x0001 is treated as a word break.
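
On the client side, the handling of these two characters might look like the following C++ sketch. SplitWords is a hypothetical helper, not part of any IFilter interface, and treating SPACE (0x0020) as the only other word break is a simplification made for the example.

```cpp
#include <string>
#include <vector>

// Splits filtered text into words. 0x0000 is skipped entirely, so it never
// interrupts a word; 0x0001 ends the current word, as does a plain space
// (the latter is a simplification for this sketch).
std::vector<std::u16string> SplitWords(const std::u16string& text) {
    std::vector<std::u16string> words;
    std::u16string current;
    for (char16_t ch : text) {
        if (ch == 0x0000) {
            continue;  // ignored completely
        }
        if (ch == 0x0001 || ch == u' ') {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
            continue;  // word break
        }
        current.push_back(ch);
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

With this sketch, a 0x0000 embedded between "foo" and "bar" yields the single word "foobar", whereas a 0x0001 in the same position yields two words.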

The intent is to provide implementors of IFilter an efficient means to 'remove' embedded formatting from text without modifying positional information. A scrap of HTML such as:

<p>This is a paragraph with <em>emphasized</em> text.</p>

could be filtered as:

***This is a paragraph with ****emphasized***** text.****

where each '*' represents one Unicode 0x0000 character.
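
The replacement described above can be sketched in C++ as follows. BlankTags is a hypothetical helper written for illustration. It uses a naive scan that treats everything from '<' to the next '>' as markup, so it does not handle '>' inside attribute values, comments, or script content.

```cpp
#include <string>

// Overwrites every character of each <...> tag with 0x0000. The result has
// the same length as the input, so the offsets of the remaining text are
// preserved.
std::u16string BlankTags(const std::u16string& html) {
    std::u16string out = html;
    bool in_tag = false;
    for (char16_t& ch : out) {
        if (ch == u'<') {
            in_tag = true;
        }
        const bool closes_tag = in_tag && ch == u'>';
        if (in_tag) {
            ch = char16_t(0);  // one null per markup character
        }
        if (closes_tag) {
            in_tag = false;
        }
    }
    return out;
}
```

Because each markup character becomes exactly one 0x0000, a word such as "emphasized" stays at the same character offset in the filtered text as in the original HTML.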