Building Tables for Increased Efficiency

In most languages, words consist of characters separated by white space and punctuation. Rather than calling the system each time it analyzes a letter in the data stream, the search engine builds several tables at startup. These tables indicate which characters can be interpreted by the user's operating environment and whether these characters are letters, numbers, or punctuation. (To accommodate the word-breaking rules of more complex languages, the search engine supports external word-breaking DLLs. See the section titled "Wordwrap Functions" in the Microsoft Win32 Programmer's Reference, Volume 1, for information on creating a word-breaking DLL.)

To build these tables, the engine calls GetStringType and LCMapString, each of which can return information for an entire string of characters in a single call. To work around the limitations of Windows 95 and Win32s, which do not support the wide-character NLSAPI functions, the code calls several additional functions. First it calls EnumSystemCodePages to determine which code-page conversion tables are installed on the user's system. (Remember that the engine does internal processing in Unicode.) After the engine has determined which code pages the user's system supports, it marks the characters contained in those code pages as "valid" for the user's system by adding character type information to the corresponding table entries.
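The "mark what the installed code pages can represent" step can be sketched in portable Python. This is only an illustration under stated assumptions, not the WinHelp source: Python's codecs stand in for the system's code-page conversion tables, the list INSTALLED_CODE_PAGES is a hypothetical substitute for what EnumSystemCodePages would report, and the table stores a simple 1/0 flag where the engine stores character type information.

```python
# Hypothetical stand-in for the code pages EnumSystemCodePages would report.
INSTALLED_CODE_PAGES = ["cp1252", "cp1251"]

def mark_valid_characters(code_pages):
    """Return a 64K table holding 1 for every Unicode character that
    some single-byte code page in `code_pages` can represent."""
    valid = bytearray(0x10000)          # one entry per 16-bit code unit
    for name in code_pages:
        for byte in range(256):
            try:
                ch = bytes([byte]).decode(name)
            except UnicodeDecodeError:
                continue                # byte is unassigned in this code page
            valid[ord(ch)] = 1
    return valid

valid = mark_valid_characters(INSTALLED_CODE_PAGES)
```

After this runs once at startup, checking whether a character is usable on the system is a single index operation such as valid[ord(ch)], with no further conversion calls.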

If the user is running WinHelp on a DBCS system (Chinese, Japanese, or Korean Windows 95, for example), the search engine also creates a table of legal 2-byte characters. It calls GetCPInfo to retrieve the MaxCharSize for each installed code page. If the MaxCharSize for a code page is 2, that code page is DBCS. Because GetCPInfo also returns the lead-byte ranges, the search engine can quickly iterate through all possible lead-byte and trail-byte combinations, calling MultiByteToWideChar with the flag MB_ERR_INVALID_CHAR to determine whether each combination is a valid character.
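The lead-byte and trail-byte scan might look like the following sketch. It assumes Python's cp932 codec as a stand-in for the system's code page 932 conversion table: a strict decode that raises an error plays the role of MultiByteToWideChar failing under MB_ERR_INVALID_CHAR, and the lead-byte ranges are written out by hand where the engine would read them from GetCPInfo. The function name build_dbcs_table is hypothetical.

```python
# Lead-byte ranges for code page 932 (Shift-JIS) as (first, last) pairs,
# the same information GetCPInfo reports for a DBCS code page.
CP932_LEAD_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]

def build_dbcs_table(codec, lead_ranges):
    """Map every valid (lead, trail) byte pair to its Unicode character."""
    table = {}
    for first, last in lead_ranges:
        for lead in range(first, last + 1):
            for trail in range(256):
                try:
                    # Strict decoding rejects invalid pairs, much as
                    # MB_ERR_INVALID_CHAR makes the conversion fail.
                    ch = bytes([lead, trail]).decode(codec, errors="strict")
                except UnicodeDecodeError:
                    continue
                if len(ch) == 1:
                    table[(lead, trail)] = ch
    return table

dbcs_table = build_dbcs_table("cp932", CP932_LEAD_BYTE_RANGES)
```

Restricting the outer loop to the reported lead-byte ranges keeps the scan to a few thousand candidate pairs rather than all 65,536 two-byte values.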

With this series of API calls, the search engine has constructed several tables that indicate which characters can be interpreted by the user's system. Next the engine calls GetStringType on these characters to determine their properties. Win32 separates character properties into three categories, which are listed in Figure 5-10 below with the constant values that GetStringType returns.

Category	Constant	Description
CTYPE 1
	C1_UPPER	Uppercase
	C1_LOWER	Lowercase
	C1_DIGIT	Decimal digits
	C1_SPACE	Space characters
	C1_PUNCT	Punctuation
	C1_CNTRL	Control characters
	C1_BLANK	Blank characters
	C1_XDIGIT	Hex digits
	C1_ALPHA	Any linguistic character (alphabetic, syllabary, and ideographic)

CTYPE 2
Strong Directionality	C2_LEFTTORIGHT	Left to right
	C2_RIGHTTOLEFT	Right to left

Weak Directionality	C2_EUROPENUMBER	European number, European digit
	C2_EUROPESEPARATOR	European numeric separator
	C2_EUROPETERMINATOR	European numeric terminator
	C2_ARABICNUMBER	Arabic number
	C2_COMMONSEPARATOR	Common numeric separator

Neutral	C2_BLOCKSEPARATOR	Block separator
	C2_SEGMENTSEPARATOR	Segment separator
	C2_WHITESPACE	White space
	C2_OTHERNEUTRAL	Other neutrals

No Directionality	C2_NOTAPPLICABLE	No implicit directionality (for example, control codes)

CTYPE 3
	C3_NONSPACING	Nonspacing mark
	C3_DIACRITIC	Nonspacing diacritic
	C3_VOWELMARK	Nonspacing vowel mark
	C3_SYMBOL	Symbol
	C3_KATAKANA	Katakana character
	C3_HIRAGANA	Hiragana character
	C3_HALFWIDTH	Half-width character
	C3_FULLWIDTH	Full-width character
	C3_IDEOGRAPH	Ideographic character
	C3_KASHIDA	Arabic kashida character
	C3_LEXICAL	Punctuation that either is embedded in a word or appears at the end of a word and is still considered part of the word; items include the apostrophe, the kashida, the hyphen, feminine/masculine ordinal indicators, the equal sign (used as a hyphen in parts of Europe), and so on
	C3_ALPHA	All linguistic characters (alphabetic, syllabary, and ideographic)
	C3_NOTAPPLICABLE	Not applicable

Figure 5-10 CTYPE categories for the function GetStringType. The values that GetStringType returns are based on Unicode and remain constant, regardless of the default system locale or code page.
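The CTYPE 1 constants are bit flags, so a single returned value can carry several properties at once. The sketch below shows one way to test such a value. The numeric values are those defined for the C1_ constants in the Win32 headers; the helper names describe_ctype1 and is_word_character are hypothetical, and the choice of which flags count as "word" characters is an assumption made for illustration, not WinHelp's documented rule.

```python
# CTYPE 1 bit values as defined in the Win32 headers.
C1_UPPER, C1_LOWER, C1_DIGIT = 0x0001, 0x0002, 0x0004
C1_SPACE, C1_PUNCT, C1_CNTRL = 0x0008, 0x0010, 0x0020
C1_BLANK, C1_XDIGIT, C1_ALPHA = 0x0040, 0x0080, 0x0100

CTYPE1_NAMES = [
    ("C1_UPPER", C1_UPPER), ("C1_LOWER", C1_LOWER), ("C1_DIGIT", C1_DIGIT),
    ("C1_SPACE", C1_SPACE), ("C1_PUNCT", C1_PUNCT), ("C1_CNTRL", C1_CNTRL),
    ("C1_BLANK", C1_BLANK), ("C1_XDIGIT", C1_XDIGIT), ("C1_ALPHA", C1_ALPHA),
]

def describe_ctype1(flags):
    """Return the names of all CTYPE 1 bits set in `flags`."""
    return [name for name, bit in CTYPE1_NAMES if flags & bit]

def is_word_character(flags):
    """Assumed policy for this sketch: letters and digits form words."""
    return bool(flags & (C1_ALPHA | C1_DIGIT))

# A value with three bits set, as one might expect for an uppercase
# letter that is also a hexadecimal digit.
sample = C1_UPPER | C1_XDIGIT | C1_ALPHA
names = describe_ctype1(sample)
```

Because the flags are independent bits, a table entry can be tested against any combination with one bitwise AND, which is what makes the prebuilt tables cheap to consult during parsing.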

The WinHelp engine parses the text stream of each document character by character using these tables. It throws out white space and creates a list of words and punctuation sequences, which it then sorts.
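A minimal sketch of this table-driven parse follows. It is an approximation rather than the engine's actual algorithm: Python's str.isspace and str.isalnum stand in for the GetStringType properties, the three class codes and the names CLASS_TABLE and tokenize are hypothetical, and the sort is a plain code-point sort rather than a locale-aware one.

```python
def char_class(ch):
    """Stand-in classifier: 's' for white space, 'w' for word
    characters, 'p' for punctuation and everything else."""
    if ch.isspace():
        return "s"
    if ch.isalnum():
        return "w"
    return "p"

# Built once at startup; the parser below only performs table lookups.
CLASS_TABLE = [char_class(chr(cp)) for cp in range(0x10000)]

def tokenize(text):
    """Split text into runs of word characters and runs of punctuation,
    discard white space, and return the tokens sorted."""
    tokens = []
    current, current_class = [], None
    for ch in text:
        cp = ord(ch)
        cls = CLASS_TABLE[cp] if cp < 0x10000 else "p"
        if cls != current_class:
            # A class change ends the current run; white-space runs are dropped.
            if current and current_class != "s":
                tokens.append("".join(current))
            current, current_class = [], cls
        current.append(ch)
    if current and current_class != "s":
        tokens.append("".join(current))
    return sorted(tokens)

result = tokenize("Hello, world... hello!")
```

Here result is the sorted list of the six tokens in the input: three words and three punctuation sequences, with the spaces discarded.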