In most languages, words consist of characters separated by white space and punctuation. Rather than calling the system each time it analyzes a letter in the data stream, the search engine builds several tables at startup. These tables indicate which characters can be interpreted by the user's operating environment and whether these characters are letters, numbers, or punctuation. (To accommodate the word-breaking rules of more complex languages, the search engine supports external word-breaking DLLs. See the section titled "Wordwrap Functions" in the Microsoft Win32 Programmer's Reference, Volume 1, for information on creating a word-breaking DLL.)
To build these tables, the engine calls GetStringType and LCMapString, which can return information for an entire string of characters. To work around the limitations of Windows 95 or Win32s, which do not support the wide-character NLSAPI functions, the code calls several additional functions. First it calls EnumSystemCodePages to determine which code-page conversion tables are installed in the user's system. (Remember that the engine does internal processing in Unicode.) After the engine has determined which code pages the user's system supports, it marks the characters contained in those code pages as "valid" for the user's system by adding character type information in the corresponding table entries.
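The marking step can be sketched in a few lines. This is a portable illustration only, not the engine's code: Python codec names stand in for the code pages that EnumSystemCodePages would report, and the list `INSTALLED_CODE_PAGES` and the function `build_valid_table` are hypothetical names introduced here.

```python
# Assumption: these codec names stand in for the code pages that
# EnumSystemCodePages would report as installed on the user's system.
INSTALLED_CODE_PAGES = ["cp1252", "cp1251"]

def build_valid_table(code_pages):
    """Collect the Unicode code points reachable from single-byte code pages."""
    valid = set()
    for name in code_pages:
        for byte in range(256):
            try:
                ch = bytes([byte]).decode(name)  # strict decoding by default
            except UnicodeDecodeError:
                continue  # this byte is unassigned in the code page
            valid.add(ord(ch))
    return valid

valid = build_valid_table(INSTALLED_CODE_PAGES)
```

A character such as U+00E9 (from cp1252) or U+0416 (from cp1251) ends up in `valid`, while a kana character does not, because neither stand-in code page contains it.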
If the user is running WinHelp on a DBCS system (Chinese, Japanese, or Korean Windows 95, for example), the search engine also creates a table of legal 2-byte characters. It calls GetCPInfo to retrieve the MaxCharSize for each installed code page. If the MaxCharSize for a code page is 2, that code page is DBCS. Because GetCPInfo also returns the lead-byte ranges, the search engine can quickly iterate through all possible lead-byte and trail-byte combinations, calling MultiByteToWideChar with the flag MB_ERR_INVALID_CHARS to determine whether each combination is a valid character.
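The lead-byte and trail-byte probe might look roughly like the following. It is a sketch under stated assumptions: Python's strict `cp932` decoder stands in for MultiByteToWideChar with MB_ERR_INVALID_CHARS, the lead-byte ranges are written out by hand rather than read from GetCPInfo, and `valid_double_byte_chars` is a hypothetical helper name.

```python
def valid_double_byte_chars(codec, lead_ranges):
    """Map every valid (lead, trail) byte pair to its Unicode character."""
    table = {}
    for low, high in lead_ranges:
        for lead in range(low, high + 1):
            for trail in range(0x100):
                pair = bytes([lead, trail])
                try:
                    text = pair.decode(codec)  # strict: invalid pairs raise
                except UnicodeDecodeError:
                    continue
                if len(text) == 1:  # keep only pairs that form one character
                    table[pair] = text
    return table

# Assumption: lead-byte ranges for code page 932 (Japanese), entered by hand.
CP932_LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]
dbcs_table = valid_double_byte_chars("cp932", CP932_LEAD_RANGES)
```

Because the lead-byte ranges bound the outer loops, the probe touches only a few tens of thousands of byte pairs, which is cheap enough to do once at startup.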
With this series of API calls, the search engine has constructed several tables that indicate which characters can be interpreted by the user's system. Next the engine calls GetStringType on these characters to determine their properties. Win32 separates character properties into three categories, which are listed in Figure 5-10 below with the constant values that GetStringType returns.
Category | Constant | Description |
CTYPE 1 | | |
| C1_UPPER | Uppercase |
| C1_LOWER | Lowercase |
| C1_DIGIT | Decimal digits |
| C1_SPACE | Space characters |
| C1_PUNCT | Punctuation |
| C1_CNTRL | Control characters |
| C1_BLANK | Blank characters |
| C1_XDIGIT | Hex digits |
| C1_ALPHA | Any linguistic character (alphabetic, syllabary, and ideographic) |
CTYPE 2 | | |
Strong Directionality | C2_LEFTTORIGHT | Left to right |
| C2_RIGHTTOLEFT | Right to left |
Weak Directionality | C2_EUROPENUMBER | European number, European digit |
| C2_EUROPESEPARATOR | European numeric separator |
| C2_EUROPETERMINATOR | European numeric terminator |
| C2_ARABICNUMBER | Arabic number |
| C2_COMMONSEPARATOR | Common numeric separator |
Neutral | C2_BLOCKSEPARATOR | Block separator |
| C2_SEGMENTSEPARATOR | Segment separator |
| C2_WHITESPACE | White space |
| C2_OTHERNEUTRAL | Other neutrals |
No Directionality | C2_NOTAPPLICABLE | No implicit directionality (for example, control codes) |
CTYPE 3 | | |
| C3_NONSPACING | Nonspacing mark |
| C3_DIACRITIC | Nonspacing diacritic |
| C3_VOWELMARK | Nonspacing vowel mark |
| C3_SYMBOL | Symbol |
| C3_KATAKANA | Katakana character |
| C3_HIRAGANA | Hiragana character |
| C3_HALFWIDTH | Half-width character |
| C3_FULLWIDTH | Full-width character |
| C3_IDEOGRAPH | Ideographic character |
| C3_KASHIDA | Arabic kashida character |
| C3_LEXICAL | Punctuation that either is embedded in a word or appears at the end of a word and is still considered part of the word; items include the apostrophe, the kashida, the hyphen, feminine/masculine ordinal indicators, the equal sign (used as a hyphen in parts of Europe), and so on |
| C3_ALPHA | All linguistic characters (alphabetic, syllabary, and ideographic) |
| C3_NOTAPPLICABLE | Not applicable |
Figure 5-10 CTYPE categories for the function GetStringType. The values that GetStringType returns are based on Unicode and remain constant, regardless of the default system locale or code page.
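To make the CTYPE 1 flags concrete, the sketch below builds a small lookup table of bit flags once, so that later parsing needs only a table lookup per character. It is an approximation for illustration: the flag values match the Win32 C1_* constants, but the classification comes from Python's `unicodedata` module rather than from GetStringType, so results can differ for some characters. The names `approx_ctype1` and `build_ctype_table` are hypothetical.

```python
import unicodedata

# A subset of the Win32 CTYPE 1 bit flags.
C1_UPPER, C1_LOWER, C1_DIGIT, C1_SPACE = 0x0001, 0x0002, 0x0004, 0x0008
C1_PUNCT, C1_ALPHA = 0x0010, 0x0100

def approx_ctype1(ch):
    """Approximate CTYPE 1 flags from the Unicode general category."""
    category = unicodedata.category(ch)
    flags = 0
    if category == "Lu":
        flags |= C1_UPPER
    if category == "Ll":
        flags |= C1_LOWER
    if category == "Nd":
        flags |= C1_DIGIT
    if ch.isspace():
        flags |= C1_SPACE
    if category.startswith("P"):
        flags |= C1_PUNCT
    if category.startswith("L"):
        flags |= C1_ALPHA
    return flags

def build_ctype_table(code_points):
    """Classify each code point once so later lookups avoid repeated calls."""
    return {cp: approx_ctype1(chr(cp)) for cp in code_points}

# Assumption: a small range for illustration; the engine would cover every
# character it had marked as valid for the user's system.
ctype_table = build_ctype_table(range(0x20, 0x250))
```

Flags combine with bitwise OR, so an uppercase letter carries both C1_UPPER and C1_ALPHA, and a test such as `ctype_table[cp] & C1_PUNCT` picks out punctuation.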
The WinHelp engine parses the text stream of each document character by character using these tables. It throws out white space and creates a list of words and punctuation sequences, which it then sorts.
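The parsing pass described above can be approximated as follows. This is a minimal sketch, not the WinHelp implementation: it classifies characters with Python's `unicodedata` module instead of the engine's prebuilt tables, it sorts by code point because the text does not specify a sort order, and `tokenize` is a hypothetical name.

```python
import unicodedata

def tokenize(text):
    """Split text into words and punctuation runs, drop white space, and sort."""
    tokens = []
    current, current_kind = "", None
    for ch in text:
        if ch.isspace():
            kind = None  # white space ends a token and is discarded
        elif unicodedata.category(ch).startswith("P"):
            kind = "punct"
        else:
            kind = "word"
        # A change of character class closes the token being built.
        if kind != current_kind and current:
            tokens.append(current)
            current = ""
        if kind is not None:
            current += ch
        current_kind = kind
    if current:
        tokens.append(current)
    return sorted(tokens)  # assumption: plain code-point ordering

result = tokenize("Hello, world... hello")
# result == [",", "...", "Hello", "hello", "world"]
```

Adjacent punctuation characters are kept together as one sequence, which is why `...` appears as a single entry in the result.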