Building Tables for Increased Efficiency

In most languages, words consist of characters separated by white space and punctuation. Rather than calling the system each time it analyzes a letter in the data stream, the search engine builds several tables at startup. These tables indicate which characters can be interpreted by the user's operating environment and whether these characters are letters, numbers, or punctuation. (To accommodate the word-breaking rules of more complex languages, the search engine supports external word-breaking DLLs. See the section titled "Wordwrap Functions" in the Microsoft Win32 Programmer's Reference, Volume 1, for information on creating a word-breaking DLL.)

To build these tables, the engine calls GetStringType and LCMapString, which can return information for an entire string of characters. To work around the limitations of Windows 95 or Win32s, which do not support the wide-character NLSAPI functions, the code calls several additional functions. First it calls EnumSystemCodePages to determine which code-page conversion tables are installed in the user's system. (Remember that the engine does internal processing in Unicode.) After the engine has determined which code pages the user's system supports, it marks the characters contained in those code pages as "valid" for the user's system by adding character type information in the corresponding table entries.
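The marking step can be sketched in portable form. The following Python fragment is an illustration only, not the engine's code: it uses Python's codecs in place of the Win32 conversion tables, and the list INSTALLED_CODE_PAGES and the function build_valid_table are hypothetical names standing in for whatever EnumSystemCodePages would report on a given system.

```python
# Hypothetical stand-in for the code pages that EnumSystemCodePages
# might report; a real system could report a different list.
INSTALLED_CODE_PAGES = ["cp1252", "cp1251"]


def build_valid_table(code_pages):
    """Return the set of Unicode code points reachable from the given
    single-byte code pages (an analog of marking characters "valid")."""
    valid = set()
    for cp in code_pages:
        for byte in range(256):
            try:
                ch = bytes([byte]).decode(cp)
            except UnicodeDecodeError:
                continue  # this byte is unassigned in the code page
            valid.add(ord(ch))
    return valid


VALID = build_valid_table(INSTALLED_CODE_PAGES)
```

With these two code pages, Latin letters such as é and Cyrillic letters such as Я end up in the table, while a Japanese character such as あ does not.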

If the user is running WinHelp on a DBCS system (Chinese, Japanese, or Korean Windows 95, for example), the search engine also creates a table of legal 2-byte characters. It calls GetCPInfo to retrieve the MaxCharSize for each installed code page. If the MaxCharSize for a code page is 2, that code page is DBCS. Because GetCPInfo also returns the lead-byte ranges, the search engine can quickly iterate through all possible lead-byte and trail-byte combinations, calling MultiByteToWideChar with the flag MB_ERR_INVALID_CHAR to determine whether each combination is a valid character.
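The lead-byte and trail-byte iteration might look like the following Python sketch. It is a portable analog under stated assumptions: strict decoding with Python's cp932 codec stands in for MultiByteToWideChar with MB_ERR_INVALID_CHAR, the lead-byte ranges for code page 932 are hard-coded rather than retrieved with GetCPInfo, and build_dbcs_table is a hypothetical name. Python's codec tables may not match the Windows tables for every byte pair.

```python
# Lead-byte ranges for code page 932 (Shift-JIS), hard-coded here in place
# of the LeadByte array that GetCPInfo would return.
LEAD_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]


def build_dbcs_table(codec, lead_ranges):
    """Map each valid (lead, trail) byte pair to its Unicode character."""
    table = {}
    for lo, hi in lead_ranges:
        for lead in range(lo, hi + 1):
            for trail in range(256):
                try:
                    # Strict decoding rejects invalid pairs, much as
                    # MB_ERR_INVALID_CHAR makes the conversion fail.
                    text = bytes([lead, trail]).decode(codec, errors="strict")
                except UnicodeDecodeError:
                    continue
                if len(text) == 1:  # keep only true 2-byte characters
                    table[(lead, trail)] = text
    return table


DBCS_TABLE = build_dbcs_table("cp932", LEAD_BYTE_RANGES)
```

Because the lead-byte ranges limit the outer loop, the sketch tests only a few tens of thousands of pairs rather than all 65,536 two-byte values.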

With this series of API calls, the search engine has constructed several tables that indicate which characters can be interpreted by the user's system. Next the engine calls GetStringType on these characters to determine their properties. Win32 separates character properties into three categories, which are listed in Figure 5-10 below with the constant values that GetStringType returns.

Category                 Constant               Description

CTYPE 1
                         C1_UPPER               Uppercase
                         C1_LOWER               Lowercase
                         C1_DIGIT               Decimal digits
                         C1_SPACE               Space characters
                         C1_PUNCT               Punctuation
                         C1_CNTRL               Control characters
                         C1_BLANK               Blank characters
                         C1_XDIGIT              Hex digits
                         C1_ALPHA               Any linguistic character (alphabetic, syllabary, and ideographic)

CTYPE 2
Strong directionality    C2_LEFTTORIGHT         Left to right
                         C2_RIGHTTOLEFT         Right to left

Weak directionality      C2_EUROPENUMBER        European number, European digit
                         C2_EUROPESEPARATOR     European numeric separator
                         C2_EUROPETERMINATOR    European numeric terminator
                         C2_ARABICNUMBER        Arabic number
                         C2_COMMONSEPARATOR     Common numeric separator

Neutral                  C2_BLOCKSEPARATOR      Block separator
                         C2_SEGMENTSEPARATOR    Segment separator
                         C2_WHITESPACE          White space
                         C2_OTHERNEUTRAL        Other neutrals

No directionality        C2_NOTAPPLICABLE       No implicit directionality (for example, control codes)

CTYPE 3
                         C3_NONSPACING          Nonspacing mark
                         C3_DIACRITIC           Nonspacing diacritic
                         C3_VOWELMARK           Nonspacing vowel mark
                         C3_SYMBOL              Symbol
                         C3_KATAKANA            Katakana character
                         C3_HIRAGANA            Hiragana character
                         C3_HALFWIDTH           Half-width character
                         C3_FULLWIDTH           Full-width character
                         C3_IDEOGRAPH           Ideographic character
                         C3_KASHIDA             Arabic kashida character
                         C3_LEXICAL             Punctuation that either is embedded in a word or appears at the end of a word and is still considered part of the word; items include the apostrophe, the kashida, the hyphen, feminine/masculine ordinal indicators, the equal sign (used as a hyphen in parts of Europe), and so on
                         C3_ALPHA               All linguistic characters (alphabetic, syllabary, and ideographic)
                         C3_NOTAPPLICABLE       Not applicable

Figure 5-10 CTYPE categories for the function GetStringType. The values that GetStringType returns are based on Unicode and remain constant, regardless of the default system locale or code page.
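To show how a table of CTYPE 1 properties could be stored and queried, here is a rough Python sketch. The bit values are the ones defined for the C1_ constants in the Win32 headers, but the classification itself is an assumption: it approximates the flags from Unicode general categories with Python's unicodedata module rather than calling GetStringType, so its results will differ from Windows for some characters (for example, it never sets C1_CNTRL, C1_BLANK, or C1_XDIGIT). The names approx_ctype1 and CTYPE1_TABLE are hypothetical.

```python
import unicodedata

# Bit values of a subset of the CTYPE 1 constants.
C1_UPPER, C1_LOWER, C1_DIGIT, C1_SPACE = 0x0001, 0x0002, 0x0004, 0x0008
C1_PUNCT, C1_ALPHA = 0x0010, 0x0100


def approx_ctype1(ch):
    """Approximate a CTYPE 1 bitmask from the Unicode general category."""
    cat = unicodedata.category(ch)
    flags = 0
    if cat == "Lu":
        flags |= C1_UPPER
    if cat == "Ll":
        flags |= C1_LOWER
    if cat == "Nd":
        flags |= C1_DIGIT
    if ch.isspace():
        flags |= C1_SPACE
    if cat.startswith("P"):
        flags |= C1_PUNCT
    if cat.startswith("L"):
        flags |= C1_ALPHA
    return flags


# Build the table once, then look characters up by code point.
CTYPE1_TABLE = {cp: approx_ctype1(chr(cp)) for cp in range(0x0250)}
```

A lookup such as CTYPE1_TABLE[ord("A")] & C1_ALPHA then replaces a per-character system call during parsing.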

The WinHelp engine parses the text stream of each document character by character using these tables. It throws out white space and creates a list of words and punctuation sequences, which it then sorts.
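The parsing pass can be illustrated with a minimal Python sketch. It is not the WinHelp implementation: it classifies characters with Python's built-in str methods instead of the engine's tables, sorts by code point because the text does not state the engine's sort order, and the function names tokenize and sorted_tokens are hypothetical.

```python
def tokenize(text):
    """Split text into words and punctuation sequences, dropping white space."""
    tokens = []
    current = ""
    current_kind = None  # "word", "punct", or None for white space
    for ch in text:
        if ch.isspace():
            kind = None
        elif ch.isalnum():
            kind = "word"
        else:
            kind = "punct"
        # A change of character class ends the token being built.
        if kind != current_kind and current:
            tokens.append(current)
            current = ""
        if kind is not None:
            current += ch
        current_kind = kind
    if current:
        tokens.append(current)
    return tokens


def sorted_tokens(text):
    """Return the token list sorted by code point."""
    return sorted(tokenize(text))
```

For the input "Hello, world... hello!", tokenize yields Hello, a comma, world, an ellipsis of three periods, hello, and an exclamation point, in that order.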