Using Sort Keys to Sort and Search for Data

Glossary

Sort key: A numeric representation of a sort element based on locale-specific sorting rules. A sort key consists of several weighted components that represent a character's script, diacritics, case, and so on.

Traditionally, list boxes are displayed in sorted order so that users can easily find individual items. International users, however, might have very different expectations of what constitutes a "sorted" list; not only does alphabetical order vary among languages, but conventions for sequencing items in dictionaries and phone books can also be quite different. In the French language, for example, words that are spelled identically (except for diacritics) are sorted based on a right-to-left comparison of characters rather than a left-to-right comparison. In Swedish, some vowels with diacritics sort after Z. Languages that include characters outside the Latin script have special sorting rules. For example, the Icelandic character ð sorts between D and E. Appendix D compares the sort order for an identical list of words in several European languages.

Win32 provides basic support for sorting strings with the function CompareString. The system supports at least one sort algorithm each for a variety of languages that include standard options such as case-insensitive comparison. Because CompareString's algorithm is complex, it is faster to sort long lists of words by retrieving sort keys with LCMapString, caching the sort keys, and then comparing them using lstrcmp(). The full-text search engine for Microsoft WinHelp uses this sort-key method.

The order in which the search engine displays words and phrases varies depending on the system locale because LCMapString takes a locale ID as a parameter. A sort key is a composite representation of a sort element. (See Figure 5-11.) Each component of the sort key has a different weight, depending on the sort rules associated with the locale. Elements are sorted based on script (that is, all Greek characters sort together, all Cyrillic characters sort together, and so on); alphanumeric and symbolic characters; diacritics; case; and other special rules, such as two characters sorting as a single character (for example, the Spanish CH). Currently, except for Far Eastern locales, Windows associates only one sort algorithm with each locale. The predefined constant SORT_DEFAULT refers to whichever sort algorithm associated with a given language ID is commonly preferred.

Figure 5-11 The Win32 sort key for the German word Öde, which means wasteland. The first three values in the sort key represent the Unicode sort weights for each character in the string. The next three values represent the diacritic weights, and the final three values represent the case weights. Separator values fall between sections. The sort key ends with a null terminator.

Figure 5-12 below contains the WinHelp search engine procedure that maps lexical tokens into sort keys. You'll notice that the code is designed to work on both Windows 95 and Windows NT. In addition, you'll see that the search engine customizes each sort key by prefixing it with an application-defined flag. The WinHelp designers defined these flags to ensure that words would sort first, followed by numbers and punctuation. (The system's default behavior is to sort punctuation first, followed by numbers and letters.)

///////////////////////////////////////////////////////////////////
// sort key generation from Unicode

BYTE* pbSource; // multibyte string buffer
int cbSource; // size of pbSource buffer

int FAR PASCAL LCSortKeyW(
LCID lcid, // locale ID
LPCWSTR pwSource, // address of source string
int cwSource, // number of characters in
// source string
LPWSTR pwDest, // address of destination buffer
int cwDest) // size of destination buffer
{
int cb; // number of bytes for multibyte text
int nRet; // size of sort key in words

if (g_os_version == OS_NT)
nRet = LCMapStringW(lcid, LCMAP_SORTKEY,
pwSource, cwSource,
pwDest+1, (cwDest-1)<<1) >> 1;

else
{
cwSource <<= 1; // Multiply by two. (One wide character
// can generate 2 bytes of
// multibyte data.)

if (cwSource > cbSource)
{
cbSource = 0;
if (pbSource) delete pbSource;
if (pbSource = new BYTE[cwSource]) cbSource = cwSource;
}

if (!pbSource) return 0; // error return

cb = WideCharToMultiByte(GetCPFromLocale(lcid), 0,
pwSource, cwSource>>1,
(PSTR)pbSource, cbSource,
NULL, NULL);
nRet = LCMapStringA(lcid, LCMAP_SORTKEY,
(PSTR)pbSource, cb,
(PSTR)(pwDest+1), (cwDest-1)<<1) >> 1;
}

if (nRet)
{
nRet++;

if (cwDest && pwDest) // Set a sort key's prefix so tokens
// group first by alphabetics, then
// by numerics, then by punctuation.
{
BYTE bCharType = char_types(*pwSource);
*pwDest = ~(bCharType & WORD_TYPE);
}
}

return nRet;
}

Figure 5-12 An excerpt from the WinHelp full-text search engine, which creates Unicode-based sort keys.

Once the search engine has a sorted list of words, it can quickly look for matching strings using a binary search algorithm. Using sort keys is very convenient because they automatically handle language-specific and locale-specific sorting preferences, such as matching ligatures (in English, Æ is equivalent to AE, but this is not true in Danish) and distinguishing between single letters and two-character combinations. (In some Spanish locales, l and the combination ll are distinct letters.)

The WinHelp engine also uses sort keys to match substrings that fall at either the beginning or the end of a word. For example, a prefix search would match car with caring. A suffix search would match ring with caring. The engine makes substring searching considerably faster by defining upper and lower sort-key boundaries for each search. The prefix search for the word car tests all sort keys greater than or equal to car but less than cas. The suffix search for the word ring, which compares sort keys from right to left, tests all sort keys greater than or equal to gnir but less than gnis. The engine creates such key boundaries based on the sort rules of the user's default locale.