Surrogates

Platform SDK: International Features

There is a need to support more characters than fit in the 16-bit Unicode code space. For example, while 16-bit Unicode allows just over 65,000 characters, the Chinese-speaking community alone uses at least 55,000 characters. To answer this need, the Unicode Standard defines surrogates. A surrogate pair is a pair of 16-bit Unicode values that together represent a single character. The first (high) surrogate is a 16-bit value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit value in the range U+DC00 to U+DFFF. Using surrogates, Unicode can support over one million characters (1,024 high surrogates × 1,024 low surrogates = 1,048,576 additional characters). For more details about surrogates, refer to the Unicode Standard, version 2.0.
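The two ranges above can be checked and combined with simple arithmetic. The following is a minimal sketch in portable C, not code from the Platform SDK: the helper names are hypothetical, and the mapping shown is the standard UTF-16 calculation, which this page does not itself spell out.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any Windows API. */

/* Nonzero if u is a high (first) surrogate, U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Nonzero if u is a low (second) surrogate, U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low pair into one character value in the range
   U+10000..U+10FFFF. The caller is assumed to have range-checked both
   halves with the helpers above. */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    uint32_t hi_bits = (uint32_t)(high - 0xD800); /* top 10 bits */
    uint32_t lo_bits = (uint32_t)(low - 0xDC00);  /* bottom 10 bits */
    return 0x10000u + ((hi_bits << 10) | lo_bits);
}
```

Each half contributes 10 bits, which is where the 1,024 × 1,024 count of additional characters comes from. For example, the pair U+D800, U+DC00 combines to U+10000, and U+DBFF, U+DFFF combines to U+10FFFF.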

Windows 2000 provides support for basic input, output, and simple sorting of surrogates. However, not all Windows 2000 system components are surrogate compatible. Also, surrogates are not supported in Windows 95/98 or in Windows NT 4.0.

Windows 2000 supports surrogates in the following ways:

The OpenType cmap format 12 subtable is introduced, which directly supports 4-byte character codes. Refer to the OpenType font specification for more detail.
Windows USER supports surrogate-enabled IMEs.
Windows GDI APIs support cmap 12 so surrogates can be displayed correctly.
Uniscribe APIs support surrogates.
Windows controls, including Edit and Rich Edit, support surrogates.
The HTML engine supports HTML pages that include surrogates for display, editing (through Outlook Express), and forms submission.
The system sorting table supports surrogates.

Planes two and three (defined in ISO/IEC 10646) are reserved for ideographic characters. These planes fall in the high surrogate range of U+D840 to U+D8BF.
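The high surrogate range quoted for planes two and three can be derived by splitting the first and last character values of those planes into surrogate pairs. The sketch below is illustrative only: `split_code_point` is a hypothetical helper, and it uses the standard UTF-16 calculation rather than anything defined on this page.

```c
#include <stdint.h>

/* Hypothetical helper: split a character value in U+10000..U+10FFFF into
   its surrogate pair. Returns 1 on success. Returns 0 and leaves *high and
   *low untouched if cp is outside that range. */
static int split_code_point(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000u || cp > 0x10FFFFu)
        return 0;
    cp -= 0x10000u;                              /* 20 significant bits remain */
    *high = (uint16_t)(0xD800u + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* bottom 10 bits */
    return 1;
}
```

Plane two begins at U+20000, which splits into U+D840, U+DC00. Plane three ends at U+3FFFF, which splits into U+D8BF, U+DFFF. Every character in the two planes therefore has a high surrogate between U+D840 and U+D8BF.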

General Guidelines for Software Development

Windows 2000 handles surrogates as pairs of 16-bit characters. The system processes surrogate pairs in a way similar to the way it processes nonspacing marks. At display time, the surrogate pair is displayed as one glyph by means of Uniscribe. (This conforms to the requirements in the Unicode Standard, version 2.0.)

Applications automatically support surrogates if they support Unicode and use system controls and standard APIs, such as ExtTextOut and DrawText. Thus, if your code uses standard system controls or general ExtTextOut-type calls to display text, surrogate pairs should work without any changes.

Applications implementing their own editing support by working out glyph positions for themselves may use Uniscribe for all text processing. Uniscribe has separate APIs to deal with complex script processing (such as line service, hit testing, and cursor movement). The application must call the Uniscribe APIs specifically to get these advanced features. Applications written to the Uniscribe API are fully multilingual. However, using Uniscribe does impose a performance penalty, so some applications may want to do their own surrogate processing.

Since surrogates are well defined, you can also write your own code to handle surrogate text processing. When a program encounters a 16-bit value in either the high surrogate range (U+D800 to U+DBFF) or the low surrogate range (U+DC00 to U+DFFF), that value must be one half of a surrogate pair. Thus, you can detect a surrogate pair by doing simple range checking. If you encounter a value in the high range, look forward one 16-bit value to get the rest of the character; if you encounter a value in the low range, look backward one 16-bit value. Keep in mind that CharNext and CharPrev move by single 16-bit values, not by surrogate pairs.
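The range checking and forward or backward tracking described above might look like the following. This is a minimal sketch in portable C under stated assumptions: the function names are hypothetical (they are not CharNext or CharPrev replacements supplied by the system), text is held as an array of 16-bit values with an explicit length, and an unpaired surrogate is simply stepped over as a single value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any Windows API. */
static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Return the index of the character after the one that starts at index i.
   Requires i < len. Steps over two values for a complete surrogate pair
   and one value otherwise (including an unpaired surrogate). */
static size_t next_char(const uint16_t *text, size_t len, size_t i)
{
    if (is_high(text[i]) && i + 1 < len && is_low(text[i + 1]))
        return i + 2;
    return i + 1;
}

/* Return the index of the character that ends just before index i.
   Requires i > 0. Steps back two values if the preceding values form a
   complete surrogate pair, and one value otherwise. */
static size_t prev_char(const uint16_t *text, size_t i)
{
    if (i >= 2 && is_low(text[i - 1]) && is_high(text[i - 2]))
        return i - 2;
    return i - 1;
}

/* Count characters, treating each complete surrogate pair as one. */
static size_t count_chars(const uint16_t *text, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        i = next_char(text, len, i);
        n++;
    }
    return n;
}
```

For the five values 0x0041, 0xD840, 0xDC00, 0x0042, 0xDC00, `count_chars` returns 4: the letter A, one surrogate pair, the letter B, and a trailing unpaired low surrogate. Whether an unpaired surrogate should be counted, skipped, or reported as an error is a design choice left to the application.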

For sorting, note that all surrogate pairs are treated as two Unicode code points. Surrogates are sorted after other Unicode code points, but before the Private Use Area (PUA). Sorting for a standalone surrogate character (that is, one whose high or low half is missing) is not supported.

If you are a font or IME provider, note that Windows 2000 disables surrogate support by default. If you provide a font and IME package that requires surrogate support, you must set the following registry values:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack] 
SURROGATE=(REG_DWORD)0x00000002

[HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts\42]
IEFixedFontName=[Surrogate Font Face Name]
IEPropFontName=[Surrogate Font Face Name]