Chapter 3 Encoding Character Sets

Most developers of international Microsoft Windows–based programs have at some point banged their heads against the wall trying to come to grips with character encodings. The mishmash of standards makes it hard for users to share data and for programmers to create worldwide software. Some standards are 7-bit; others are 8-bit. Single-byte character sets come in several flavors, as do the double-byte standards, which are also called multibyte because they are really a mix of single-byte and double-byte character codes. Passing data in one character encoding to a network or operating system that expects another involves a gauntlet of mappings, conversions, fonts, and headaches.
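To make the "mix of single-byte and double-byte character codes" concrete, the following minimal sketch counts how many bytes each character occupies in two Windows code pages. It is an illustration only: the helper name byte_lengths is hypothetical, and the sketch assumes Python's built-in cp1252 (Western European) and cp932 (Japanese) codecs as stand-ins for the corresponding Windows code pages.

```python
# Illustrative sketch: bytes per character in a single-byte code page
# versus a double-byte (multibyte) code page.
# byte_lengths is a hypothetical helper, not part of any Windows API.

def byte_lengths(text, encoding):
    """Return a list with the encoded byte count of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

# Code page 1252 is single-byte: every character takes exactly one byte.
print(byte_lengths("Aé", "cp1252"))   # [1, 1]

# Code page 932 is "double-byte" in name, but ASCII letters still take
# one byte while kana and kanji take two, hence the term multibyte.
print(byte_lengths("Aあ", "cp932"))    # [1, 2]
```

Because character boundaries in a multibyte code page do not fall at fixed offsets, code that steps through such a string one byte at a time can land in the middle of a character.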

The reasons for this complexity are historical, and as technology evolves, working with character encodings will get easier. A band of international thinkers has created a standard for the future called Unicode, which newer operating systems such as Microsoft Windows NT have adopted. But until Unicode is more widely used, programmers will still have to navigate through a rough sea of character-encoding standards. Because understanding character sets is basic to any internationalization effort, this chapter provides a map for understanding them in Windows NT, which is based on Unicode, and in Microsoft Windows 95, which is not based on Unicode but instead uses the same family of code pages introduced by Microsoft Windows 3.1.