DBCS-Enabled Programs vs. Non-DBCS-Enabled Programs

Glossary

Big-5: The multibyte encoding standardized by Taiwan.
GB 2312-80: The multibyte encoding standardized by the People's Republic of China.
KS C-5601-1987: The multibyte Wansung encoding standardized by Korea.
KS C-5601-1992: The multibyte Johab encoding standardized by Korea.

If you have both the US and the Japanese edition of Windows 95, you can compare incorrect DBCS behavior with correct DBCS behavior using Microsoft Wordpad. The edition of Wordpad that comes with the US edition of Windows 95 is not DBCS-enabled, as you'll find out if you try running it on the Japanese edition of Windows 95. Open a file containing Japanese text, such as a README file, and attempt to do some basic editing. If you press the Left or Right arrow key, press Backspace, or click around with the mouse, you'll notice that the insertion point improperly bisects all full-width characters. To activate the Input Method Editor (IME) on the Japanese edition of Windows 95, press Alt+~. (See Chapter 7 for more information on how to use IMEs.) Try inserting and deleting characters in random places. If you add or delete a half-width character, full-width characters in the string might shift by 1 byte. This can cause bizarre behavior, as demonstrated in Figure 3-4.

	A phrase from a file containing Japanese text.
	When the user presses an arrow key, the cursor bisects DBCS characters.
	Selecting half of a DBCS character and a full-width katakana character.
	Hitting Delete. Oops.

Figure 3-4 Editing a DBCS file on Japanese Windows 95 using the US edition of Wordpad.

Now open the same file using the Japanese edition of Wordpad that comes with the Japanese edition of Windows 95. Try performing some of the same operations you tried with the US edition. You'll get a feel for why it's necessary to DBCS-enable your Windows 95–based code and why you can't simply ship your US edition to the Far East. Imagine how frustrated users would feel if your software behaved this badly when dealing with double-byte characters.

Keeping lead bytes and trail bytes together requires some coding vigilance. Strings that might contain double-byte characters should be parsed from the beginning to the end, not from the end to the beginning. If a DBCS string is processed backward, it's generally not possible to tell whether a byte is a character by itself or the second half of a double-byte pair. (See the section titled "How to Go Backward in a DBCS String" later in this chapter.) The Win32 API CharPrev actually goes back to the beginning of a DBCS string and steps forward until it finds the character preceding the one in question; going forward is easier than going backward. The Windows API IsDBCSLeadByte can be used to test whether a particular byte is in the default code page's lead-byte range. (IsDBCSLeadByteEx allows you to check the lead-byte range of a specified code page.) You can process any single-byte character you find immediately; for example, you can display it on the screen. If your program finds a lead byte, it must read the next byte before doing any further processing. Figure 3-5 includes the lead-byte and trail-byte ranges for the code pages used in the Far East editions of Windows 95.
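The forward-only rule can be illustrated with a small portable sketch. It is not the Win32 implementation: is_lead_932, dbcs_next, dbcs_prev, and dbcs_strlen are hypothetical helper names, and the lead-byte test simply hardcodes the Shift-JIS (CP 932) ranges from Figure 3-5 rather than calling IsDBCSLeadByteEx. The backward step rescans from the start of the string, in the spirit of the CharPrev behavior described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CP 932 lead-byte test: the Shift-JIS
   lead-byte ranges listed in Figure 3-5 (0x81-0x9F and 0xE0-0xFC). */
static int is_lead_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Forward step: skip one character, which is two bytes if it starts
   with a lead byte and one byte otherwise. */
static const char *dbcs_next(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;                       /* stay on the terminator */
    if (is_lead_932((unsigned char)p[0]) && p[1] != '\0')
        return p + 2;                   /* lead byte plus trail byte */
    return p + 1;                       /* single-byte character */
}

/* Backward step: rescan from the start of the string and remember the
   last character boundary seen before `cur`. */
static const char *dbcs_prev(const char *start, const char *cur)
{
    const char *prev = start;
    const char *p = start;
    assert(start != NULL && cur >= start);
    while (p < cur && *p != '\0') {
        prev = p;
        p = dbcs_next(p);
    }
    return prev;
}

/* Count characters (not bytes) by walking forward from the beginning. */
static size_t dbcs_strlen(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        s = dbcs_next(s);
        n++;
    }
    return n;
}
```

For a 4-byte string consisting of 'A', a lead byte 0x8A, a trail byte 0x40, and 'B', dbcs_strlen reports 3 characters, and stepping back from 'B' lands on the lead byte rather than on 0x40. In real Windows code you would call CharNext and CharPrev instead of these helpers.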

Language	Character Set Name	Code Page	Lead-Byte Ranges	Trail-Byte Ranges
Chinese (Simplified)	GB 2312-80	CP 936	0xA1–0xFE	0xA1–0xFE
Chinese (Traditional)	Big-5	CP 950	0x81–0xFE	0x40–0x7E, 0xA1–0xFE
Japanese	Shift-JIS (Japan Industry Standard)	CP 932	0x81–0x9F, 0xE0–0xFC	0x40–0xFC (except 0x7F)
Korean (Wansung)	KS C-5601-1987	CP 949	0x81–0xFE	0x41–0x5A, 0x61–0x7A, 0x81–0xFE
Korean (Johab)	KS C-5601-1992	CP 1361	0x84–0xD3, 0xD8, 0xD9–0xDE, 0xE0–0xF9	0x41–0x7E, 0x81–0xFE (Government standard: 0x31–0x7E, 0x41–0xFE)

Figure 3-5 Lead-byte and trail-byte ranges for code pages used in Far East editions of Windows 95.
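One way to picture how a per-code-page lead-byte test might work is as a table lookup over the ranges in Figure 3-5. The sketch below is only an illustration under that assumption: ByteRange, LeadByteTable, kLeadTables, and is_lead_byte_cp are hypothetical names, the values are transcribed from the figure, and the real IsDBCSLeadByteEx consults the system's code page data rather than a table like this one.

```c
#include <stddef.h>

/* One inclusive range of byte values. */
typedef struct { unsigned char lo, hi; } ByteRange;

/* Lead-byte ranges for one code page, transcribed from Figure 3-5. */
typedef struct {
    unsigned  code_page;
    size_t    count;        /* number of entries used in lead[] */
    ByteRange lead[4];
} LeadByteTable;

static const LeadByteTable kLeadTables[] = {
    {  936, 1, { {0xA1, 0xFE} } },                               /* GB 2312-80 */
    {  950, 1, { {0x81, 0xFE} } },                               /* Big-5      */
    {  932, 2, { {0x81, 0x9F}, {0xE0, 0xFC} } },                 /* Shift-JIS  */
    {  949, 1, { {0x81, 0xFE} } },                               /* Wansung    */
    { 1361, 4, { {0x84, 0xD3}, {0xD8, 0xD8},
                 {0xD9, 0xDE}, {0xE0, 0xF9} } },                 /* Johab      */
};

/* Returns 1 if `b` falls in a lead-byte range of `code_page`, else 0.
   Code pages missing from the table are treated as single-byte. */
static int is_lead_byte_cp(unsigned code_page, unsigned char b)
{
    size_t t, r;
    for (t = 0; t < sizeof kLeadTables / sizeof kLeadTables[0]; t++) {
        if (kLeadTables[t].code_page != code_page)
            continue;
        for (r = 0; r < kLeadTables[t].count; r++) {
            if (b >= kLeadTables[t].lead[r].lo &&
                b <= kLeadTables[t].lead[r].hi)
                return 1;
        }
        return 0;           /* known code page, byte outside every range */
    }
    return 0;               /* unknown code page */
}
```

With this table, is_lead_byte_cp(932, 0x81) returns 1 while is_lead_byte_cp(936, 0x81) returns 0, which shows why the same byte value cannot be classified without knowing the code page.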

When you are faced with a potential mix of single-byte and double-byte characters, it is no longer safe to use operators such as ++ or --, which increment or decrement string pointers 1 byte at a time. These operators can be replaced with the Win32 API calls CharNext and CharPrev (AnsiNext and AnsiPrev in 16-bit Windows 3.x), which move pointers properly whether the current character is single-byte or double-byte. In the double-byte world, it is also dangerous to access a string randomly, as in

ch = string[i];

Look back at the sample string in Figure 3-3 above. The value of samplestring[2] is 0x40, but it represents the second half of a kanji character, not the "at" character (@).
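The danger of random access can be made concrete with a short sketch that classifies the byte at a given index by scanning forward from the start of the string. The names ByteRole, is_lead_932, and dbcs_byte_role are hypothetical, the lead-byte test hardcodes the Shift-JIS ranges from Figure 3-5, and the test string in the note that follows is an illustrative stand-in, not the string from Figure 3-3.

```c
#include <stddef.h>

typedef enum { BYTE_SINGLE, BYTE_LEAD, BYTE_TRAIL } ByteRole;

/* Hypothetical CP 932 lead-byte test (ranges from Figure 3-5). */
static int is_lead_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Classify s[index] by walking forward from s[0].
   The caller must guarantee that index is less than the string length. */
static ByteRole dbcs_byte_role(const unsigned char *s, size_t index)
{
    size_t i = 0;

    /* Advance one whole character at a time until we reach or pass index. */
    while (i < index)
        i += (is_lead_932(s[i]) && s[i + 1] != 0) ? 2 : 1;

    if (i > index)
        return BYTE_TRAIL;      /* index was skipped as the second byte */

    return (is_lead_932(s[i]) && s[i + 1] != 0) ? BYTE_LEAD : BYTE_SINGLE;
}
```

Given the bytes 0x41, 0x8A, 0x40, 0x42, the byte at index 2 has the value 0x40, yet dbcs_byte_role reports BYTE_TRAIL for it, so treating it as "@" would be a mistake. A check along these lines before indexing is one possible safeguard; walking the string with CharNext avoids the problem altogether.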