If you have both the US and the Japanese edition of Windows 95, you can compare incorrect DBCS behavior with correct DBCS behavior using Microsoft Wordpad. The edition of Wordpad that comes with the US edition of Windows 95 is not DBCS-enabled, as you'll find out if you try running it on the Japanese edition of Windows 95. Open a file containing Japanese text, such as a README file, and attempt to do some basic editing. If you press the Left or Right arrow key, press Backspace, or click around with the mouse, you'll notice that the insertion point improperly bisects all full-width characters. To activate the Input Method Editor (IME) on the Japanese edition of Windows 95, press Alt+~. (See Chapter 7 for more information on how to use IMEs.) Try inserting and deleting characters in random places. If you add or delete a half-width character, full-width characters in the string might shift by 1 byte. This can cause bizarre behavior, as demonstrated in Figure 3-4.
A phrase from a file containing Japanese text.
When the user presses an arrow key, the cursor bisects DBCS characters.
Selecting half of a DBCS character and a full-width katakana character.
Hitting Delete. Oops.
Figure 3-4 Editing a DBCS file on Japanese Windows 95 using the US edition of Wordpad.
Now open the same file using the Japanese edition of Wordpad that comes with the Japanese edition of Windows 95. Try performing some of the same operations you tried with the US edition. You'll get a feel for why it's necessary to DBCS-enable your Windows 95-based code and why you can't simply ship your US edition to the Far East. Imagine how frustrated users would feel if your software behaved this badly when dealing with double-byte characters.
Keeping lead bytes and trail bytes together requires some coding vigilance. Strings that might contain double-byte characters should be parsed from the beginning to the end, not from the end to the beginning. If a DBCS string is processed backward, it's generally not possible to tell whether a byte is a character by itself or the second half of a double-byte pair. (See the section titled "How to Go Backward in a DBCS String" later in this chapter.) The Win32 API CharPrev actually goes back to the beginning of DBCS strings and steps through until it finds the previous character in question; going forward is easier than going backward. The Windows API IsDBCSLeadByte can be used to test whether a particular byte is in the default code page's lead-byte range. (IsDBCSLeadByteEx allows you to check the lead-byte range of a specified code page.) You can process any single-byte character you find immediately. For example, you can display it on the screen. If your program finds a lead byte, it must read the next byte before doing any further processing. Figure 3-5 includes the lead-byte and trail-byte ranges for the code pages used in the Far East editions of Windows 95.
Language | Character Set Name | Code Page | Lead-Byte Ranges | Trail-Byte Ranges
Chinese (Simplified) | GB 2312-80 | CP 936 | 0xA1-0xFE | 0xA1-0xFE
Chinese (Traditional) | Big-5 | CP 950 | 0x81-0xFE | 0x40-0x7E, 0xA1-0xFE
Japanese | Shift-JIS (Japan Industry Standard) | CP 932 | 0x81-0x9F, 0xE0-0xFC | 0x40-0xFC (except 0x7F)
Korean (Wansung) | KS C-5601-1987 | CP 949 | 0x81-0xFE | 0x41-0x5A, 0x61-0x7A, 0x81-0xFE
Korean (Johab) | KS C-5601-1992 | CP 1361 | 0x84-0xD3, 0xD8, 0xD9-0xDE, 0xE0-0xF9 | 0x41-0x7E, 0x81-0xFE (Government standard: 0x31-0x7E, 0x41-0xFE)
Figure 3-5 Lead-byte and trail-byte ranges for code pages used in Far East editions of Windows 95.
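The forward-parsing rule described above can be sketched in portable C. This is only an illustration, not the Windows implementation: the helper names is_cp932_lead_byte and count_cp932_chars are hypothetical, and the lead-byte test hard-codes the code page 932 ranges from Figure 3-5. In a real Windows program you would call IsDBCSLeadByte or IsDBCSLeadByteEx instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lead-byte test for code page 932 (Shift-JIS),
   hard-coding the ranges listed in Figure 3-5 (0x81-0x9F, 0xE0-0xFC).
   On Windows, IsDBCSLeadByteEx would be used instead. */
int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: walks a NUL-terminated CP 932 string from the
   beginning and counts characters, treating a lead byte together with
   its trail byte as a single character. */
size_t count_cp932_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p != '\0') {
        if (is_cp932_lead_byte(*p) && p[1] != '\0') {
            p += 2;   /* lead byte: consume the trail byte as well */
        } else {
            p += 1;   /* single-byte character, or a dangling lead byte */
        }
        count++;
    }
    return count;
}
```

For a 4-byte string made up of the letter A, the double-byte pair 0x83 0x40, and the letter B, count_cp932_chars reports three characters. The sketch assumes the string is already known to be in code page 932, and it treats a lead byte at the very end of the string as a single byte rather than reading past the terminator.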
When you are faced with a potential mix of single-byte and double-byte characters, it is no longer safe to use operators such as ++ or --, which increment or decrement string pointers 1 byte at a time. These operators can be replaced with the Win32 API calls CharNext and CharPrev (AnsiNext and AnsiPrev in 16-bit Windows 3.x), which move pointers properly whether the current character is single-byte or double-byte. In the double-byte world, it is also dangerous to access a string randomly, as in
ch = string[i];
Look back at the sample string in Figure 3-3 above. The value of samplestring[2] is 0x40, but it represents the second half of a kanji character, not the "at" character (@).
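To see why random access is risky, consider the following sketch, which decides what a byte at a given index really is by scanning from the start of the string. It is a portable illustration under stated assumptions: the names sjis_is_lead, classify_byte, and prev_char are hypothetical, the lead-byte ranges are the code page 932 ranges from Figure 3-5, and prev_char only imitates the scan-from-the-beginning approach attributed to CharPrev above. It is not the Windows implementation.

```c
#include <assert.h>
#include <stddef.h>

enum byte_kind { BYTE_SINGLE, BYTE_LEAD, BYTE_TRAIL };

/* Hypothetical lead-byte test for code page 932 (ranges from Figure 3-5). */
int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Classifies the byte at index i by walking forward from the start,
   because the byte value alone cannot tell us its role. */
enum byte_kind classify_byte(const char *s, size_t i)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t pos = 0;

    assert(s != NULL);
    while (p[pos] != '\0') {
        int dbl = sjis_is_lead(p[pos]) && p[pos + 1] != '\0';
        if (pos == i)
            return dbl ? BYTE_LEAD : BYTE_SINGLE;
        if (dbl && pos + 1 == i)
            return BYTE_TRAIL;
        pos += dbl ? 2 : 1;
    }
    return BYTE_SINGLE;   /* i is at or past the terminator */
}

/* Returns a pointer to the start of the character that precedes cur,
   or start itself if cur is already at the beginning. */
const char *prev_char(const char *start, const char *cur)
{
    const char *p = start;
    const char *prev = start;

    assert(start != NULL && cur != NULL);
    while (p < cur && *p != '\0') {
        prev = p;
        p += (sjis_is_lead((unsigned char)*p) && p[1] != '\0') ? 2 : 1;
    }
    return prev;
}
```

With a made-up string consisting of the letter A, the double-byte pair 0x83 0x40, and the letter B (not the string from Figure 3-3), the byte at index 2 has the value 0x40, yet classify_byte reports it as a trail byte rather than an "at" character. Calling prev_char with a pointer to the B returns a pointer to the lead byte at index 1, not to the 0x40 byte. Both helpers cost time proportional to the distance from the start of the string, which is the price of going backward or indexing randomly in DBCS text.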