DBCS-Enabled Programs vs. Non-DBCS-Enabled Programs

If you have both the US and the Japanese edition of Windows 95, you can compare incorrect DBCS behavior with correct DBCS behavior using Microsoft Wordpad. The edition of Wordpad that comes with the US edition of Windows 95 is not DBCS-enabled, as you'll find out if you try running it on the Japanese edition of Windows 95. Open a file containing Japanese text, such as a README file, and attempt to do some basic editing. If you press the Left or Right arrow key, press Backspace, or click around with the mouse, you'll notice that the insertion point improperly bisects all full-width characters. To activate the Input Method Editor (IME) on the Japanese edition of Windows 95, press Alt+~. (See Chapter 7 for more information on how to use IMEs.) Try inserting and deleting characters in random places. If you add or delete a half-width character, full-width characters in the string might shift by 1 byte. This can cause bizarre behavior, as demonstrated in Figure 3-4.

A phrase from a file containing Japanese text.
When the user presses an arrow key, the cursor bisects DBCS characters.
Selecting half of a DBCS character and a full-width katakana character.
Hitting Delete. Oops.


Figure 3-4 Editing a DBCS file on Japanese Windows 95 using the US edition of Wordpad.

Now open the same file using the Japanese edition of Wordpad that comes with the Japanese edition of Windows 95. Try performing some of the same operations you tried with the US edition. You'll get a feel for why it's necessary to DBCS-enable your Windows 95–based code and why you can't simply ship your US edition to the Far East. Imagine how frustrated users would feel if your software behaved this badly when dealing with double-byte characters.

Keeping lead bytes and trail bytes together requires some coding vigilance. Strings that might contain double-byte characters should be parsed from the beginning to the end, not from the end to the beginning. If a DBCS string is processed backward, it's generally not possible to tell whether a byte is a character by itself or the second half of a double-byte pair. (See the section titled "How to Go Backward in a DBCS String" later in this chapter.) The Win32 API CharPrev actually goes back to the beginning of DBCS strings and steps through until it finds the previous character in question; going forward is easier than going backward. The Windows API IsDBCSLeadByte can be used to test whether a particular byte is in the default code page's lead-byte range. (IsDBCSLeadByteEx allows you to check the lead-byte range of a specified code page.) You can process any single-byte character you find immediately; for example, you can display it on the screen. If your program finds a lead byte, it must read the next byte before doing any further processing. Figure 3-5 includes the lead-byte and trail-byte ranges for the code pages used in the Far East editions of Windows 95.
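The forward-parsing rule can be sketched in portable C. This is only an illustration under stated assumptions: is_lead_byte_cp932 is a hypothetical stand-in for IsDBCSLeadByteEx(932, b) that hard-codes the Shift-JIS lead-byte ranges from Figure 3-5, and dbcs_char_count is an invented helper name, not a Windows API.

```c
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByteEx(932, b): hard-codes the
   Shift-JIS (CP 932) lead-byte ranges listed in Figure 3-5. */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) by scanning forward from the start of
   the string, keeping each lead byte together with its trail byte. */
static size_t dbcs_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != '\0') {
        if (is_lead_byte_cp932(*p) && p[1] != '\0') {
            p += 2;  /* lead byte plus trail byte: one character */
        } else {
            p += 1;  /* single-byte character, or a truncated lead byte */
        }
        count++;
    }
    return count;
}
```

For the 4-byte sequence 'A', 0x83, 0x40, 'B', this scan reports three characters, because 0x83 is a lead byte and the 0x40 that follows is consumed as its trail byte. A production program on Windows would more likely call IsDBCSLeadByte or IsDBCSLeadByteEx than hard-code the ranges.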


Language               Character Set Name                   Code Page  Lead-Byte Ranges                       Trail-Byte Ranges
---------------------  -----------------------------------  ---------  -------------------------------------  ---------------------------------------
Chinese (Simplified)   GB 2312-80                           CP 936     0xA1–0xFE                              0xA1–0xFE
Chinese (Traditional)  Big-5                                CP 950     0x81–0xFE                              0x40–0x7E, 0xA1–0xFE
Japanese               Shift-JIS (Japan Industry Standard)  CP 932     0x81–0x9F, 0xE0–0xFC                   0x40–0xFC (except 0x7F)
Korean (Wansung)       KS C-5601-1987                       CP 949     0x81–0xFE                              0x41–0x5A, 0x61–0x7A, 0x81–0xFE
Korean (Johab)         KS C-5601-1992                       CP 1361    0x84–0xD3, 0xD8, 0xD9–0xDE, 0xE0–0xF9  0x41–0x7E, 0x81–0xFE
                                                                                                              (Government standard: 0x31–0x7E, 0x41–0xFE)


Figure 3-5 Lead-byte and trail-byte ranges for code pages used in Far East editions of Windows 95.
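One way to use the lead-byte ranges in Figure 3-5 outside of Windows is a small lookup table. The sketch below is an assumption-laden illustration, not how Windows implements IsDBCSLeadByteEx: the types ByteRange and LeadByteTable and the function is_lead_byte_in are hypothetical names, and the table contains only the lead-byte values shown in the figure.

```c
#include <stddef.h>

/* Inclusive range of byte values. */
typedef struct {
    unsigned char lo;
    unsigned char hi;
} ByteRange;

/* Lead-byte ranges for one code page, copied from Figure 3-5. */
typedef struct {
    unsigned  code_page;
    ByteRange lead[4];
    size_t    lead_count;
} LeadByteTable;

static const LeadByteTable k_lead_tables[] = {
    {  936, { {0xA1, 0xFE} }, 1 },                 /* Chinese (Simplified)  */
    {  950, { {0x81, 0xFE} }, 1 },                 /* Chinese (Traditional) */
    {  932, { {0x81, 0x9F}, {0xE0, 0xFC} }, 2 },   /* Japanese (Shift-JIS)  */
    {  949, { {0x81, 0xFE} }, 1 },                 /* Korean (Wansung)      */
    { 1361, { {0x84, 0xD3}, {0xD8, 0xD8},
              {0xD9, 0xDE}, {0xE0, 0xF9} }, 4 },   /* Korean (Johab)        */
};

/* Returns nonzero if b falls in a lead-byte range of code_page.
   Code pages that are not in the table are treated as single-byte. */
static int is_lead_byte_in(unsigned code_page, unsigned char b)
{
    size_t t, r;
    size_t n = sizeof(k_lead_tables) / sizeof(k_lead_tables[0]);

    for (t = 0; t < n; t++) {
        if (k_lead_tables[t].code_page != code_page) {
            continue;
        }
        for (r = 0; r < k_lead_tables[t].lead_count; r++) {
            if (b >= k_lead_tables[t].lead[r].lo &&
                b <= k_lead_tables[t].lead[r].hi) {
                return 1;
            }
        }
        return 0;
    }
    return 0;
}
```

With this table, byte 0x83 is a lead byte in code page 932 but not in code page 936, whose lead-byte range starts at 0xA1. The same byte value can therefore mean different things depending on the code page in effect.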

When you are faced with a potential mix of single-byte and double-byte characters, it is no longer safe to use operators such as ++ or --, which increment or decrement string pointers 1 byte at a time. These operators can be replaced with the Win32 API calls CharNext and CharPrev (AnsiNext and AnsiPrev in 16-bit Windows 3.x), which move pointers properly whether the current character is single-byte or double-byte. In the double-byte world, it is also dangerous to access a string randomly, as in

ch = string[i];

Look back at the sample string in Figure 3-3 above. The value of samplestring[2] is 0x40, but it represents the second half of a kanji character, not the "at" character (@).
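The pointer-stepping and random-access hazards described above can be sketched in portable C. Everything here is illustrative: dbcs_next and dbcs_prev are hypothetical analogues of CharNext and CharPrev, dbcs_is_trail_at is an invented helper, and is_lead_byte_cp932 hard-codes the Shift-JIS lead-byte ranges from Figure 3-5 instead of calling IsDBCSLeadByteEx. The backward step follows the approach the text attributes to CharPrev: rescan from the start of the string.

```c
#include <stddef.h>

/* Hypothetical stand-in for IsDBCSLeadByteEx(932, b). */
static int is_lead_byte_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Analogue of CharNext: advance by one character, not one byte. */
static const char *dbcs_next(const char *p)
{
    if (*p == '\0') {
        return p;                       /* stay on the terminator */
    }
    if (is_lead_byte_cp932((unsigned char)*p) && p[1] != '\0') {
        return p + 2;                   /* double-byte character */
    }
    return p + 1;                       /* single-byte character */
}

/* Analogue of CharPrev: rescan from start and remember the last
   character boundary seen before p. */
static const char *dbcs_prev(const char *start, const char *p)
{
    const char *prev = start;
    const char *cur = start;

    while (cur < p) {
        prev = cur;
        cur = dbcs_next(cur);
        if (cur == prev) {
            break;                      /* reached the terminator early */
        }
    }
    return prev;
}

/* Nonzero if s[index] is the trail byte of a double-byte character,
   which is why reading s[index] directly is unsafe. */
static int dbcs_is_trail_at(const char *s, size_t index)
{
    size_t i = 0;

    while (i < index && s[i] != '\0') {
        size_t next = i + (size_t)(dbcs_next(s + i) - (s + i));
        if (next > index) {
            return 1;                   /* index falls inside this character */
        }
        i = next;
    }
    return 0;
}
```

For a string holding the bytes 'A', 0x83, 0x40, 'B', the byte at index 2 compares equal to '@', yet dbcs_is_trail_at reports that it is the second half of a double-byte character. Stepping with dbcs_next from index 1 lands on index 3, skipping the trail byte entirely.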