DBCS Programming Basics

Each 2-byte character is composed of a lead byte and a trail byte that must be stored together and manipulated as a unit. A lead-byte value always falls into one of several ranges above 127; no 7-bit ASCII character can be a lead byte. The null byte (0x00) can never be a trail byte, but the range of possible trail bytes can overlap to some degree with ASCII. Trail-byte values are frequently indistinguishable from lead-byte values; the only way to tell the difference is from the context of the surrounding characters. Furthermore, a trail byte taken without its lead byte can be mistaken for a single-byte character. Code that scans a double-byte character set string for a single-byte character such as a backslash (\) might "find" the second half of a kanji character.

In the nonsensical filename below, the second DBCS character has a trail byte equal to the backslash. This is how the filename would appear on a DBCS system:

The same filename, however, might look like a pathname when processed by a DBCS-ignorant program.
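
The effect can be reproduced with a short sketch. It assumes Shift-JIS (code page 932) encoding, in which one katakana character is encoded as the bytes 0x83 0x5C; the file name and the helper NaiveBackslashOffset are made up for illustration and are not taken from the original example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical file name in Shift-JIS (code page 932): "A", one double-byte
   character with the bytes 0x83 0x5C, then "B.txt". It contains no real
   backslash, only a trail byte whose value happens to be 0x5C. */
const char kSampleName[] = "A\x83\x5C" "B.txt";

/* Byte-wise search, as a DBCS-ignorant program would perform it.
   Returns the byte offset of the first 0x5C byte, or -1 if there is none. */
long NaiveBackslashOffset(const char *psz)
{
    const char *p;
    assert(psz != NULL);
    p = strchr(psz, '\\');
    return p ? (long)(p - psz) : -1L;
}
```

Calling NaiveBackslashOffset(kSampleName) yields 2, the offset of the trail byte. A program that splits paths at that offset would treat the first two bytes as a directory name and B.txt as the file name.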

With double-byte characters, code that searches, selects, edits, moves, replaces, deletes, or inserts text must check for double-byte pairs, as shown below:

#include <windows.h>    // IsDBCSLeadByte

// Return pointer to the first '\' in a given string, or to the
// terminating null if the string contains no backslash.
char* GetBackslash(char *pszStr)
{
    while (*pszStr)
    {
        if (!IsDBCSLeadByte((BYTE)*pszStr))
        {
            if (*pszStr == '\\')
                break;              // a genuine single-byte backslash
        }
        else if (pszStr[1])         // lead byte followed by a trail byte
        {
            pszStr++;               // point to the trail byte
        }

        pszStr++;                   // point to the next character
    }
    return pszStr;
}

If you separate a lead byte from its trail byte, you will corrupt the string. In the following example, inserting the single-byte value 0x41 (ASCII A) in the middle of a double-byte character yields strange results when a program has not been properly DBCS-enabled.

[Figure: the string before and after the insertion]

The lead byte of the kanji character combines with the A to create a different kanji, and the trail byte of the original kanji becomes the single-byte katakana character se.
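
One way to avoid this kind of damage is to confirm that an insertion point lies on a character boundary before inserting. The following is a minimal sketch, assuming Shift-JIS (code page 932) lead-byte ranges of 0x81-0x9F and 0xE0-0xFC. The names IsLeadByte932, IsCharBoundary, and InsertByteAt are hypothetical helpers, not functions from any library; on Windows, IsDBCSLeadByte could take the place of IsLeadByte932.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and
   0xE0-0xFC. Other DBCS code pages use different ranges. */
int IsLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns 1 if byte offset off is the start of a character or the end of
   the string, 0 if it falls on a trail byte or past the end. The scan starts
   at the beginning because a byte cannot be classified in isolation. */
int IsCharBoundary(const char *psz, size_t off)
{
    size_t i = 0;
    assert(psz != NULL);
    while (psz[i] != '\0' && i < off)
    {
        if (IsLeadByte932((unsigned char)psz[i]) && psz[i + 1] != '\0')
            i += 2;     /* step over lead byte and trail byte together */
        else
            i += 1;     /* single-byte character */
    }
    return i == off;
}

/* Inserts the single byte ch at offset off in buf, which holds cap bytes.
   Returns 1 on success, 0 if the offset would split a double-byte
   character or the buffer is too small. */
int InsertByteAt(char *buf, size_t cap, size_t off, char ch)
{
    size_t len;
    assert(buf != NULL);
    len = strlen(buf);
    if (len + 2 > cap || !IsCharBoundary(buf, off))
        return 0;
    memmove(buf + off + 1, buf + off, len - off + 1);  /* moves the terminator too */
    buf[off] = ch;
    return 1;
}
```

With a buffer holding X, a double-byte character, and Y, a request to insert at offset 2 (between the lead byte and the trail byte) is refused and the buffer is left unchanged, while an insertion at offset 3 succeeds.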

Display Operations

Not only do coding practices need to be adjusted to avoid splitting double-byte characters in two, but so do a program's display operations. Rules of selection, cursor placement, and cursor movement are the same as you would expect when dealing with alphabetic characters: the cursor should always end up between characters and never in the middle of one. The difference with double-byte characters is that each one is a combination of two encoding units, so a single cursor movement must step over both bytes. In a command-line interface, such as the Windows NT console mode, double-byte characters are generally twice as wide as ASCII characters. In Windows, ASCII characters can be drawn with proportional fonts, but ideographic characters, including Japanese kana, are always monospaced.
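
Cursor movement over a byte buffer might be sketched as follows. This is an illustration only, assuming Shift-JIS (code page 932) lead-byte ranges; CursorRight, CursorLeft, and IsLeadByte932 are hypothetical names. On Windows, the CharNext and CharPrev functions provide comparable behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and
   0xE0-0xFC. */
int IsLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Moves a cursor at byte offset off one character to the right.
   off must be a character boundary within the string. */
size_t CursorRight(const char *psz, size_t off)
{
    assert(psz != NULL);
    if (psz[off] == '\0')
        return off;         /* already at the end of the string */
    if (IsLeadByte932((unsigned char)psz[off]) && psz[off + 1] != '\0')
        return off + 2;     /* step over lead byte and trail byte */
    return off + 1;         /* single-byte character */
}

/* Moves a cursor at byte offset off one character to the left. The string
   is rescanned from the start, because the byte just before the cursor
   cannot be classified as a trail byte or a single byte on its own. */
size_t CursorLeft(const char *psz, size_t off)
{
    size_t prev = 0;
    size_t i = 0;
    assert(psz != NULL);
    while (i < off && psz[i] != '\0')
    {
        prev = i;
        i = CursorRight(psz, i);
    }
    return prev;
}
```

For a string made of a, one double-byte character, and b, moving right from offset 1 lands on offset 3, and moving left from offset 3 lands back on offset 1, so the cursor never rests on offset 2.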

In the operations shown in the following example, the cursor should never end up bisecting a double-byte character.

[Figure: correct and incorrect results for three operations: placing the cursor, backspace-deleting a character, and selecting a character]


If you click the mouse on the leftmost three-quarters of the character, the cursor should end up to the left of the character. If you click it on the rightmost quarter, the cursor should end up to the right of the character.
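
The click rule can be expressed as a small hit-test function. This is a sketch under assumptions: CaretSideFromClick and the CaretSide enumeration are hypothetical names, and the click position is taken to be measured in pixels from the left edge of the character.

```c
#include <assert.h>

enum CaretSide { CARET_LEFT = 0, CARET_RIGHT = 1 };

/* xInChar is the horizontal click position relative to the left edge of
   the character, with 0 <= xInChar < charWidth. Clicks in the leftmost
   three-quarters place the cursor to the left of the character; clicks
   in the rightmost quarter place it to the right. */
enum CaretSide CaretSideFromClick(int xInChar, int charWidth)
{
    assert(charWidth > 0 && xInChar >= 0 && xInChar < charWidth);
    /* Integer form of xInChar < 0.75 * charWidth. */
    return (4 * xInChar < 3 * charWidth) ? CARET_LEFT : CARET_RIGHT;
}
```

For a character 16 pixels wide, clicks at x positions 0 through 11 place the cursor on the left, and clicks at 12 through 15 place it on the right.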