93.13 Special Characters

There are a few special characters and characters with special semantics that are important when dealing with Unicode text strings.

UNICODE_NULL

The code 0x0000 is the Unicode string terminator for null-terminated strings. A single null byte is not sufficient for this, because many Unicode characters contain a null byte as either their high or low byte. An example is 'A', which is 0x0041.

Rule 0: Always use (TCHAR) 0 when null-terminating strings.
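
The following is a minimal sketch of why the terminator must be a full 16-bit zero. It assumes a hypothetical 16-bit type named unichar as a stand-in for a Unicode TCHAR; the function names are illustrative and are not part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit code type standing in for a Unicode TCHAR. */
typedef uint16_t unichar;

/* Length in 16-bit codes, stopping only at the terminator 0x0000. */
size_t uni_strlen(const unichar *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != (unichar)0)
        n++;
    return n;
}

/* Naive scan that stops at the first zero byte. For 16-bit text this
 * ends too early, because 'A' (0x0041) already contains a zero byte. */
size_t byte_scan_len(const void *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (p[n] != 0)
        n++;
    return n;
}
```

For the two-character string "AB", uni_strlen reports 2, while the byte scan stops after at most one byte, depending on the byte order of the machine.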

Not a Character

The code points 0xFFFF and 0xFFFE are not characters and therefore do not have Unicode character names. They never form part of Unicode plain text. The use of 0xFFFF is reserved for program private use, for example, as a sentinel code. It is illegal in plain text files or across the API. The special circumstance under which 0xFFFE might occur is explained in the next rule.

Rule 1: 0xFFFF won't be understood, except by yourself.

BYTE_ORDER_MARK

Since Unicode plain text is a sequence of 16-bit codes, it is sensitive to the byte-ordering that is used when writing the text. Intel and MIPS chips have the least significant byte first, while Motorola chips have the most significant byte first. Ideally, all Unicode would follow only one set of rules, but this would force one side to always swap the byte order on reading and writing plain text files, even when the file never leaves the system on which it was created.

Plain text files lack a file header, which would be the preferred place to specify byte order. To have a cheap way to indicate which byte order is used in a text, Unicode has defined a character 0xFEFF byte order mark (BOM) and a noncharacter 0xFFFE, which are the mirror byte-image of each other. The byte order mark is not a control character that selects the byte order of the text; rather, its function is to inform recipients that they are looking at a correctly byte-ordered file.

As a side benefit, since the sequence 0xFEFF is exceedingly rare at the outset of regular non-Unicode text files, it can serve as an implicit marker or signature to identify the file as Unicode. Programs that are written to read both Unicode and non-Unicode text files should use the presence of this sequence as a very strong hint that the file is in Unicode. Compare this to using Ctrl-Z to terminate text files.

Rule 2: Always prefix any Unicode plain text file with a BOM.
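
As an illustration, a reader might inspect the first two bytes of a file to see whether a BOM is present and in which byte order it was written. This is only a sketch: the type bom_kind and the functions detect_bom and swap16 are hypothetical names, not part of any API.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_ABSENT, BOM_LITTLE_ENDIAN, BOM_BIG_ENDIAN } bom_kind;

/* Inspect the first two bytes of a buffer for the byte order mark 0xFEFF.
 * Stored least significant byte first it appears as FF FE; stored most
 * significant byte first it appears as FE FF. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return BOM_ABSENT;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_LITTLE_ENDIAN;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_BIG_ENDIAN;
    return BOM_ABSENT;
}

/* Swap the two bytes of a 16-bit code. Applied to 0xFEFF this yields the
 * noncharacter 0xFFFE, which is what a reader sees when it loads a BOM
 * with the wrong byte order. */
unsigned swap16(unsigned code)
{
    return ((code & 0xFFu) << 8) | ((code >> 8) & 0xFFu);
}
```

A reader that finds the byte order opposite to its own would apply swap16 to every code it reads. A file with no BOM returns BOM_ABSENT, and the program has to fall back on other evidence.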

ASCII Control Characters

The first 32 sixteen-bit characters in Unicode are intended for encoding the 32 control characters. In this manner, existing use of control characters for formatting purposes can be supported. Unicode programs can treat these control codes in exactly the same way as they used to treat their equivalents in ASCII.

ASCII control characters, including arguments in escape sequences, are always converted to their Unicode wide-character equivalent.

When an ASCII plain text file is converted to Unicode, there is a chance that it will be converted back to ASCII later on, perhaps on the receiving end of a transmission. Converting escape sequences into Unicode on a character basis (ESC A turns into 0x001B escape, 0x0041 latin capital letter a) will allow the reverse conversion to be performed without the need to recognize and parse the escape sequence as such.

Rule 3: Translate escape sequences character by character into Unicode.
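
A character-by-character conversion can be sketched as below. The function name ascii_to_unicode is hypothetical, and the sketch assumes 7-bit ASCII input; bytes above 0x7F depend on the code page and are not handled correctly by simple widening.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widen each ASCII byte to its 16-bit Unicode equivalent, without
 * interpreting escape sequences. Writes at most dst_cap codes including
 * the 0x0000 terminator and returns the number of characters converted. */
size_t ascii_to_unicode(const char *src, uint16_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL && dst_cap > 0);
    size_t i = 0;
    while (src[i] != '\0' && i + 1 < dst_cap) {
        dst[i] = (uint16_t)(unsigned char)src[i];
        i++;
    }
    dst[i] = 0;
    return i;
}
```

With this approach the sequence ESC A becomes 0x001B followed by 0x0041, and converting back to ASCII only requires dropping the high byte of each code.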

Line and Paragraph Separator

Unicode has two special characters, 0x2028 line separator and 0x2029 paragraph separator. A new line is begun after each line separator. A new paragraph is begun after each paragraph separator. Since these are separator codes, it is not necessary to either start the first line or paragraph or to end the last line or paragraph with them. Rather, doing so would indicate that there was an empty paragraph or line in that location.

The paragraph separator can be inserted between paragraphs of text. Its use allows plain text files to be created that can be laid out on a different line width at the receiving end. The line separator can be used to indicate an unconditional end of line. In effect, the line separator and the paragraph separator work just like SHIFT+ENTER and ENTER, respectively, in Word. However, they do NOT correspond to CR and LF, or CR/LF. See the next rule.

Rule 4: Use the line and paragraph separator to divide plain text.
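
Because these codes separate rather than terminate, n separators divide a text into n + 1 units. The sketch below counts units on that basis. The function count_units and the two macro names are hypothetical, and treating an empty buffer as zero units is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SEPARATOR      0x2028u
#define PARAGRAPH_SEPARATOR 0x2029u

/* Count the units (lines or paragraphs) that a given separator code
 * divides the text into. A trailing separator counts as introducing one
 * more, empty, unit, matching the separator semantics described above. */
size_t count_units(const uint16_t *text, size_t len, uint16_t separator)
{
    assert(text != NULL || len == 0);
    if (len == 0)
        return 0;
    size_t units = 1;
    for (size_t i = 0; i < len; i++)
        if (text[i] == separator)
            units++;
    return units;
}
```

For example, a buffer holding a single letter followed by a paragraph separator is counted as two paragraphs, the second of which is empty.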

Interaction With CR/LF

The Unicode Standard does not prescribe specific semantics for 0x000D carriage return and 0x000A line feed. It leaves it up to your program to interpret these codes, to decide whether to require their use, and to decide whether CR/LF pairs or single codes are needed. This is really no change from the way these codes are used in ASCII.

Rule 5: Interpretation of ASCII control codes depends on the program.

Non-spacing Characters and Floating Accents

Many scripts contain characters which, on output, combine with other characters visually. A prominent example of this is the so-called floating accents. In Unicode, nonspacing characters follow their base character. The Windows 32-bit API will provide an extended ctype function, which lets an application determine whether a character is nonspacing. When breaking lines or otherwise separating text, make sure not to separate characters that belong together.

Rule 6: Keep nonspacing characters with their base character.
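
One way to honor this rule when choosing a break position is sketched below. The test is_nonspacing is a simplified stand-in for the extended ctype function mentioned above: it assumes only the range 0x0300 through 0x036F (the combining diacritical marks) is nonspacing, which is far from complete. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified assumption: only the combining diacritical marks
 * 0x0300..0x036F are treated as nonspacing. A real program would ask
 * the system's character classification function instead. */
int is_nonspacing(uint16_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Given a proposed break position (the text is split before text[pos]),
 * move the break backward until it no longer falls between a base
 * character and the nonspacing characters that follow it. */
size_t safe_break(const uint16_t *text, size_t len, size_t pos)
{
    assert(text != NULL || len == 0);
    if (pos > len)
        pos = len;
    while (pos > 0 && pos < len && is_nonspacing(text[pos]))
        pos--;
    return pos;
}
```

With the text 'a', 'e', 0x0301, 'b', a proposed break at position 2 would split the acute accent from its 'e', so the sketch moves the break back to position 1.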