93.14 Unicode Plain Text Format

The Unicode standard provides a way to indicate the file type, and even the byte order inside the file, by use of a single wide character as the first character in the file:

0xFEFF = Byte order mark

0xFFFE = Illegal character (reverse of byte order mark)
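As a minimal sketch of why the two values are mirror images, the Python snippet below reads the first two bytes of a buffer as one 16-bit value in an assumed byte order. The function name `first_wide_char` and the sample bytes are illustrative assumptions, not part of any API.

```python
import struct
from typing import Optional

BOM = 0xFEFF           # byte order mark
REVERSED_BOM = 0xFFFE  # the mark as it appears when read in the wrong byte order


def first_wide_char(data: bytes, little_endian: bool = True) -> Optional[int]:
    """Return the first 16-bit value in data, or None if data is too short."""
    if len(data) < 2:
        return None
    fmt = "<H" if little_endian else ">H"
    return struct.unpack_from(fmt, data, 0)[0]


# A file written in little-endian order starts with the bytes FF FE.
little_endian_file = b"\xff\xfe" + "Hi".encode("utf-16-le")

# Read with the matching byte order, the first character is the mark.
assert first_wide_char(little_endian_file, little_endian=True) == BOM
# Read with the opposite byte order, the same two bytes appear as 0xFFFE.
assert first_wide_char(little_endian_file, little_endian=False) == REVERSED_BOM
```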

Here is how these are used:

Table 93.1 Program Expects Unicode

FEFF: Treat FEFF as Unicode indicator. Skip it and process the rest of the file as Unicode.

FFFE: Treat FFFE as byte-reversed Unicode indicator. Action: Flag wrong byte order as error. Optionally: Byte-swap file and process file as Unicode.

neither: File is not marked. Action: Assume it is Unicode and process file as such. Optionally: Issue a warning that file may not be Unicode. Run a heuristic check.

Table 93.2 Program Does Not Expect Unicode

FEFF: Treat FEFF as possible Unicode indicator. Action: Flag as possible error, process as non-Unicode. Optionally: Map data to 8-bit before processing.

FFFE: Treat FFFE as possible byte-reversed Unicode indicator. Action: Flag as possible error, process as non-Unicode. Optionally: Byte-swap file and map data to 8-bit before processing.

neither: Don't expect file to be marked. Action: Assume it is non-Unicode data and process file as such. Optionally: Run a heuristic check.

Table 93.3 Program Will Take Either

FEFF: Treat FEFF as Unicode indicator. Skip it and process the rest of the file as Unicode.

FFFE: Treat FFFE as byte-reversed Unicode indicator. Action: Flag wrong byte order as error. Optionally: Byte-swap file and process file as Unicode.

neither: Non-Unicode files are normally unmarked. Action: Process as if non-Unicode. Optionally: Run a heuristic check.
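The default actions in the three tables can be collected into a single lookup. The following Python sketch is one possible arrangement, not a prescribed implementation: the names `Expect`, `classify_mark`, `ACTIONS`, and `decide`, and the treatment strings, are all hypothetical, and only the default "Action" of each cell is modeled (the "Optionally" steps are left out).

```python
from enum import Enum
from typing import Tuple


class Expect(Enum):
    UNICODE = "expects Unicode"          # Table 93.1
    NON_UNICODE = "does not expect it"   # Table 93.2
    EITHER = "will take either"          # Table 93.3


def classify_mark(data: bytes, little_endian: bool = True) -> str:
    """Return 'FEFF', 'FFFE', or 'neither' for the first 16-bit value."""
    if len(data) < 2:
        return "neither"
    value = int.from_bytes(data[:2], "little" if little_endian else "big")
    if value == 0xFEFF:
        return "FEFF"
    if value == 0xFFFE:
        return "FFFE"
    return "neither"


# (program mode, first character) -> (treatment, flag as error or possible error)
ACTIONS = {
    (Expect.UNICODE, "FEFF"): ("unicode", False),
    (Expect.UNICODE, "FFFE"): ("wrong-byte-order", True),
    (Expect.UNICODE, "neither"): ("unicode", False),
    (Expect.NON_UNICODE, "FEFF"): ("non-unicode", True),
    (Expect.NON_UNICODE, "FFFE"): ("non-unicode", True),
    (Expect.NON_UNICODE, "neither"): ("non-unicode", False),
    (Expect.EITHER, "FEFF"): ("unicode", False),
    (Expect.EITHER, "FFFE"): ("wrong-byte-order", True),
    (Expect.EITHER, "neither"): ("non-unicode", False),
}


def decide(data: bytes, expect: Expect, little_endian: bool = True) -> Tuple[str, bool]:
    """Look up the default treatment for a file's leading bytes."""
    return ACTIONS[(expect, classify_mark(data, little_endian))]


# Marked file read in the matching byte order: process as Unicode.
print(decide(b"\xff\xfeH\x00i\x00", Expect.EITHER))
# Same mark stored in the opposite byte order: flagged.
print(decide(b"\xfe\xff\x00H\x00i", Expect.EITHER))
# Unmarked 8-bit text: process as non-Unicode.
print(decide(b"Hello", Expect.EITHER))
```

Keeping the policy in a table rather than nested conditionals makes it easy to compare against the three tables cell by cell. When the treatment is "unicode" and the mark was present, the caller would still need to skip the first two bytes before processing.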

If a byte order mark is found in the middle of a file, it is treated as ZERO WIDTH NO-BREAK SPACE; that is, it is not printed and does not introduce a word break.

The BYTE ORDER MARK character is not found in any code page, so it disappears if data is converted to ANSI. Unlike other Unicode characters, it is NOT replaced by a default character when converted.

The heuristic check for Unicode data could be as simple as testing whether the variation in the low-order bytes is much higher than the variation in the high-order bytes. For example, if ASCII is converted to Unicode, every second byte is 0. Detecting 0x000A and 0x000D (line feed and carriage return) and checking for an even (vs. odd) file size are also simple tests that, taken together, provide a strong indicator of the nature of the file.
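A minimal sketch of such a heuristic in Python follows. The function name `unicode_score`, the use of "number of distinct byte values" as the measure of variation, and the factor of 2 in the comparison are all assumptions chosen for illustration; a real application would tune them against its own data.

```python
def unicode_score(data: bytes, little_endian: bool = True) -> int:
    """Count how many of three simple tests suggest 16-bit Unicode (0 to 3)."""
    if not data:
        return 0
    order = "little" if little_endian else "big"

    # Test 1: a file of 16-bit characters has an even size.
    even_size = len(data) % 2 == 0

    # Test 2: low-order bytes vary much more than high-order bytes.
    # Variation is measured here as the number of distinct byte values.
    if little_endian:
        low, high = data[0::2], data[1::2]
    else:
        high, low = data[0::2], data[1::2]
    varied = len(set(low)) > 2 * len(set(high))  # factor 2 is an assumption

    # Test 3: a 16-bit line feed (0x000A) or carriage return (0x000D) occurs.
    units = (int.from_bytes(data[i:i + 2], order)
             for i in range(0, len(data) - 1, 2))
    has_line_break = any(u in (0x000A, 0x000D) for u in units)

    return sum((even_size, varied, has_line_break))


sample = "Hello\r\nWorld\r\n"
print(unicode_score(sample.encode("utf-16-le")))  # ASCII converted to Unicode
print(unicode_score(sample.encode("ascii")))      # the same text as 8-bit data
```

Returning a count rather than a yes/no answer leaves the threshold to the caller, which fits the observation that the tests are strong only when taken together.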

“Treat as error” can mean any of the typical error handling actions for your application.