Unicode Versus Basic

Stringwise, we are cursed to live in interesting times. The world according to Microsoft (and many other international companies) is moving from ANSI to Unicode characters, but the transition isn’t exactly a smooth one.

Most of the Unicode confusion comes from the fact that we are in the midst of a comprehensive change in the way characters are represented. The old way represents the 256 characters of the ANSI character set in single bytes but reserves some byte values as lead bytes for double-byte characters, so that non-ANSI character sets can be represented. This is very efficient for the cultural imperialists who got there first with Latin characters, but it’s inefficient for those who use larger character sets such as Chinese ideograms and Sumerian cuneiform. Unicode represents every character in two bytes. This is inefficient for the cultural imperialists (although they still get the honor of claiming most of the first 128 characters, with zero in the upper byte), but it’s more efficient (and more fair) for the rest of the world. Instead of 256 unique characters, you get 65,536: enough to handle the characters of almost all the world’s languages.

Eventually, everybody will use Unicode, but different systems have chosen different ways of dealing with the transition.

What does this mean for you? Trouble.