Unicode Versus Basic
Stringwise, we are cursed to live in interesting times. The world according to Microsoft (and many other international companies) is moving from ANSI to Unicode characters, but the transition isn’t exactly a smooth one.
Most of the Unicode confusion comes from the fact that we are in the midst of a comprehensive change in the way characters are represented. The old way uses the ANSI character set, one byte per character for its 256 characters, but reserves some values as lead bytes for double-byte characters so that non-ANSI character sets can be represented. This is very efficient for the cultural imperialists who got there first with Latin characters, but it’s inefficient for those who use larger character sets such as Chinese ideograms and Sumerian cuneiform. Unicode represents all characters in two bytes. This is inefficient for the cultural imperialists (although they still get the honor of claiming most of the first 128 characters with zero in the upper byte), but it’s more efficient (and more fair) for the rest of the world. Instead of having 256 unique characters, you can have 65,536—enough to handle the characters of almost all the world’s languages.
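The size trade-off is easy to demonstrate. The sketch below is mine, not the book’s, and uses Python rather than Basic; the Shift-JIS code page (cp932) stands in for a double-byte ANSI code page, and UTF-16 for the two-byte Unicode the text describes.

```python
# Compare byte counts for the same text under a double-byte ANSI code
# page (Shift-JIS, cp932) and under two-byte Unicode (UTF-16).
latin = "ABC"    # Latin letters
kana = "アイウ"   # Japanese katakana

# ANSI/DBCS: Latin characters take one byte each; characters outside
# the base set need a lead byte plus a trail byte.
print(len(latin.encode("cp932")))   # 3 bytes: one per character
print(len(kana.encode("cp932")))    # 6 bytes: two per character

# UTF-16 ("Unicode" in the book's sense): every character in the basic
# range takes exactly two bytes, Latin or not.
print(len(latin.encode("utf-16-le")))  # 6 bytes
print(len(kana.encode("utf-16-le")))   # 6 bytes
```

The Latin text doubles in size under Unicode, while the Japanese text stays the same—exactly the efficiency argument the paragraph makes.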
Eventually, everybody will use Unicode, but different systems have chosen different ways of dealing with the transition.
- Windows 3.x. Doesn’t know a Unicode from a dress code and never will.
- 16-bit COM. Ditto.
- Windows NT. Was written from the ground up—first to do the right thing (Unicode) and second to be compatible (ANSI). All strings are Unicode internally, but Windows NT also completely supports ANSI by translating internal Unicode strings to ANSI strings at runtime. Obviously, Windows NT programs that use Unicode strings directly can be more efficient by avoiding frequent string translations, although just as obviously, Unicode strings take about twice as much data space.
- Windows 95. Is based largely on Windows 3.x code and therefore uses ANSI strings internally. Furthermore, it doesn’t support Unicode strings even indirectly in most contexts—with one big exception.
- 32-bit Component Object Model. Was written from the ground up to do the right thing (Unicode) and to hell with compatibility. COM doesn’t do ANSI. The COM string types—OLESTR and BSTR—are Unicode all the way. Any 32-bit operating system that wants to do COM must have at least partial support for Unicode. Windows 95 has just enough Unicode support to make COM work.
- Visual Basic. The Basic designers had to make some tough decisions about how they would represent strings internally. They might have chosen ANSI because it’s the common subset of Windows 95 and Windows NT, and converted to Unicode whenever they needed to deal with COM. But since Visual Basic version 5 is COM inside and out (as Chapters 3 and 10 will pound into your head), they chose Unicode as the internal format, despite potential incompatibilities with Windows 95. The Unicode choice caused many problems and inefficiencies both for the developers of Visual Basic and for Visual Basic developers—but the alternative would have been worse.
- The Real World. Most existing data files use ANSI. The WKS, DOC, BAS, TXT, and most other standard file formats use ANSI. If a system uses Unicode internally but needs to read from or write to common data formats, it must do Unicode to ANSI conversion. Someday there will be Unicode data file formats, but it might not happen in your lifetime.
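The boundary conversion described above can be sketched in a few lines. This example is mine, not the book’s, and uses Python; the file name and the Western ANSI code page (cp1252) are arbitrary choices for illustration.

```python
# A system that holds strings in Unicode internally but reads and
# writes ANSI data files must convert at the boundary, as Windows NT
# and Visual Basic do.
import os
import tempfile

internal = "Résumé"  # held internally as Unicode

# Writing: translate the internal Unicode string to the ANSI code page.
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "wb") as f:
    f.write(internal.encode("cp1252"))

# Reading: translate the ANSI bytes back to Unicode.
with open(path, "rb") as f:
    restored = f.read().decode("cp1252")

print(restored == internal)               # True: lossless for cp1252 text
print(os.path.getsize(path))              # 6 bytes on disk, one per character
print(len(internal.encode("utf-16-le")))  # 12 bytes in memory as Unicode
```

The round trip is lossless only for characters the ANSI code page can represent; anything outside it is exactly the trouble the next section is about.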
What does this mean for you? Trouble.