Unicode, UTF-8, UCS-2, UCS-4 ... What Is All of This?

Amy Burns
Producer
Microsoft Corporation

January 8, 1998

The following article was originally published in the Site Builder Network Magazine.

If you are dealing with software or Internet localization, you have inevitably heard the word Unicode. So what is it, how does it work, and what is it good for?

Unicode, also known as UCS-2 or Universal Character Set-2, is a worldwide character-coding system designed to support the interchange, processing, and display of many modern written languages. Basically, Unicode assigns a number to almost every character found in modern written languages. In the current version, Unicode 2.0, there are 38,885 distinct coded characters. These cover the "supported scripts," which include the principal written languages of Europe, the Americas, the Middle East, India, Africa, Asia, and Pacifica. Unicode at this time does not support all written languages.

Here are some examples of the number representation in ASCII and Unicode values:

Character	ASCII	Unicode Value (decimal)
'	'	39
(	(	40
)	)	41
,	,	44
-	-	45
.	.	46
/	/	47
:	:	58
?	?	63
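The values in the table can be checked with a few lines of Python. This is only a sketch: the names TABLE and code_value are made up for this illustration, and it relies on Python's built-in ord(), which returns the Unicode code value of a character.

```python
# Characters from the table above, paired with the decimal values it lists.
# TABLE and code_value are hypothetical names used only for this example.
TABLE = {
    "'": 39, "(": 40, ")": 41, ",": 44, "-": 45,
    ".": 46, "/": 47, ":": 58, "?": 63,
}


def code_value(ch: str) -> int:
    """Return the Unicode code value of one character as a decimal integer."""
    return ord(ch)


for ch, listed in TABLE.items():
    # Print the character, its code value, and whether it matches the table.
    print(ch, code_value(ch), code_value(ch) == listed)
```

For these characters the ASCII value and the Unicode value are the same number, which is why the table needs only one numeric column.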

So Who's in and Who's out?

Scripts supported by the Unicode Consortium are broken down into two types: primary scripts and pseudo scripts. The Unicode Consortium describes a primary script as "a script whose function is to represent the primary linguistic information in a writing system. All primary scripts represent sound, meaning, or a combination of both." A pseudo script is defined as "a collection of symbols, which is used to represent the secondary linguistic information in a writing system. For example, collections of punctuation, symbols, numbers, shapes, etc." Pseudo scripts are also called secondary scripts.

At the completion of version 2.0, the following primary scripts were supported.

Arabic	Gurmukhi	Lao
Armenian	Han	Malayalam
Bengali	Hangul	Oriya
Bopomofo	Hebrew	Phonetic
Cyrillic	Hiragana	Tamil
Devanagari	Kannada	Telugu
Georgian	Katakana	Thai
Greek	Latin	Tibetan
Gujarati

The pseudo scripts supported as of version 2.0 are:

Numbers	Mathematical Symbols	Arrows, Blocks, Box Drawing Forms, and Geometric Shapes
General Diacritics	Technical Symbols	Miscellaneous Symbols
General Punctuation	Dingbats	Presentation Forms
General Symbols

The following modern scripts are not fully supported by Unicode version 2.0. Please note that even though Unicode does not directly support these scripts, some of the languages that use them are commonly written in other scripts that Unicode does support.

Burmese	Mongolian	Tai Lu
Cherokee [accepted for future inclusion]	Moso (Naxi)	Tai Mau
Cree [accepted for future inclusion]	Rong (Lepcha)	Yi (Lolo)
Maldivian (Dihevi)	Sinhalese (Sri Lankan)

In addition, Unicode does not support what the consortium considers to be archaic or obsolete scripts. The following are not supported:

Ahom	Javanese	Pahlavi (Avestan)
Akkadian Cuneiform	Kaithi	Phags-pa
Aramaic	Kawi	Pyu
Babylonian Cuneiform	Khamti	Old Persian Cuneiform
	Kharoshthi	Phoenician
Balinese	Kirat (Limbu)	Northern Runes
Balti	Lahnda	Satavahana
Batak	Linear B	Siddham
Brahmi	Mandaic	South Arabian
Buginese	Mangyan	Sumerian Cuneiform
Chola	Manipuri (Meithei)	Syriac
Cypro-Minoan	Meroitic (Kush)	Tagalog
Etruscan	Modi	Tagbanuwa
Glagolitic	Numidian	Tircul
Hieroglyphic Egyptian		Ugaritic Cuneiform
Hieroglyphic Hittite

A 16-bit encoding standard, the Unicode standard permits over 65,000 characters (65,536, to be exact). Only about 25,000 of these are still unused, and there are many more languages to be supported! What do we do? In comes UCS-4, defined by ISO 10646. UCS-4 is a 32-bit encoding standard that uses 31 of its bits and is divided into 32,768 planes of 65,536 characters each, for a total of roughly 2,147 million characters. Currently only the first plane of UCS-4 is in use (that's Unicode).
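The round figures for planes and characters come from powers of two. Here is a quick sketch of the arithmetic, assuming the 31-bit code space that ISO 10646 defines for UCS-4. The constant names are invented for this example.

```python
# Hypothetical constant names; the arithmetic is plain powers of two.
UCS2_BITS = 16   # Unicode (UCS-2) uses 16 bits per character
UCS4_BITS = 31   # assumption: UCS-4 uses 31 of its 32 bits

positions_per_plane = 2 ** UCS2_BITS        # size of one plane
planes = 2 ** (UCS4_BITS - UCS2_BITS)       # number of planes in UCS-4
total_positions = planes * positions_per_plane

print(positions_per_plane)  # 65536
print(planes)               # 32768
print(total_positions)      # 2147483648
```

Rounded down, these are the "65,000 characters" and "32,000 planes" usually quoted.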

This means that if you are using UCS-4, every character of your text is stored as four bytes (two bytes with UCS-2), sent through the ozone, and reconstructed at the other end. If you are using "supported" scripts, you're okay: your words are put back the way they were found when they reach their destination. If you use a script that Unicode does not currently support, there is no standard code for its characters, so your text will appear corrupted at the other end. Perhaps the words will be munged, or extra spaces will be added, or some other creative interpretation.
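To make the fixed-width idea concrete, here is a small sketch. Python has no codecs named UCS-2 or UCS-4, so this example assumes that its utf-16-be and utf-32-be codecs can stand in for them. That holds for first-plane characters, where the byte sequences are the same. The function name fixed_width_bytes is made up for this illustration.

```python
def fixed_width_bytes(text: str, width: int) -> bytes:
    """Encode text with a fixed number of bytes per character.

    width=2 stands in for UCS-2 and width=4 for UCS-4 (an assumption:
    Python's UTF-16 and UTF-32 codecs match them for first-plane characters).
    """
    codec = {2: "utf-16-be", 4: "utf-32-be"}[width]
    return text.encode(codec)


sample = "A\u00a9"  # the letter A followed by the copyright sign

two_byte = fixed_width_bytes(sample, 2)
four_byte = fixed_width_bytes(sample, 4)

print(two_byte.hex())   # 004100a9
print(four_byte.hex())  # 00000041000000a9

# The receiving end reverses the process to get the original text back.
print(four_byte.decode("utf-32-be") == sample)  # True
```

Every character costs the same number of bytes, so plain English text becomes two or four times larger than it is in ASCII.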

UTF-8: Byte by byte

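What do we do if we want to write an application or Internet document in Chinese? We use UTF-8. UTF-8 (Universal Character Set Transformation Format-8) is part of ISO 10646 (remember? UCS-4). Instead of giving every character a fixed four bytes, it transforms each ISO 10646 character into a sequence of one or more bytes: plain ASCII characters stay a single byte, and other characters take two or more. Your text can then be handled byte by byte instead of four bytes at a time.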

Here are a couple of examples of UTF-8 encoding:

The Unicode character 0xa9 = 10101001 (the copyright sign) encoded as UTF-8 appears as:

11000010 10101001 = 0xc2 0xa9

The character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) encoded as UTF-8 appears as:

11100010 10001001 10100000 = 0xe2 0x89 0xa0
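The bit shuffling in these two examples can be sketched in Python. This is a minimal illustration, not a full encoder: it handles only first-plane code values (below 0x10000), and the names utf8_encode and as_bits are invented for this example.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one first-plane code value as UTF-8 (sketch, no validation)."""
    if code < 0x80:
        # 0xxxxxxx: ASCII stays a single byte.
        return bytes([code])
    if code < 0x800:
        # 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits.
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    if code < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6, low 6.
        return bytes([
            0xE0 | (code >> 12),
            0x80 | ((code >> 6) & 0x3F),
            0x80 | (code & 0x3F),
        ])
    raise ValueError("this sketch handles only first-plane code values")


def as_bits(data: bytes) -> str:
    """Show each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{b:08b}" for b in data)


print(as_bits(utf8_encode(0xA9)))    # 11000010 10101001
print(as_bits(utf8_encode(0x2260)))  # 11100010 10001001 10100000
```

The results match Python's own encoder, for example "\u2260".encode("utf-8").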

How Do I Do This?

First, be sure you are using Internet Explorer version 4.0 or later, so your browser is enabled for Unicode and UTF-8. At the top of each of your HTML pages, directly under the HEAD tag, enter the following META tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Have all of your pages localized by a person fluent in the given language. Save each page, encoded as UTF-8, in the file for its language.

When the Chinese version of Internet Explorer looks up your localized version of a document, it is informed by the META tag that the incoming information is UTF-8, and that it should handle it according to the standard.

Your localized Web pages show up in beautiful Chinese characters, not one out of place!
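The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_page and the two-character sample text are hypothetical, and a real page would carry the translator's full content.

```python
# The META tag from the article, placed directly under the HEAD tag.
META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'


def build_page(body_text: str) -> bytes:
    """Return a minimal HTML page as UTF-8 bytes (hypothetical helper)."""
    html = (
        "<html>\n<head>\n"
        f"{META}\n"
        "</head>\n<body>\n"
        f"{body_text}\n"
        "</body>\n</html>\n"
    )
    # The bytes must really be UTF-8, or the META tag would mislead the browser.
    return html.encode("utf-8")


# Hypothetical sample text: two Chinese characters, U+4E2D and U+6587.
page = build_page("\u4e2d\u6587")

print(page.decode("utf-8"))
```

Writing the returned bytes to a file in binary mode keeps them exactly as encoded, so the declared charset and the actual file contents agree.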

Thanks Be for Standards

Diving into the depths of Unicode gets to be a serious lesson in octets, binary, division and positive visualization. The most important point: If it weren't for the International Organization for Standardization and the Unicode Consortium, cleanly delivering localized applications and documents to a majority of the world would continue to be an extremely awkward process.

Web producer Amy Burns travelled the world to find her way to Microsoft, pausing along the way to teach English in Taiwan, and Web-page building to U.S. high-school students. She also served as a counselor for emotionally troubled teens, which is not that different from her present job.

A bit of history

The Unicode Consortium is a nonprofit organization formed from many high-tech companies, including Microsoft, Apple, AT&T and IBM. It was formed in an effort to solve problems involved with creating code pages for localization efforts. The consortium published Unicode, version 1.0, in 1991. At the same time the International Organization for Standardization (ISO) created another coding standard, ISO 10646. The two groups worked together to merge the two into one standard, and by 1993, with Unicode version 1.1, ISO 10646 and Unicode were identical.