Unicode, UTF-8, UCS-2, UCS-4 ... What Is All of This?


Amy Burns
Producer
Microsoft Corporation

January 8, 1998

The following article was originally published in the Site Builder Network Magazine.

If you are dealing with software or Internet localization, you have inevitably heard the word Unicode. So what is it, how does it work, and what is it good for?

Unicode, also known as UCS-2 or Universal Character Set-2, is a worldwide character-coding system designed to support the interchange, processing, and display of many modern written languages. Basically, Unicode assigns a number to almost every character found in modern written languages. The current version, Unicode 2.0, has 38,885 distinct coded characters. These belong to the "supported scripts," which include the principal written languages of Europe, the Americas, the Middle East, India, Africa, Asia, and Pacifica. Unicode does not yet support all written languages.

Here are some examples of the number representation in ASCII and Unicode values:


Character    ASCII    Unicode Value (decimal)
'            '        39
(            (        40
)            )        41
,            ,        44
-            -        45
.            .        46
/            /        47
:            :        58
?            ?        63
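
The table can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original article; the names table and ascii_matches_unicode are made up for illustration. It confirms that, for these characters, the ASCII byte value and the Unicode value are the same number.

```python
# Characters from the table above, paired with the decimal values it lists.
table = {"'": 39, "(": 40, ")": 41, ",": 44, "-": 45,
         ".": 46, "/": 47, ":": 58, "?": 63}


def ascii_matches_unicode(ch):
    """Return True if the character's ASCII byte equals its Unicode value."""
    return ch.encode("ascii")[0] == ord(ch)


for ch, expected in table.items():
    # ord() gives the Unicode value of a one-character string.
    assert ord(ch) == expected
    assert ascii_matches_unicode(ch)
    print(ch, ord(ch))
```

In other words, for the characters that ASCII already covers, Unicode keeps the same numbers.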

So Who's in and Who's out?

Scripts supported by the Unicode Consortium are broken down into two types: primary scripts and pseudo scripts. The Unicode Consortium describes a primary script as "a script whose function is to represent the primary linguistic information in a writing system. All primary scripts represent sound, meaning, or a combination of both." A pseudo script is defined as "a collection of symbols, which is used to represent the secondary linguistic information in a writing system. For example, collections of punctuation, symbols, numbers, shapes, etc." Pseudo scripts are also called secondary scripts.

At the completion of version 2.0, the following primary scripts were supported.


Arabic        Gurmukhi    Lao
Armenian      Han         Malayalam
Bengali       Hangul      Oriya
Bopomofo      Hebrew      Phonetic
Cyrillic      Hiragana    Tamil
Devanagari    Kannada     Telugu
Georgian      Katakana    Thai
Greek         Latin       Tibetan
Gujarati

The pseudo scripts supported as of version 2.0 are:


Numbers                Mathematical Symbols    Arrows, Blocks, Box Drawing Forms, and Geometric Shapes
General Diacritics     Technical Symbols       Miscellaneous Symbols
General Punctuation    Dingbats                Presentation Forms
General Symbols

The following modern scripts are not fully supported by Unicode version 2.0. Please note that even though Unicode does not directly support these scripts, some of the languages that use them are also commonly written with other scripts that Unicode does support.


Burmese                                     Mongolian                 Tai Lu
Cherokee [accepted for future inclusion]    Moso (Naxi)               Tai Mau
Cree [accepted for future inclusion]        Rong (Lepcha)             Yi (Lolo)
Maldivian (Dihevi)                          Sinhalese (Sri Lankan)

In addition, Unicode does not support what the consortium considers to be archaic or obsolete scripts. The following are not supported:


Ahom                     Javanese              Pahlavi (Avestan)
Akkadian Cuneiform       Kaithi                Phags-pa
Aramaic                  Kawi                  Pyu
Babylonian Cuneiform     Khamti                Old Persian Cuneiform
Balinese                 Kharoshthi            Phoenician
Balti                    Kirat (Limbu)         Northern Runes
Batak                    Lahnda                Satavahana
Brahmi                   Linear B              Siddham
Buginese                 Mandaic               South Arabian
Chola                    Mangyan               Sumerian Cuneiform
Cypro-Minoan             Manipuri (Meithei)    Syriac
Etruscan                 Meroitic (Kush)       Tagalog
Glagolitic               Modi                  Tagbanuwa
Hieroglyphic Egyptian    Numidian              Tircul
Hieroglyphic Hittite                           Ugaritic Cuneiform

A 16-bit encoding standard, Unicode has room for just over 65,000 characters (65,536). Only about 25,000 of those positions are still unused, and there are many more languages to be supported! What do we do? In comes UCS-4, the 32-bit form defined by ISO 10646. UCS-4 is divided into about 32,000 planes (32,768, because only 31 of the 32 bits are used), each with a capacity of about 65,000 characters, for a total of roughly 2,147 million characters. Currently only the first plane of UCS-4 is in use (that's Unicode).
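
The figures above are rounded, and multiplying rounded figures gives a somewhat smaller total than the exact one. The following Python sketch, added here as an illustration, works out the exact values from powers of two. It assumes the ISO 10646 layout of 128 groups of 256 planes (31 usable bits); the variable names are made up for this example.

```python
BITS_UCS2 = 16
CODED_IN_UNICODE_2_0 = 38_885  # character count quoted earlier in the article

# A 16-bit code has 2**16 positions.
ucs2_capacity = 2 ** BITS_UCS2

# Positions in the 16-bit space that have no coded character yet.
unused_ucs2 = ucs2_capacity - CODED_IN_UNICODE_2_0

# Assumption: UCS-4 uses 31 of its 32 bits, arranged as
# 128 groups x 256 planes, with each plane as large as the 16-bit space.
planes = 128 * 256
ucs4_capacity = planes * ucs2_capacity

print("16-bit capacity:", ucs2_capacity)  # 65536
print("not yet coded:  ", unused_ucs2)    # 26651
print("UCS-4 planes:   ", planes)         # 32768
print("UCS-4 capacity: ", ucs4_capacity)  # 2147483648
```

The first plane accounts for only 65,536 of those positions, which is why the 32-bit space leaves so much room for future scripts.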

This means that if you are using UCS-4, every character is stored as a fixed four-byte unit (two bytes in 16-bit Unicode), and your text is sent through the ozone as a stream of these units to be reconstructed at the other end. If you are using "supported" scripts, you're okay: your words are put back the way they were found when they reach their destination. If you use a script that Unicode does not currently support, there are no codes for its characters, and your text will appear corrupted at the other end. Perhaps the words will be munged, or extra spaces will be added, or some other creative interpretation will be applied.
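
To make the idea of fixed-size units concrete, here is a small Python sketch added for illustration. It uses Python's built-in utf-16-be and utf-32-be codecs as stand-ins for UCS-2 and UCS-4; that substitution is an assumption, and it holds only because every character in the sample sits in the first plane. The helper name units is made up for this example.

```python
# Sample text: "A", the copyright sign (0xa9), and the "not equal" symbol (0x2260).
text = "A\u00a9\u2260"

# Stand-ins for the two fixed-width forms (valid for first-plane characters).
ucs2_like = text.encode("utf-16-be")  # two bytes per character
ucs4_like = text.encode("utf-32-be")  # four bytes per character


def units(data, width):
    """Split a byte string into fixed-width units, shown as hex strings."""
    return [data[i:i + width].hex() for i in range(0, len(data), width)]


print(units(ucs2_like, 2))  # ['0041', '00a9', '2260']
print(units(ucs4_like, 4))  # ['00000041', '000000a9', '00002260']
```

The receiver only has to cut the stream at the same fixed width to get the same numbers back.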

UTF-8: Byte by byte

What do we do if we want to write an application or Internet document in Chinese? We use UTF-8. UTF-8 (Universal Character Set Transformation Format-8) is part of ISO 10646 (remember? UCS-4). It can carry the full 32-bit range of ISO 10646, but it encodes each character as a sequence of one or more single bytes instead of a fixed four-byte unit.

Here are a couple of examples of UTF-8 encoding:

The Unicode character 0xa9 = 10101001 (the copyright sign) encoded as UTF-8 appears as:

11000010 10101001 = 0xc2 0xa9

The character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) encoded as UTF-8 appears as:

11100010 10001001 10100000 = 0xe2 0x89 0xa0
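
The bit patterns in these two examples can be reproduced by hand. The following Python sketch is an illustration added to the article, not a complete encoder: it covers only values below 0x10000 (one to three bytes), and the names utf8_encode and as_bits are made up for this example. Each result is checked against Python's built-in UTF-8 codec.

```python
def utf8_encode(code_point):
    """Encode one value below 0x10000 as UTF-8 by hand (1 to 3 bytes)."""
    if code_point < 0x80:
        # One byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("this sketch only handles values below 0x10000")


def as_bits(data):
    """Show each byte as eight binary digits."""
    return " ".join(format(b, "08b") for b in data)


for cp in (0xA9, 0x2260):
    encoded = utf8_encode(cp)
    # Cross-check against the built-in codec.
    assert encoded == chr(cp).encode("utf-8")
    print(hex(cp), "=", as_bits(encoded))
```

The leading bits of the first byte (110 or 1110) tell the reader how many bytes belong to the character, and every following byte starts with 10.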

How Do I Do This?

First, be sure you are using Internet Explorer version 4.0 or later, so your browser is enabled for Unicode and UTF-8. At the top of each of your HTML pages, directly under the HEAD tag, enter the following META tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Have all of your pages localized by a person fluent in the given language. Save each localized page in its own file, encoded as UTF-8 so that the contents match the META tag.

When the Chinese version of Internet Explorer loads your localized version of a document, the META tag informs it that the incoming information is UTF-8 and that it should handle it according to the standard.

Your localized Web pages show up in beautiful Chinese characters, not one out of place!
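
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any browser works internally: the file name index_zh.html, the sample text, and the helper names build_page and declared_charset are all hypothetical. The sketch saves a page as UTF-8 with the META tag in place, then reads the raw bytes back and decodes them using the charset the tag declares.

```python
import os
import re
import tempfile

META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'


def build_page(body_text):
    """Build a tiny HTML page with the META tag directly under HEAD."""
    return ("<html>\n<head>\n" + META + "\n</head>\n"
            "<body>" + body_text + "</body>\n</html>\n")


def declared_charset(raw):
    """Find the charset named in the META tag; the tag itself is plain ASCII."""
    match = re.search(rb"charset=([A-Za-z0-9_-]+)", raw)
    return match.group(1).decode("ascii") if match else None


# Hypothetical sample content: two Chinese characters, written as escapes.
page = build_page("\u4e2d\u6587")

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "index_zh.html")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(page)

# Read the raw bytes back and decode them as the META tag instructs.
with open(path, "rb") as f:
    raw = f.read()

charset = declared_charset(raw)
decoded = raw.decode(charset)
assert decoded == page
print("declared charset:", charset)
```

The point of the design is that the declaration and the actual bytes agree: if the file were saved in some other encoding while the tag still said UTF-8, the decoding step would fail or produce the wrong characters.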

Thanks Be for Standards

Diving into the depths of Unicode gets to be a serious lesson in octets, binary, division, and positive visualization. The most important point: If it weren't for the International Organization for Standardization and the Unicode Consortium, cleanly delivering localized applications and documents to a majority of the world would continue to be an extremely awkward process.

Web producer Amy Burns travelled the world to find her way to Microsoft, pausing along the way to teach English in Taiwan, and Web-page building to U.S. high-school students. She also served as a counselor for emotionally troubled teens, which is not that different from her present job.


A bit of history

The Unicode Consortium is a nonprofit organization formed by many high-tech companies, including Microsoft, Apple, AT&T, and IBM. It was formed in an effort to solve the problems involved in creating code pages for localization efforts. The consortium published version 1.0 of Unicode in 1991. During the same period, the International Organization for Standardization (ISO) was developing another coding standard, ISO 10646. The two groups worked together to merge their efforts into one standard, and by 1993, ISO 10646 and Unicode (version 1.1) were identical.




© 1999 Microsoft Corporation. All rights reserved.