Understanding the Unicode Standard

The Unicode standard defines codes for characters in most major languages written today. Scripts include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs. Several scripts were added more recently, including Ethiopic, Canadian Syllabics, Cherokee, Sinhala, Syriac, Burmese, Khmer, and Braille.
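
As a minimal sketch of what "defines codes for characters" means in practice, the following Python snippet prints the code value and the standard character name for one letter from each of several scripts listed above. The sample characters and the helper name describe are illustrative choices, not part of the original text.

```python
import unicodedata

# One sample character from several of the scripts named above
# (illustrative picks: Latin, Greek, Cyrillic, Hebrew, Japanese Kana).
samples = ["A", "\u03B1", "\u0416", "\u05D0", "\u3042"]


def describe(ch: str) -> str:
    """Return the code value in U+XXXX notation plus the character's name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in samples:
    print(describe(ch))
```

Each character maps to exactly one code value, regardless of the language or platform that uses it.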

The Unicode standard also includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, and dingbats. Diacritics are modifying marks, such as the tilde (~), that are used in conjunction with base characters to encode accented or vocalized letters; for example, ñ. In all, the Unicode standard provides codes for nearly 39,000 characters from the world's alphabets, ideograph sets, and symbol collections.
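
The way a diacritic combines with a base character can be sketched in Python. The example below assumes the standard library's unicodedata module and uses normalization forms NFC and NFD, which the text above does not mention; the variable names are illustrative. It shows that ñ can be stored either as one precomposed code value or as the base letter n followed by a combining tilde.

```python
import unicodedata

# Two encodings of the same accented letter.
precomposed = "\u00F1"   # LATIN SMALL LETTER N WITH TILDE, one code value
decomposed = "n\u0303"   # base letter n + COMBINING TILDE, two code values

# The raw strings differ even though they render identically.
raw_equal = precomposed == decomposed

# Normalization converts between the two forms.
composed_form = unicodedata.normalize("NFC", decomposed)
decomposed_form = unicodedata.normalize("NFD", precomposed)

# A nonzero combining class marks a character as a combining mark.
tilde_class = unicodedata.combining("\u0303")

print(raw_equal)
print(composed_form == precomposed)
print([f"U+{ord(c):04X}" for c in decomposed_form])
print(tilde_class)
```

Code that compares or searches accented text generally needs to normalize both sides to the same form first, because the two encodings are not equal as raw strings.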

In addition, approximately 18,000 unused code values are reserved for future use. The Unicode standard also sets aside 6,400 code values for private use; software and hardware developers can assign these internally to their own characters and symbols.
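
The 6,400 developer-assignable code values can be checked with a short Python sketch. The text above does not name the range; the snippet assumes it is the Private Use Area at U+E000 through U+F8FF. The constant and function names are hypothetical, chosen for illustration.

```python
import unicodedata

# Assumed range of the private use block referred to above.
PUA_FIRST = 0xE000
PUA_LAST = 0xF8FF

# Number of code values in the block, inclusive of both ends.
pua_size = PUA_LAST - PUA_FIRST + 1


def is_private_use(ch: str) -> bool:
    """True if ch has general category 'Co' (private use).

    Note: later versions of the standard define additional private use
    ranges outside this block; they report the same category.
    """
    return unicodedata.category(ch) == "Co"


print(pua_size)
print(is_private_use(chr(PUA_FIRST)), is_private_use("A"))
```

Because the standard assigns no meaning to these values, text that uses them is only interpretable by software that shares the same private agreement.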