F.2. Specification

The FSS-UTF encodes character values in the range [0, 0x7FFFFFFF] using multi-byte characters of lengths 1, 2, 3, 4, 5, and 6 bytes. For all encodings of more than one byte, the initial byte indicates the number of bytes in the sequence by setting that many high-order bits to 1. The next most significant bit is always 0. For example, a 2-byte sequence starts with 110 and a 6-byte sequence starts with 1111110.

The following table shows the format of the first byte of a character; the free bits available for coding the character are indicated by an x.

Byte                   Value      Bits Free
First of 2 bytes       110xxxxx   5
First of 3 bytes       1110xxxx   4
First of 4 bytes       11110xxx   3
First of 5 bytes       111110xx   2
First of 6 bytes       1111110x   1
All subsequent bytes   10xxxxxx   6
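The bit patterns in this table can be checked mechanically. The following is a minimal Python sketch, not part of the specification: the function names `sequence_length` and `free_bits` are hypothetical, and the handling of single-byte characters (high bit 0, 7 free bits) is taken from the 1-byte case rather than from the table, which lists only multi-byte forms.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a first byte (hypothetical helper).

    Raises ValueError for a continuation byte (10xxxxxx) and for 0xFE/0xFF,
    which match none of the first-byte patterns.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte & 0x80 == 0:
        return 1  # 0xxxxxxx: single-byte character
    # Count the run of 1 bits starting at the most significant bit.
    length = 0
    mask = 0x80
    while mask and first_byte & mask:
        length += 1
        mask >>= 1
    if length == 1:
        raise ValueError("10xxxxxx is a subsequent byte, not a first byte")
    if length > 6:
        raise ValueError("0xFE and 0xFF are not valid first bytes")
    return length


def free_bits(first_byte: int) -> int:
    """Return how many x bits the first byte contributes to the value."""
    n = sequence_length(first_byte)
    # One-byte form: 7 bits. n-byte form: n marker bits plus one 0 bit are used.
    return 7 if n == 1 else 7 - n
```

For instance, `sequence_length(0xE2)` returns 3 and `free_bits(0xE2)` returns 4, matching the "First of 3 bytes" row.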


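Because every subsequent byte has the form 10xxxxxx and no first byte does, any byte that does not start with 10 is the start of an FSS-UTF character sequence. The table below shows, for each range of character values, the number of value bits and the corresponding FSS-UTF byte sequence: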

Bits   Hex Min    Hex Max    Byte Sequence in Binary
7      00000000   0000007F   0vvvvvvv
11     00000080   000007FF   110vvvvv 10vvvvvv
16     00000800   0000FFFF   1110vvvv 10vvvvvv 10vvvvvv
21     00010000   001FFFFF   11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
26     00200000   03FFFFFF   111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
31     04000000   7FFFFFFF   1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
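As an illustration of how this table maps a value to bytes, here is a minimal encoder sketch in Python. It is an assumption-laden example rather than a reference implementation: the name `fss_utf_encode` and the `_RANGES` table are hypothetical, and the sketch always picks the first (shortest) row whose Hex Max covers the value.

```python
# (Hex Max, first-byte marker bits) for sequence lengths 1 through 6,
# transcribed from the table above.
_RANGES = [
    (0x0000007F, 0x00),  # 0vvvvvvv
    (0x000007FF, 0xC0),  # 110vvvvv
    (0x0000FFFF, 0xE0),  # 1110vvvv
    (0x001FFFFF, 0xF0),  # 11110vvv
    (0x03FFFFFF, 0xF8),  # 111110vv
    (0x7FFFFFFF, 0xFC),  # 1111110v
]


def fss_utf_encode(value: int) -> bytes:
    """Encode one character value using the shortest applicable row."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside [0, 0x7FFFFFFF]")
    # Find the first row whose maximum covers the value.
    for n_bytes, (max_value, marker) in enumerate(_RANGES, start=1):
        if value <= max_value:
            break
    out = bytearray(n_bytes)
    # Fill subsequent bytes from the end, 6 value bits each (10vvvvvv).
    for i in range(n_bytes - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    # Whatever remains fits in the free bits of the first byte.
    out[0] = marker | value
    return bytes(out)
```

For example, `fss_utf_encode(0x20AC).hex()` gives `'e282ac'`, and `fss_utf_encode(0x7FFFFFFF).hex()` gives `'fdbfbfbfbfbf'`.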


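The Unicode value is the concatenation of the v bits in the multi-byte encoding, most significant bits first. When a value could be encoded in more than one way, only the shortest encoding is legal; for example, U+0000 must be encoded as the single byte 00, never as the two-byte sequence C0 80.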
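The two rules in this paragraph (concatenate the v bits, reject anything but the shortest form) can be sketched as a small decoder. This is a hedged illustration in Python, not the specification's own algorithm: the name `fss_utf_decode_one`, the `_MIN_FOR_LENGTH` table, and the choice to signal errors with `ValueError` are all assumptions made for the example.

```python
from typing import Tuple

# Smallest value that genuinely needs each sequence length (index = length),
# taken from the Hex Min column of the table above.
_MIN_FOR_LENGTH = [None, 0x0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000]


def fss_utf_decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one character at data[pos]; return (value, next position)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1  # 0vvvvvvv
    # Count leading 1 bits of the first byte to get the sequence length.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if length == 1 or length > 6:
        raise ValueError("invalid first byte")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    # Keep only the free (v) bits of the first byte.
    value = first & (0x7F >> length)
    for b in data[pos + 1 : pos + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected a 10vvvvvv byte")
        value = (value << 6) | (b & 0x3F)  # append 6 more v bits
    # Shortest-form rule: the value must actually need this many bytes.
    if value < _MIN_FOR_LENGTH[length]:
        raise ValueError("non-shortest encoding")
    return value, pos + length
```

With this sketch, `fss_utf_decode_one(b"\xe2\x82\xac")` returns `(0x20AC, 3)`, while `fss_utf_decode_one(b"\xc0\x80")` raises `ValueError` because the two-byte form of U+0000 is not the shortest encoding.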