1.3. Format Indicators

<+sup>X<-sup>           superscript
<+sub>X<-sub>           subscript
<join>                  zero-width joiner
<no-join>               zero-width non-joiner
<break>                 zero-width space
<no-break>              zero-width non-breaking space
<font variant>          specialized font variation
<+circled>X<-circled>   circled characters (single characters may use U+20DD)


The superscript and subscript indicators show that when a compatibility character is replaced by the corresponding nominal form, the replacement should be put into that format. The form <+x> indicates the start of the format, and the form <-x> indicates the end. In determining canonical equivalency, as per Section 4.4, p. 10, the decision about whether or not to distinguish between characters with and characters without formatting information is left up to the implementation.
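As an illustration of the paired <+x> ... <-x> notation, the following minimal Python sketch splits a decomposition string into formatted and unformatted spans. The function names `split_format_spans` and `strip_format` are hypothetical, chosen for this example only; the sketch handles only the sup and sub indicators and assumes they are not nested.

```python
import re

# Matches a paired indicator such as <+sup>a<-sup>. The backreference \1
# requires the closing indicator to name the same format as the opening one.
_FORMAT_SPAN = re.compile(r"<\+(sup|sub)>(.*?)<-\1>", re.DOTALL)


def split_format_spans(text):
    """Return a list of (format, content) pairs; format is None for plain text."""
    spans = []
    pos = 0
    for match in _FORMAT_SPAN.finditer(text):
        if match.start() > pos:
            spans.append((None, text[pos:match.start()]))
        spans.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        spans.append((None, text[pos:]))
    return spans


def strip_format(text):
    """Drop the formatting, as an implementation that chooses not to
    distinguish formatted from unformatted characters might do."""
    return "".join(content for _, content in split_format_spans(text))
```

An implementation that does distinguish formatting would compare the (format, content) pairs; one that does not would compare the output of `strip_format`.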

The <join> and <no-join> indicators show that it may be necessary to add a ZERO WIDTH JOINER or a ZERO WIDTH NON-JOINER, respectively:

[200C]  ZERO WIDTH NON-JOINER
[200D]  ZERO WIDTH JOINER


Sequences containing these indicators can be simplified as follows:

X<join><join>Y         → X<join>Y
X<no-join><no-join>Y   → X<no-join>Y
X<join>Y               → XY if X and Y would join anyway under contextual analysis
X<no-join>Y            → XY if X and Y would not join anyway under contextual analysis
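The four simplification rules for joining indicators can be sketched in Python as below. This is only an illustration: `simplify_joins` is a hypothetical name, and `would_join` is an assumed caller-supplied predicate standing in for the contextual analysis, which this document does not specify.

```python
import re


def simplify_joins(text, would_join):
    """Apply the simplification rules to <join>/<no-join> indicators.

    would_join(x, y) is a hypothetical predicate: True when x and y would
    join under contextual analysis with no indicator between them.
    """
    # Rules 1 and 2: a run of identical indicators collapses to one.
    text = re.sub(r"(?:<join>){2,}", "<join>", text)
    text = re.sub(r"(?:<no-join>){2,}", "<no-join>", text)

    # Rules 3 and 4: drop an indicator that only restates the default.
    def drop_if_redundant(match):
        x, indicator, y = match.group(1), match.group(2), match.group(3)
        redundant = would_join(x, y) == (indicator == "join")
        return x if redundant else match.group(0)

    # Y is matched in a lookahead so it can serve as X for the next indicator.
    pattern = r"([^<>])<(join|no-join)>(?=([^<>]))"
    return re.sub(pattern, drop_if_redundant, text)
```

With a toy predicate such as `lambda x, y: x.islower() and y.islower()`, the call `simplify_joins("a<join><join>b", ...)` reduces to `"ab"`, while `"A<join>B"` keeps its indicator because the default there is not to join.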


The <break> and <no-break> indicators show that it may be necessary to add a ZERO WIDTH SPACE or a ZERO WIDTH NO-BREAK SPACE, respectively:

[200B]  ZERO WIDTH SPACE
[FEFF]  ZERO WIDTH NO-BREAK SPACE {BYTE ORDER MARK}


Sequences containing these indicators can be simplified as follows:

X<no-break><no-break>Y   → X<no-break>Y
X<break><break>Y         → X<break>Y
X<no-break>Y             → XY if X and Y would not have a word break between them anyway
X<break>Y                → XY if X and Y would have a word break between them anyway
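As a sketch of how the break indicators might be turned into actual characters, the Python below first applies the simplification rules and then substitutes [200B] or [FEFF] for whatever indicators remain. The name `realize_breaks` is hypothetical, and `has_break` is an assumed caller-supplied predicate standing in for word-break analysis, which is not defined here.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"           # [200B]
ZERO_WIDTH_NO_BREAK_SPACE = "\ufeff"  # [FEFF]


def realize_breaks(text, has_break):
    """Simplify <break>/<no-break> indicators, then emit real characters.

    has_break(x, y) is a hypothetical predicate: True when a word break
    would fall between x and y with no indicator present.
    """
    # A run of identical indicators collapses to a single one.
    text = re.sub(r"<(break|no-break)>(?:<\1>)+", r"<\1>", text)

    def replace(match):
        x, indicator, y = match.group(1), match.group(2), match.group(3)
        if has_break(x, y) == (indicator == "break"):
            return x  # the indicator restates the default, so drop it
        if indicator == "break":
            return x + ZERO_WIDTH_SPACE
        return x + ZERO_WIDTH_NO_BREAK_SPACE

    # Y is matched in a lookahead so it can serve as X for the next indicator.
    pattern = r"([^<>])<(break|no-break)>(?=([^<>]))"
    return re.sub(pattern, replace, text)
```

Dropping redundant indicators before substitution keeps the output free of zero-width characters that would have no effect.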


Characters that are specialized font variants may be marked with an indicator <font variant> if they differ significantly from the nominal form in shape, position or width. (Characters in the compatibility zone are generally subject to such differences, and are not specially marked.)

The Unicode Standard, Version 1.1, does not provide decompositions for mathematical operators. The Unicode Standard generally avoids coding the negated operations such as [2260] NOT EQUAL TO, since they can be composed using the non-spacing mark [0338] COMBINING LONG SOLIDUS OVERLAY (see note 9). Some precomposed negated operators were encoded, however, for compatibility. Dingbats and APL symbols are also treated as atomic.
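The composition of a negated operator from a base operator and [0338] can be illustrated with Python's standard `unicodedata` module. Note the assumption: that module reflects a much later version of the Unicode Character Database, in which [2260] does carry a canonical decomposition, so the sketch shows the composition idea and not the Version 1.1 data described here.

```python
import unicodedata

EQUALS_SIGN = "\u003d"
COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"
NOT_EQUAL_TO = "\u2260"

# Base operator followed by the non-spacing overlay.
sequence = EQUALS_SIGN + COMBINING_LONG_SOLIDUS_OVERLAY

# In later Unicode versions, canonical composition (NFC) maps the sequence
# to the precomposed operator, and canonical decomposition (NFD) maps back.
composed = unicodedata.normalize("NFC", sequence)
decomposed = unicodedata.normalize("NFD", NOT_EQUAL_TO)

print(composed == NOT_EQUAL_TO)
print(decomposed == sequence)
```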

1 Those that are neither precomposed nor compatibility characters.

2 Non-ideographic characters: The ideographic characters are not defined by name, and have not changed from Unicode 1.0.

3 The APL symbols could all have been decomposed; however, they are treated as a closed set of non-extensible operators, and it did not appear worthwhile to provide decompositions at this time.

4 As in any other case, you can file bug reports against the mapping list. Contact the Unicode Consortium for more information.

5 UCS is an abbreviation for the 10646 character set. Unicode is identical in code and repertoire with the 2-byte form, UCS-2.

6 Unicode only requires values up to FFFF and so only uses multi-byte characters of lengths up to 3, but for completeness the full ranges of the format are described.

7 The APL symbols could all have been decomposed; however, they are treated as a closed set of non-extensible operators, and it did not appear worthwhile to provide decompositions at this time.

8 For example, [00AA]* FEMININE ORDINAL INDICATOR = <+sup> [0061] LATIN SMALL LETTER A {& [0331] COMBINING MACRON BELOW} <-sup>

9 This character is used to express negation, and may change its appearance (slope, length, position and weight) depending on the shape of the modified operation. To reflect a specific glyphic variant of a negated operator in special cases, other non-spacing slashes may be used.