4.4. Unicode Character Equivalence

Unicode 1.0 defines two sequences of non-spacing marks to be equivalent if they do not interact typographically (Volume 1, p. 18, paragraph 2; and p. 21, item 3). This works well for the cited examples, such as NON-SPACING DOT BELOW and NON-SPACING DOT ABOVE, which clearly do not interact.

However, a general, consistent method of determining character string equivalence requires an explicit algorithm, which is provided below. In outline, every Unicode non-spacing mark has an associated non-spacing priority (spacing marks have a null priority). Whenever a character with a non-null priority is encountered, a reordering algorithm is invoked: any sequence of consecutive marks with non-null priority is sorted by priority, and marks of equal priority keep their original order. This algorithm is a logical description of the process; implementations may use optimized algorithms as long as they are equivalent (that is, they produce the same result).
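The reordering step can be sketched in Python as follows. This is a minimal illustration, not the normative algorithm: the PRIORITY table and its numeric values are hypothetical, chosen only so that underdot has a larger priority than diaeresis while diaeresis and breve share one, and characters are represented by names rather than code points. The sketch relies on Python's sorted() being stable, so marks of equal priority are never rearranged.

```python
from itertools import groupby

# Hypothetical priority values; only their relative order matters here.
# Anything absent from the table (base and spacing characters) is treated
# as having a null priority.
PRIORITY = {"diaeresis": 1, "breve": 1, "underdot": 2}

def reorder(chars):
    """Sort each maximal run of non-null-priority marks by priority."""
    out = []
    for is_mark, run in groupby(chars, key=lambda c: c in PRIORITY):
        # sorted() is stable, so marks with equal priority keep their order.
        out.extend(sorted(run, key=PRIORITY.get) if is_mark else run)
    return out

print(reorder(["a", "underdot", "diaeresis"]))  # ['a', 'diaeresis', 'underdot']
print(reorder(["a", "breve", "diaeresis"]))     # ['a', 'breve', 'diaeresis']
```

Because base characters break the runs, each sort only ever touches the marks that follow a single base character.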

Examples:

a + underdot + diaeresis   equals   a + diaeresis + underdot

a + diaeresis + underdot   equals   a + underdot + diaeresis


Since underdot has a larger non-spacing priority than diaeresis, the algorithm returns the a, then the diaeresis, then the underdot. However, since diaeresis and breve have the same non-spacing priority (because they interact typographically), they are not rearranged:

a + breve + diaeresis   does not equal   a + diaeresis + breve

a + diaeresis + breve   does not equal   a + breve + diaeresis


Thus we get the following results when applying the algorithm. If the results compare as equal, then the originals are equivalent.

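Original                  Decompose                 Sort                      Result
------------------------  ------------------------  ------------------------  ------------------------
a-diaeresis + underdot    a + diaeresis + underdot  a + diaeresis + underdot  a + diaeresis + underdot
a + diaeresis + underdot  a + diaeresis + underdot  a + diaeresis + underdot  a + diaeresis + underdot
a + underdot + diaeresis  a + underdot + diaeresis  a + diaeresis + underdot  a + diaeresis + underdot
a-underdot + diaeresis    a + underdot + diaeresis  a + diaeresis + underdot  a + diaeresis + underdot
a-diaeresis + breve       a + diaeresis + breve     a + diaeresis + breve     a + diaeresis + breve
a + diaeresis + breve     a + diaeresis + breve     a + diaeresis + breve     a + diaeresis + breve
a + breve + diaeresis     a + breve + diaeresis     a + breve + diaeresis     a + breve + diaeresis
a-breve + diaeresis       a + breve + diaeresis     a + breve + diaeresis     a + breve + diaeresis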
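
The decompose, sort, and compare steps tabulated above could be sketched in Python along the following lines. The PRIORITY and DECOMPOSITION tables are hypothetical stand-ins that cover only the characters used in these examples, and names are used in place of code points; a real implementation would draw both tables from the character database.

```python
from itertools import groupby

# Hypothetical tables for illustration only.
PRIORITY = {"diaeresis": 1, "breve": 1, "underdot": 2}
DECOMPOSITION = {
    "a-diaeresis": ["a", "diaeresis"],
    "a-underdot": ["a", "underdot"],
    "a-breve": ["a", "breve"],
}

def decompose(chars):
    """Replace each precomposed character by its base plus mark."""
    return [part for c in chars for part in DECOMPOSITION.get(c, [c])]

def sort_marks(chars):
    """Stable-sort every run of marks by priority; leave other characters alone."""
    out = []
    for is_mark, run in groupby(chars, key=lambda c: c in PRIORITY):
        out.extend(sorted(run, key=PRIORITY.get) if is_mark else run)
    return out

def equivalent(x, y):
    """Two sequences are equivalent if their results compare as equal."""
    return sort_marks(decompose(x)) == sort_marks(decompose(y))

print(equivalent(["a-diaeresis", "underdot"], ["a-underdot", "diaeresis"]))  # True
print(equivalent(["a-diaeresis", "breve"], ["a-breve", "diaeresis"]))        # False
```

Decomposition has to run before the sort, since a precomposed character hides a mark that may need to move past its neighbors.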


Characters have the same priority if they interact typographically; different priorities if they do not. Enclosing characters have the priority of base characters.

Note: Base characters never sort relative to one another, so the amount of work is limited by the number of non-spacing marks in a row.

This algorithm establishes the canonical equivalence of two sequences of characters; for example, it establishes that o + diaeresis is canonically equivalent to ö. This should not be confused with language-specific collation or matching, which may add further information. For example, in Swedish, ö is treated as a completely different letter from o, collated after z. In German, ö is weakly equivalent to oe, and collated with oe. In English or French, ö is just an o with a diacritic that indicates that it is pronounced separately from the previous letter (as in coöperate), and is collated with o.
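
As a rough present-day analogy, the o + diaeresis case can be checked with Python's standard unicodedata module. Note the assumption: unicodedata.normalize implements the normalization forms of later Unicode versions, not the priority scheme described in this section, so the snippet illustrates only the idea of canonical equivalence and says nothing about language-specific collation.

```python
import unicodedata

composed = "\u00f6"      # ö as a single precomposed character
decomposed = "o\u0308"   # o followed by a combining diaeresis

# The raw strings differ code point by code point...
print(composed == decomposed)  # False

# ...but they agree once both are brought to the same decomposed form.
same = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
print(same)  # True
```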

Collation sequences may not require correct sorting outside of a given domain, and may choose not to invoke the canonical equivalence algorithm for excluded characters. For example, an English collator may not need to sort Cyrillic letters properly. In that case, it does not have to maximally decompose and reorder Cyrillic letters, and may simply sort them according to Unicode order.
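
One way such a domain-limited collator might be sketched in Python is shown below. Everything here is an assumption made for illustration: the in_domain test (Latin letters and combining marks, identified by character name), the sort_key helper, and the use of unicodedata.normalize as a stand-in for the canonical equivalence algorithm. It is not a real English collator; it only shows canonicalization being skipped for excluded characters.

```python
import unicodedata

def in_domain(ch):
    # Assumption for this sketch: the collator's domain is Latin letters
    # plus the combining marks that may follow them.
    return unicodedata.name(ch, "").startswith(("LATIN", "COMBINING"))

def sort_key(s):
    # Canonicalize only strings that lie wholly inside the domain;
    # everything else is compared in plain code point order.
    if all(in_domain(ch) for ch in s):
        return unicodedata.normalize("NFD", s)
    return s

# Inside the domain, equivalent spellings of ö share one key.
print(sort_key("\u00f6") == sort_key("o\u0308"))        # True
# Outside it, equivalent Cyrillic spellings are left distinct.
print(sort_key("\u0439") == sort_key("\u0438\u0306"))   # False
```

The trade-off is that excluded characters sort consistently but not necessarily correctly, which is acceptable only when the collator's users never depend on their order.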