4.4. Unicode Character Equivalence
Unicode 1.0 defines two sequences of non-spacing marks to be equivalent if they do not interact typographically (Volume 1, p. 18, paragraph 2; and p. 21, item 3). This works well for the examples cited there, such as NON-SPACING DOT BELOW and NON-SPACING DOT ABOVE, which clearly do not interact.
However, a general, consistent method of determining character string equivalence requires an explicit algorithm, which is provided below. In outline, it works as follows. Every Unicode non-spacing mark has an associated non-spacing priority (spacing marks have a null priority). Whenever a character with a non-null priority is encountered, a reordering algorithm is invoked: any sequence of marks with non-null priority is sorted by priority. The algorithm given here is a logical description of the process; implementations may use optimized algorithms as long as they are equivalent (that is, they produce the same result).
- Two sequences of characters are equivalent if their canonical ordering representation is identical.
- The canonical ordering representation of a string of characters is determined in the following way:
- Decompose all precomposed characters in the string, based upon Appendix I: Unicode 1.1 Character List, p. 43. This will form the maximal decomposition of the string into component characters.
- Assign a canonical priority to each character, based upon Appendix D: Canonical Ordering Priorities, p. 23. Let p(A) be the priority of the character A.
- Sort the string by successively exchanging each pair (A, B) of adjacent characters wherever p(B) ≠ 0 and p(A) > p(B).
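The steps above can be sketched in Python as follows. This is a minimal illustration, not the normative algorithm: characters are represented by descriptive names rather than code points, and the `PRIORITY` and `DECOMPOSITION` tables are hypothetical stand-ins for Appendix D and Appendix I. The numeric values are assumptions chosen only so that diaeresis and breve share a priority and underdot has a larger one, as the examples in this section describe.

```python
# Hypothetical priorities, for illustration only. The real values come from
# Appendix D and are not reproduced here. Unlisted characters get 0 (null).
PRIORITY = {
    "diaeresis": 2,  # assumed equal to breve (they interact typographically)
    "breve": 2,
    "underdot": 3,   # assumed larger than diaeresis
}

# Hypothetical decomposition table standing in for Appendix I.
DECOMPOSITION = {
    "a-diaeresis": ["a", "diaeresis"],
    "a-underdot": ["a", "underdot"],
    "a-breve": ["a", "breve"],
}


def priority(ch):
    """Return p(ch); base characters have the null priority 0."""
    return PRIORITY.get(ch, 0)


def decompose(seq):
    """Replace each precomposed character by its component characters."""
    out = []
    for ch in seq:
        out.extend(DECOMPOSITION.get(ch, [ch]))
    return out


def canonical_order(seq):
    """Decompose, then exchange adjacent pairs (A, B) wherever
    p(B) != 0 and p(A) > p(B), until no exchange applies."""
    chars = decompose(seq)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a, b = chars[i], chars[i + 1]
            if priority(b) != 0 and priority(a) > priority(b):
                chars[i], chars[i + 1] = b, a
                swapped = True
    return chars


def equivalent(x, y):
    """Two sequences are equivalent if their canonical orderings match."""
    return canonical_order(x) == canonical_order(y)
```

Under these assumed priorities, `equivalent(["a", "underdot", "diaeresis"], ["a-diaeresis", "underdot"])` holds, while `equivalent(["a", "breve", "diaeresis"], ["a", "diaeresis", "breve"])` does not, because marks of equal priority are never exchanged.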
Examples:
a + underdot + diaeresis | equals | a + diaeresis + underdot |
a + diaeresis + underdot | equals | a + underdot + diaeresis |
Since underdot has a larger non-spacing priority than diaeresis, the algorithm returns the a, then the diaeresis, then the underdot. However, since diaeresis and breve have the same non-spacing priority (because they interact typographically), they are not rearranged:
a + breve + diaeresis | does not equal | a + diaeresis + breve |
a + diaeresis + breve | does not equal | a + breve + diaeresis |
Thus we get the following results when applying the algorithm. If the results compare as equal, then the originals are equivalent.
Original | Decompose | Sort | Result |
a-diaeresis + underdot | a + diaeresis + underdot | | a + diaeresis + underdot |
a + diaeresis + underdot | | | a + diaeresis + underdot |
a + underdot + diaeresis | | a + diaeresis + underdot | a + diaeresis + underdot |
a-underdot + diaeresis | a + underdot + diaeresis | a + diaeresis + underdot | a + diaeresis + underdot |
a-diaeresis + breve | a + diaeresis + breve | | a + diaeresis + breve |
a + diaeresis + breve | | | a + diaeresis + breve |
a + breve + diaeresis | | | a + breve + diaeresis |
a-breve + diaeresis | a + breve + diaeresis | | a + breve + diaeresis |
Characters have the same priority if they interact typographically; different priorities if they do not. Enclosing characters have the priority of base characters.
Note: Base characters are never reordered relative to one another, so the amount of work is limited by the number of non-spacing marks in a row.
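One way to exploit this observation is to sort each maximal run of non-null priority characters independently, using a stable sort so that marks of equal priority keep their original order. The sketch below assumes the input is already decomposed and reuses hypothetical priority values (not the actual Appendix D values); `reorder_runs` is an illustrative name, not part of the standard. It is intended as an example of an optimized but equivalent formulation, not as a prescribed implementation.

```python
from itertools import groupby

# Hypothetical priorities for illustration; unlisted characters get 0 (null).
PRIORITY = {"diaeresis": 2, "breve": 2, "underdot": 3}


def priority(ch):
    return PRIORITY.get(ch, 0)


def reorder_runs(chars):
    """Stable-sort each maximal run of non-null priority characters.

    Base characters (priority 0) act as barriers and are never moved, so
    the work for each run depends only on the number of marks in it.
    """
    out = []
    # groupby splits the sequence into alternating runs of marks and bases.
    for is_mark, run in groupby(chars, key=lambda ch: priority(ch) != 0):
        run = list(run)
        if is_mark:
            run.sort(key=priority)  # list.sort is stable in Python
        out.extend(run)
    return out
```

Because the exchange rule only swaps a pair when the left priority is strictly greater, repeated exchanges amount to a stable sort within each run, which is why the two formulations give the same result.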
This algorithm establishes the canonical equivalence of two sequences of characters. For example, it establishes that o + diaeresis is canonically equivalent to ö. This should not be confused with language-specific collation or matching, which may add further information. For example, in Swedish, ö is treated as a completely different letter from o, collated after z. In German, ö is weakly equivalent to oe, and collated with oe. In English or French, ö is just an o with a diacritic that indicates that it is pronounced separately from the previous letter (as in coöperate), and is collated with o.
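The distinction can be illustrated with Python's standard `unicodedata` module. Note the assumptions: that module implements the normalization forms of later versions of the Unicode Standard, and its combining classes are not the Appendix D priorities of this document, so it serves here only as a convenient stand-in. The helper name `canonically_equal` is hypothetical.

```python
import unicodedata

precomposed = "\u00f6"   # ö as a single precomposed character
decomposed = "o\u0308"   # o followed by a combining diaeresis


def canonically_equal(x, y):
    """Compare two strings after canonical decomposition and reordering
    (NFD, as defined by later versions of the Unicode Standard)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


# The raw code point sequences differ, but they are canonically equivalent.
print(precomposed == decomposed)                    # False
print(canonically_equal(precomposed, decomposed))   # True

# Canonical equivalence says nothing about language-specific matching:
# a German collator may relate ö to "oe", but the strings are not equivalent.
print(canonically_equal(precomposed, "oe"))         # False
```

Any Swedish, German, or English treatment of ö would be layered on top of this comparison by a collator; it is not part of the equivalence itself.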
Collation sequences may not require correct sorting outside of a given domain, and a collator may choose not to invoke the canonical equivalence algorithm for excluded characters. For example, an English collator may not need to sort Cyrillic letters properly: in that case, it does not have to maximally decompose and reorder Cyrillic letters, and may simply sort them according to Unicode order.