Canonically Equivalent Shortest Form
It is possible to do fast searching that takes into account linguistic equivalents, using a modified Boyer-Moore algorithm. For more information on such matching, see Searching and Matching in UTS #10: Unicode Collation Algorithm.
In order to do this, it is important to determine a canonically equivalent shortest form (CESF). This is usually, but not always, the NFC form. ICU has a CanonicalIterator class that can be used to get all the canonically equivalent variants; one could iterate through them and select the shortest. However, it was not designed for this particular task -- more efficient code can be designed that uses a modified NFC algorithm. This document presents some thoughts on how to do this.
In an annex, it also presents a Quick&Dirty Approach using the CanonicalIterator.
Note that here when we say "decomposition" or "equivalence", we always mean canonical. Counts and other information are as of Unicode 5.1.
We want uniqueness for the CESF, and also for it to be the same as NFC where possible. Note that CESF does not have the same stability constraints as Unicode Normalization Forms - it is intended purely for internal processing.
We start by looking at the two kinds of cases where NFC is not shortest, as discussed on nfc-faq.
Breakage. The character breaks apart in NFC, as in the breakage list. There are 85 such characters.
There is only one case where the first character is not a starter:
U+0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS => U+0308 ( ̈ ) COMBINING DIAERESIS + U+0301 ( ́ ) COMBINING ACUTE ACCENT
The non-first characters in the NFC form of other breakage characters are called trail-breakers. There are 24 such characters.
Interference. Some characters interfere with forming the shortest form. These are called irritations. There are 45 such characters.
Luckily, all of these cases are rather rare in practice. The most common ones are the Indic nuktated forms.
An example of interference is the sequence U+1EA5 ( ấ ) LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE + U+0328 ( ̨ ) COMBINING OGONEK. What happens is that the U+0105 ( ą ) LATIN SMALL LETTER A WITH OGONEK forms first, and prevents the formation of the shorter composite. (See below for a detailed example.)
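Both kinds of cases can be observed directly with a stock NFC implementation. The following sketch uses Python's unicodedata module (it assumes a Unicode database at least as recent as 5.1; any modern interpreter qualifies):

```python
import unicodedata as ud

# Breakage: U+0958 DEVANAGARI LETTER QA is a composition exclusion,
# so NFC expands the single character into KA + NUKTA.
assert ud.normalize("NFC", "\u0958") == "\u0915\u093C"

# U+0344 is also excluded (its decomposition starts with a non-starter),
# so its NFC form is DIAERESIS + ACUTE.
assert ud.normalize("NFC", "\u0344") == "\u0308\u0301"

# Interference: a-with-circumflex-and-acute + ogonek (2 code points).
short = "\u1EA5\u0328"
nfc = ud.normalize("NFC", short)
# In NFD the ogonek (ccc 202) sorts before the ccc-230 marks, so it
# composes with 'a' first and blocks the shorter composite.
assert nfc == "\u0105\u0302\u0301"   # 3 code points
assert len(nfc) > len(short)         # NFC is not the shortest form
```

In both breakage cases NFC lengthens the text; in the interference case a canonically equivalent two-code-point form exists that NFC cannot reach.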
Use the basic NFC algorithm, with the following changes.
Do not apply the composition exclusions. That prevents all but one of the breakages above.
Traverse through the segment, as in NFC.
However, wherever a composition with the starter is possible, recursively process both the result with the composition and the result without.
If the non-starter is a lead-trail, see if it can be combined with a following character. If so, recursively process both the result with the composition, and the result without. (This is to prevent the other breakage mentioned above.)
The recursion is always depth-first, with composition being the first branch and uncomposed choice being the second.
Among the shortest final results, pick the one that is NFC if there is one. Otherwise, pick the first one (in branch order).
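The branching steps above can be sketched for a single segment as follows. This is only an illustration: the names (`pair_compose`, `cesf_segment`) are mine, and single-pair NFC is used as a stand-in for the primary composite table, which means the composition exclusions are still honored here (the full algorithm would disable them), so only the interference branching is demonstrated:

```python
import unicodedata as ud

def pair_compose(starter, mark):
    """Try to compose a starter with one combining mark. Stand-in for
    the primary composite table; note it still honors the composition
    exclusions, unlike the full CESF algorithm."""
    c = ud.normalize("NFC", starter + mark)
    return c if len(c) == 1 else None

def cesf_segment(segment):
    """Depth-first search over compose/don't-compose choices for one
    segment (a starter plus combining marks), per the steps above."""
    seg = ud.normalize("NFD", segment)
    starter, marks = seg[0], tuple(seg[1:])
    results = []

    def rec(st, pending, i):
        if i == len(pending):
            results.append(st + "".join(pending))
            return
        m = pending[i]
        cc = ud.combining(m)
        # A mark is blocked if an earlier remaining mark has ccc >= its own.
        blocked = any(ud.combining(p) >= cc for p in pending[:i])
        comp = None if blocked else pair_compose(st, m)
        if comp:
            # Branch 1 (first, per depth-first rule): compose, rescan.
            rec(comp, pending[:i] + pending[i + 1:], 0)
        # Branch 2: leave the mark uncombined.
        rec(st, pending, i + 1)

    rec(starter, marks, 0)
    shortest = min(len(r) for r in results)
    nfc = ud.normalize("NFC", segment)
    # Among the shortest results, prefer NFC; else first in branch order.
    for r in results:
        if len(r) == shortest and r == nfc:
            return r
    return next(r for r in results if len(r) == shortest)
```

On the interference example, `cesf_segment("\u1EA5\u0328")` skips the ogonek composition in one branch, composes the circumflex and then the acute, and returns the two-code-point form U+1EA5 + U+0328, while plain NFC yields three code points.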
In the following case, each row represents looking at that character in the sequence. The highlighted cells are where a new composition is produced.
There is only one shortest form (the final answer in column 2), so it is the result. If the result in column #2 had not been present, then column #1 would have been the first shortest value (and also NFC), so it would have been chosen as the result.
The above lends itself to many optimizations. Although generating a tree of results seems scary, in practice there can only be a few branches. Moreover, one can easily prune the results along the way. One only needs to recurse, for example, if there is a possibility that there is an interfering character. So a small list of such characters can be tested for to avoid excess processing. Where data size is an issue, the majority of the data can be shared with the NFC implementation; only the breakage cases need to have a special flag.
Implementation can also be done by postprocessing a string in NFC:
Make a "problem list" of characters that can occur in a case of breakage or interference. This list will need to be generated for each version of Unicode, but can be done at build time. It includes all of the characters in the breakage list, like a nukta character, plus all the characters that are formed in the case of interference (like U+0105 ( ą ) LATIN SMALL LETTER A WITH OGONEK plus circumflex plus grave, from the above case).
After the string is in NFC, take another pass through, looking for any starter sequence where the starter and at least one non-starter are in the problem list. Because the non-starters are quite rare in practice (see nfc-faq) a quick scan can be done for them, backing up to check the preceding starter if one is found. In addition, in each sequence of non-starters, check for the sequence:
All of these problem starter sequences should be relatively rare in practice, allowing for a very quick scan. If we find a candidate, find its enclosing segment (see below) and do the above processing.
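The scan in the postprocessing approach can be sketched as below. The problem list here is a tiny hand-picked stand-in (a real one would be generated from the UCD at build time, as the text says), and the function name is mine:

```python
import unicodedata as ud

# Illustrative stand-in for the generated problem list: characters
# formed in the interference case above plus one breakage case.
PROBLEM = {"\u0105", "\u0302", "\u0301", "\u0915", "\u093C"}

def problem_segments(nfc_text):
    """Scan an NFC string for starter sequences in which the starter and
    at least one non-starter are on the problem list; yield (start, end)
    spans of candidate segments for reprocessing."""
    i, n = 0, len(nfc_text)
    while i < n:
        # Quick scan: look for a problem-list non-starter.
        if ud.combining(nfc_text[i]) != 0 and nfc_text[i] in PROBLEM:
            # Back up to the preceding starter.
            start = i
            while start > 0 and ud.combining(nfc_text[start]) != 0:
                start -= 1
            # Extend forward to the end of the non-starter run.
            end = i
            while end < n and ud.combining(nfc_text[end]) != 0:
                end += 1
            if nfc_text[start] in PROBLEM:
                yield (start, end)
            i = end
        else:
            i += 1
```

For example, `problem_segments(ud.normalize("NFC", "\u1EA5\u0328"))` flags the whole three-character sequence U+0105 U+0302 U+0301 for reprocessing, while ordinary text yields nothing.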
See also the Quick&Dirty Approach. It won't give precisely the same results, but should be sufficient.
Annex: Types of Characters
For background information, here are some relevant types of characters. Because Hangul is unproblematic, we can exclude that from the counts below for simplicity.
A character is decomposable when it has a decomposition. There are 2,043 such characters.
A character is lead-only when it occurs in the multi-character decomposition of some character, but only ever as the first character. There are 295 such characters.
Exactly one of these is a non-starter: U+0F71 ( ཱ ) TIBETAN VOWEL SIGN AA
A character is trail-only when it occurs in the multi-character decomposition of some character, but never as the first character. There are 75 such characters.
Eighteen of these are starters.
A character is lead-trail when it can be both leading and trailing. There is only one such character, a non-starter.
A character is reorderable if it is not a starter, and none of the above. There are 444 such characters. These characters don't combine, but they do reorder.
A character is stable when it is a starter, and is none of the above. These are the rest of the million-odd code points in Unicode (excluding Hangul).
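These classifications can be approximated from the raw decomposition field in Python's unicodedata. The caveats: this uses one-level decompositions rather than full ones, counts will reflect the interpreter's Unicode version rather than 5.1, and Hangul syllables are naturally excluded because their decompositions are algorithmic, not in the data field:

```python
import unicodedata as ud

def classify_positions(max_cp=0x110000):
    """Collect characters occurring in multi-character canonical
    decompositions, split by position (first vs. non-first). Uses the
    raw one-level decomposition field, not the full decomposition."""
    leads, trails = set(), set()
    for cp in range(max_cp):
        d = ud.decomposition(chr(cp))
        if not d or d.startswith("<"):   # skip compatibility mappings
            continue
        parts = [chr(int(h, 16)) for h in d.split()]
        if len(parts) < 2:
            continue
        leads.add(parts[0])
        trails.update(parts[1:])
    lead_only = leads - trails
    trail_only = trails - leads
    lead_trail = leads & trails
    return lead_only, trail_only, lead_trail
```

Under this approximation, U+0F71 shows up as a lead-only non-starter as the text states, and U+0308 (leading only via the excluded U+0344) appears on both sides.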
Logically speaking, we'll assume that the string is already transformed into NFD.
We can logically break up any NFD string into a set of segments, where each segment starts with a lead-only starter or stable character, and has zero or more other characters afterward. There is also the first segment, which may be "headless". Each segment does not interact with any characters before it or after it, in terms of canonical equivalence. Thus if each segment is mapped to the CESF, then the string as a whole will be in CESF. So then the problem devolves to mapping the segments.
Luckily, we can do this with the same mechanism as in NFC, looking at sequences that are starter + non-starter. [TBD: show why this is sufficient.]
Here is a relatively easy way to get a shortest form that would probably suffice for collation in terms of function and performance: function, because it appears that we don't actually need uniqueness, just the length of the shortest form; performance, because the bad cases are so rare.
Convert to NFC.
Traverse the NFC string by starter-sequence.
If the segment looks problematic, then use the CanonicalIterator to get a shortest form, and substitute.
The problematic segments are those of more than one character that meet any of the following conditions: