Custom Normalization

Goals

ICU 4.2 offers standard Unicode normalization (all standard forms) for the latest Unicode version, as well as for Unicode 3.2 (needed for IDNA [IDNA2003]). Sometimes, normalization is part of a specification that also includes a mapping step and a validation step. For example, NFKC may be combined with case folding, and IDNA/NamePrep/StringPrep combines a custom mapping table with Unicode 3.2 NFKC and both simple (forbidden code points) and context-sensitive validation. IDNA2008 and the related UTS #46 add further mapping+normalization combinations in common use. Mac OS X uses custom normalization in file names, and other systems need to be able to match it for compatibility (e.g., for file synchronization [rsync]).

In the past, we have done some of this by

We want to be able to combine a custom mapping with Unicode normalization into one API plus a loadable data file, for better combined performance and for providing a build tool for such data. To minimize additional code and to get good test coverage, standard normalization code should be rerouted to the new code.

We would like

The new data file might also carry optional UnicodeSets, for example:

Status

This has been implemented in ICU 4.4, with the Normalizer2 API (in Java, C++ and C) replacing almost all of the old API. See the User Guide Normalization chapter for the new documentation. This design doc remains, with motivations, decisions and details of the new data structures not documented elsewhere.

ICU 49 modifies the file format slightly (formatVersion 2) in support of getRawDecomposition(c) and composePair(a, b) (ticket #8804), and adds a small amount of data for FCD (ticket #8942).

ICU 60 modifies the file format (formatVersion 3) to put more information directly into the per-character "norm16" trie lookup value. NFC/NFKC/... boundary detection is much simpler, and many mappings are possible without entering the slow path.

The current implementation does not support storing UnicodeSet data in normalization data files.

New API

class Normalizer2: unmodifiable/immutable instances with all-virtual public methods

See the Normalizer2 API (in Java, C++ and C) implemented in ICU 4.4 and above.
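For illustration, a minimal C++ usage sketch against that API (ICU 4.4 names; later versions add convenience getters such as getNFCInstance()). Error handling is abbreviated:

    #include "unicode/normalizer2.h"
    #include "unicode/unistr.h"
    #include "unicode/unorm2.h"
    #include "unicode/utypes.h"

    using icu::Normalizer2;
    using icu::UnicodeString;

    void nfcExample() {
        UErrorCode errorCode = U_ZERO_ERROR;
        // Singleton instance, owned by ICU; do not delete.
        const Normalizer2 *nfc =
            Normalizer2::getInstance(NULL, "nfc", UNORM2_COMPOSE, errorCode);
        if (U_FAILURE(errorCode)) { return; }

        UnicodeString src(u"A\u030A");  // "A" + COMBINING RING ABOVE
        UnicodeString composed = nfc->normalize(src, errorCode);        // U+00C5
        UBool already = nfc->isNormalized(src, errorCode);              // FALSE
        UNormalizationCheckResult qc = nfc->quickCheck(src, errorCode); // UNORM_MAYBE
        (void)composed; (void)already; (void)qc;
    }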

Not currently implemented is a getUnicodeVersion(UVersionInfo v) const function. The data is available, but there was no obvious use case, since the normalizer's Unicode version is normally the same as the character property Unicode version. (This might not be true for custom data.)

Filtered Normalization

We need to support Unicode 3.2 normalization, and it may be useful to restrict normalization to other sets of characters. While building this into the core normalization code (as in ICU 4.2 and below) is possible and allows single-pass operation, it significantly complicates the code and probably slows it down slightly for normal, unrestricted operation.

Instead, we implement this via UnicodeSet::span(): We (logically) split the input into spans of text covered by the set and spans of text outside the set. Each in-the-set span is normalized; the others are simply appended.
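A conceptual sketch of that split, using UnicodeSet::span() over the raw buffer (this is roughly what the later FilteredNormalizer2 does, simplified and without its optimizations):

    #include "unicode/normalizer2.h"
    #include "unicode/uniset.h"
    #include "unicode/unistr.h"
    #include "unicode/utypes.h"

    using icu::Normalizer2;
    using icu::UnicodeSet;
    using icu::UnicodeString;

    // Sketch: normalize only the spans covered by the (frozen) filter set,
    // copy everything else verbatim.
    UnicodeString filteredNormalize(const UnicodeString &src,
                                    const Normalizer2 &n2,
                                    const UnicodeSet &set,
                                    UErrorCode &errorCode) {
        UnicodeString result;
        int32_t i = 0;
        while (i < src.length() && U_SUCCESS(errorCode)) {
            // Length of the next run that lies entirely inside the set.
            int32_t inSet = set.span(src.getBuffer() + i, src.length() - i,
                                     USET_SPAN_SIMPLE);
            if (inSet > 0) {
                result.append(
                    n2.normalize(src.tempSubStringBetween(i, i + inSet), errorCode));
                i += inSet;
            }
            // Length of the following run that lies entirely outside the set.
            int32_t outOfSet = set.span(src.getBuffer() + i, src.length() - i,
                                        USET_SPAN_NOT_CONTAINED);
            result.append(src, i, outOfSet);  // appended without normalization
            i += outOfSet;
        }
        return result;
    }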

We could provide a wrapper class for this, either a simple one that only supports filtered normalization itself, or one that implements all of the new API.

If we needed better performance, then we could create a dedicated data file that combines the Unicode 3.2 normalization data (including corrigenda) with the NamePrep mappings.

(ICU 4.4 provides the FilteredNormalizer2 class and implements the old API's UNICODE_3_2 option via this class and a UnicodeSet("[:age=3.2:]").)
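For example, a rough sketch of Unicode 3.2 NFC in terms of these classes (the set and the filtered instance would normally be created once and reused):

    #include "unicode/normalizer2.h"
    #include "unicode/uniset.h"
    #include "unicode/unistr.h"

    using icu::FilteredNormalizer2;
    using icu::Normalizer2;
    using icu::UnicodeSet;
    using icu::UnicodeString;

    // Unicode 3.2 NFC on top of standard NFC, via a filter set.
    UnicodeString nfcUnicode32(const UnicodeString &src, UErrorCode &errorCode) {
        const Normalizer2 *nfc =
            Normalizer2::getInstance(NULL, "nfc", UNORM2_COMPOSE, errorCode);
        UnicodeSet unicode32(UnicodeString(u"[:age=3.2:]"), errorCode);
        if (U_FAILURE(errorCode)) { return src; }
        unicode32.freeze();  // immutable and thread-safe
        FilteredNormalizer2 nfc32(*nfc, unicode32);
        return nfc32.normalize(src, errorCode);
    }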

Old and New Implementation Details

The old normalization data format (unorm.icu, ca. 2001..2009) uses three data structures for normalization: A trie for looking up 32-bit values for every code point, a 16-bit-unit array with decompositions and some other data, and a composition table (16-bit-unit array, linear search list per starter). The data is combined for all 4 standard normalization forms: NFC, NFD, NFKC and NFKD.

In addition, there are several more data structures, for a precomputed FCD table (used in collation and in canonical case comparisons) and for various normalization-related properties, even though those don't directly contribute to the actual normalization processing. In ICU4C, the normalization data has been hardcoded in a .c file for a few versions to avoid mutexing in the static API functions.

For a custom normalization table, it does not make sense to combine canonical and compatibility mappings: Any one custom table will have one set of mappings, not two alternatives. Therefore, it makes sense to split the standard normalization data into two data files, one for NFC/NFD and one for NFKC/NFKD. While this may grow the total data size, we think we can make each data file substantially smaller, not only by omitting data specific to the other forms, but also by simplifying the data structure itself which becomes feasible when alternatives need not be included.

It would also be possible to split the decomposition data (sufficient for NFD and for the quick check and decomposition parts of NFC) from the composition data (used for recomposing NFD sequences), and to share the standard composition data among several custom data files rather than duplicating it. While this seems attractive at first glance, we decided against it: Splitting off the composition data would 1. duplicate some of the data between decomposition and composition tables, because both need data for canonical combining classes (ccc) and for whether a character combines with a preceding one; 2. require a separate trie (or another fast lookup table), which would add some 6kB of overhead; and 3. make application-specific data files that rely on the standard composition data fail to work with a new version of ICU that has normalization data for a newer Unicode version. Standard Unicode normalization has a small number of compositions, taking up maybe 5kB in a space-efficient data structure that also stores decomposition data.

We will need new API for custom normalization. It should have real service objects, rather than the current Normalizer's static worker functions. (Done in ICU 4.4.)

With the new API, we should consider reversing the hardcoding of the normalization data in C source code. The old API functions would then have to go through a mutex, but the new API would be available to avoid this. (Done in ICU 4.4.)

The new code should be able to work with either nfc.nrm or nfd.nrm data files to satisfy requests for both ("nfc"/decompose) and ("nfd"/decompose). Same for NFKC/NFKD. This should be hardcoded only to these names of the standard Unicode normalization forms, and not apply to custom tables. (Not done: NFD uses "nfc"/decompose and NFKD uses "nfkc"/decompose.)
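That is, with the current behavior the decomposition forms are obtained as the decompose mode of the composing forms' data files, for example:

    #include "unicode/normalizer2.h"

    UErrorCode errorCode = U_ZERO_ERROR;
    // NFC and NFD both come from nfc.nrm; NFKC and NFKD both come from nfkc.nrm.
    const icu::Normalizer2 *nfd =
        icu::Normalizer2::getInstance(NULL, "nfc", UNORM2_DECOMPOSE, errorCode);
    const icu::Normalizer2 *nfkd =
        icu::Normalizer2::getInstance(NULL, "nfkc", UNORM2_DECOMPOSE, errorCode);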

We should probably continue to store a dedicated FCD table in nfc.nrm, but omit it from other files. If FCD operations are requested using a file that does not have the dedicated, pre-built table, then the runtime code can build one on the fly. (Not done: ICU 4.4 always builds the FCD trie at runtime, but only when it is first needed.)

The old implementation works with an optional set of code points "excluded from normalization", that is, making those code points normalization-inert despite what the mapping table says. We have a one-bit option in the public API to request Unicode 3.2 normalization which uses such an "exclusion set". This is rarely used (practically only in IDNA/StringPrep) and complicates the implementation. Instead, we plan to implement "filtered normalization" on top of standard normalization. There is a section about this near the end of this page. (Done in ICU 4.4.)

Main Data Lookup

ICU 4.2 and earlier

The old data yields a 32-bit word per code point:

With the trivial redundancies removed for separate NFC/NFKC data (no second set of quick check flags, no redundant combines-back bit), this effectively uses 28 bits.

Analysis of the normalization algorithm and its ICU implementation shows that some combinations of these values cannot occur, and the number of bits can be reduced. It is particularly desirable to encode the most performance-sensitive data (especially for efficient quick checks) in a 16-bit word rather than a 32-bit word, to use a significantly smaller and slightly faster version of the trie.

Per starter that combines forward, the old and new data store a linear, sorted list of (combining-back code point, composite) pairs. It would be faster to use a hash table, but that would be significantly larger. The current strategy is to optimize for quick check YesYes characters (including ones with ccc≠0); the most common back-combining characters tend to have low code points and appear early in the composition list.
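A conceptual sketch of that lookup (hypothetical, unpacked in-memory representation; the real data is a packed 16-bit-unit list as described under "Additional data indexed by the trie value"):

    #include <utility>
    #include <vector>
    #include "unicode/utypes.h"

    // Hypothetical view of one forward-combining starter's composition list:
    // (combining-back code point, composite) pairs, sorted by code point.
    typedef std::pair<UChar32, UChar32> CompositionPair;

    // Linear search, as in the real data: frequent back-combining characters
    // have low code points and therefore appear early in the list.
    UChar32 composeWith(const std::vector<CompositionPair> &list, UChar32 trail) {
        for (size_t i = 0; i < list.size(); ++i) {
            if (list[i].first == trail) { return list[i].second; }
            if (list[i].first > trail) { break; }  // sorted: no match possible
        }
        return U_SENTINEL;  // no composition for this pair
    }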

Unicode Normalization: Features and Limitations

Possible and Impossible Combinations

Terminology:

The 16-bit values are arranged for maximum efficiency of quick checks which only drop out of a very tight loop when they see a No or Maybe, or ccc≠0.

A simple mapping to one code point can be stored directly in the lookup value, without extra data. This is only possible for NoNo characters because it only works for a one-way (one code point to one other code point) mapping.

ICU does not allow tailoring of Hangul/Jamo mappings and compositions, except to make the relevant characters completely inert.

MaybeNo is both forbidden and irrelevant:

NoYes is impossible: If it has no mapping, it will occur in NFC.

A YesNo only ever decomposes into a YesYes+MaybeYes sequence or a YesNo+MaybeYes sequence. That is, a YesNo's decomposition (A=B+C) decomposes further if and only if the first (B) of its two components has a decomposition.

A YesNo always has ccc=lccc=0.

Only a starter can combine-forward, therefore no character can have ccc≠0 and combine forward.

A NoNo can have any of its components decompose further, but this is only visible in the raw mappings. The regular mappings are fully decomposed.

NoNo with combine-forward is impossible: A one-way mapping prevents composition (which starts from NFD where there are no decomposable characters).

Per-character lookup values, .nrm formatVersion 3

Since ICU 60

Changes from version 2:

Possible combinations and their encoding:

The rows of the table, from bottom to top, are encoded with increasing 16-bit "norm16" values as noted in the last column. Per-row and per-row-group properties are determined via norm16 range checks.
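For illustration, the check pattern looks roughly like this (threshold names are modeled on the implementation's header fields, but the struct and exact ranges shown here are only a sketch; formatVersion 3 adds further thresholds and packs more properties into the value):

    #include "unicode/utypes.h"

    // Invented struct; the real thresholds are loaded from the data file header.
    struct Norm16Thresholds {
        uint16_t minYesNo;     // lowest value with a two-way (round-trip) mapping
        uint16_t minNoNo;      // lowest value with a one-way mapping
        uint16_t minMaybeYes;  // lowest value for combine-back (Maybe) characters
    };

    // Fast path of the composition quick check: Yes and ccc=0.
    inline UBool isCompYesAndZeroCC(uint16_t norm16, const Norm16Thresholds &t) {
        return norm16 < t.minNoNo;
    }

    // Decomposition quick check Yes: the character has no mapping.
    inline UBool isDecompYes(uint16_t norm16, const Norm16Thresholds &t) {
        return norm16 < t.minYesNo || norm16 >= t.minMaybeYes;
    }

    // NoNo characters: one-way mappings (including the algorithmic deltas).
    inline UBool hasOneWayMapping(uint16_t norm16, const Norm16Thresholds &t) {
        return t.minNoNo <= norm16 && norm16 < t.minMaybeYes;
    }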

Per-character lookup values, .nrm formatVersion 1 & 2

Version 1: ICU 4.4..4.8

Version 2: ICU 49..59

Possible combinations and their encoding:

The rows of the table, from bottom to top, are encoded with increasing 16-bit "norm16" values as noted in the last column. Per-row and per-row-group properties are determined via norm16 range checks.

The minYesNoMappingsOnly distinction was added in ICU 49, .nrm formatVersion 2.0. In formatVersion 1.0, the data for the two types of yesNo characters (combines-forward or not) was mixed, and the mapping's firstUnit bit 6 was the MAPPING_PLUS_COMPOSITION_LIST flag. See the firstUnit documentation below.

Additional data indexed by the trie value

(Implemented in ICU 4.4, .nrm formatVersion 1.0. Modified in ICU 49, .nrm formatVersion 2.0 and in ICU 60, .nrm formatVersion 3.0.)

"Extra data" per code point, if it has a mapping or if it combines-forward, is stored in 16-bit-unit arrays. The character's lookup value is an index into one of these arrays. It is probably handy to have two arrays, so that indexes can be allocated independently for the two ranges of 16-bit lookup values that are indexes into extra data.

Threshold values like minYesNo depend on the mapping data.

The mapping string is stored together with the trailing ccc (tccc) value, and also with the ccc and lccc values if they are not 0.

Mapping to an empty string is encoded as a regular mapping with length 0.

If both a mapping and a composition list are stored for a character (only possible for YesNo), the mapping comes first.

Optional mapping

Optional composition list

In the ICU implementation, it is ok to not store the ccc value directly in the lookup value for NoNo characters. When the quick check fails with YesNo, NoNo or MaybeYes, the surrounding sequence is decomposed, which does not use the original characters' ccc values. Composition then sees only YesYes and MaybeYes characters which do have their ccc values in the lookup value.

A composite that combines-forward has quick check flags YesNo, has a mapping, has ccc=0 (it's a starter) and lccc=0 (it composes from a starter plus another character) and has a composition list (it combines-forward).

Old vs. new: The old composition data uses combine-forward and combine-back indexes stored in the extra data next to the mapping. In the new data structure, the combine-forward index is replaced by appending the composition list after the mapping, and the combine-back index is replaced by searching in the list for the back-combining code point itself.

Multiple characters may share data when they have the same mappings. This is especially useful for mappings to the empty string and for case and other foldings.

Raw decomposition mappings

It is sometimes useful to have access to the raw/original decomposition mappings, without recursive decomposition.

Some of this data is computable from the current structure, by mapping and then recomposing until the original character would be written. However, this would be expensive and incomplete.

.nrm formatVersion 2 adds compressed data for raw mappings:

This required an incompatible data format change because in the formatVersion 1 "extra data" for each code point there was not a single unused bit. However, the first unit's bit 6 [MAPPING_PLUS_COMPOSITION_LIST] was underutilized. It was easy to subsume its actual use into bit 5 [MAPPING_NO_COMP_BOUNDARY_AFTER], freeing bit 6 for its new use as the MAPPING_HAS_RAW_MAPPING flag. As a result, access to the raw mapping is only slightly slower (due to the "compressed" format) than access to the normal mapping.
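The raw mappings are exposed via getRawDecomposition() and composePair(); for example:

    #include "unicode/normalizer2.h"
    #include "unicode/unistr.h"

    void rawMappingExample() {
        UErrorCode errorCode = U_ZERO_ERROR;
        const icu::Normalizer2 *nfc =
            icu::Normalizer2::getInstance(NULL, "nfc", UNORM2_COMPOSE, errorCode);
        if (U_FAILURE(errorCode)) { return; }
        icu::UnicodeString full, raw;
        nfc->getDecomposition(0x212B, full);    // ANGSTROM SIGN -> U+0041 U+030A
        nfc->getRawDecomposition(0x212B, raw);  // one step only -> U+00C5
        UChar32 composite = nfc->composePair(0x0041, 0x030A);  // U+00C5
        (void)composite;
    }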

Size vs. Speed

Minimal Mappings vs. Recursively Decomposed

The old data always stores decomposition mappings that cannot be decomposed any further. The builder (gennorm) recursively decomposes each "raw" input mapping with each other applicable mapping. As a result, decomposition at runtime is very fast.

For a smaller data file, it would be possible to store only the "raw" input mappings, without having recursively decomposed them. For example, the Angstrom character could have just the mapping to A-ring, and only A-ring would have the mapping to A+ring. For reasonable performance, one more bit could be stored with each mapping, indicating whether the mapping can be decomposed further. (Code optimization: 1:1 mappings could use a loop, not recursion.) There should also be a flag in the data file for whether all mappings are recursively resolved, or not.

This would especially help with the data size when compatibility mappings and case foldings are included. Many of those are one-way mappings to a single code point each (1:1 mappings).

Note: Extracting FCD data when the mappings are not already recursively decomposed would require full decomposition because the mapping's lccc and tccc values may not be those of the full decomposition.

Not-fully-decomposed 1:n mappings would significantly complicate the runtime code.

Even if the builder does not store recursively decomposed mappings, it should still compute them, to check for mapping cycles and perhaps to guard against large recursion depths.

(Implemented in ICU 4.4, formatVersion 1.0: Mappings are recursively decomposed, favoring performance. ICU 49, formatVersion 2.0, adds additional, compressed data to recover the raw mappings efficiently and with little size overhead.)
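A builder-side sketch of that recursive decomposition (hypothetical types; gennorm2 works on its own data structures, and a real builder would report an error for cycles rather than just bail out):

    #include <map>
    #include <vector>
    #include "unicode/utypes.h"

    // Hypothetical builder-side structures: each code point's *raw* mapping.
    typedef std::vector<UChar32> CpString;
    typedef std::map<UChar32, CpString> RawMappings;

    // Recursively decompose one code point, appending the result to dest.
    // The depth limit is a crude guard against cycles and very deep nesting.
    bool decomposeRecursively(const RawMappings &raw, UChar32 c,
                              CpString &dest, int depth) {
        if (depth > 100) { return false; }
        RawMappings::const_iterator it = raw.find(c);
        if (it == raw.end()) {
            dest.push_back(c);  // no mapping: the code point is its own decomposition
            return true;
        }
        const CpString &m = it->second;
        for (size_t i = 0; i < m.size(); ++i) {
            if (!decomposeRecursively(raw, m[i], dest, depth + 1)) { return false; }
        }
        return true;
    }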

Optimized 1:1 Mappings

The subset of NoNo characters with 1:1 mappings is the most attractive for compact storage because storing them like sequences carries relatively high overhead. Ideas:

Implemented in ICU 4.4, formatVersion 1.0: Small deltas of -0x40..+0x40 [where 0 is forbidden] for one-way mappings from ccc=0 characters are stored directly in the trie lookup value, see the table above with the 16-bit value ranges.
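Conceptually, the lookup value itself then encodes the mapping (illustrative names; centerNoNoDelta stands for the norm16 value representing delta 0, which is excluded):

    #include "unicode/utypes.h"

    // Illustrative: the norm16 value stores the delta biased around a center value.
    inline int32_t getAlgorithmicDelta(uint16_t norm16, uint16_t centerNoNoDelta) {
        return (int32_t)norm16 - centerNoNoDelta;  // -0x40..+0x40, never 0
    }

    // Applying the 1:1 mapping is then a simple addition, with no extra data access.
    inline UChar32 mapAlgorithmic(UChar32 c, int32_t delta) {
        return c + delta;
    }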

We should measure how much of the data size we can save, compared with the additional code complexity and lower performance to handle more variants. (Measured: With Unicode 5.2 NFKC + case folding, a simple, fairly small delta of ±0x40 reduced the data file size by 8.8%. Smaller deltas yielded smaller savings. A much larger delta of ±0x1000 only got to 9.3%, so ±0x40 was chosen for formatVersion 1.0.)

Mapping Table Text Files

Normalization data is built from text files. Basic syntax is supported starting in ICU 4.4. The supported syntax is documented in the User Guide Normalization chapter.
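For illustration, a small custom table in that syntax (invented sample data; see the User Guide for the authoritative description):

    # sample custom normalization data for gennorm2
    0308:230          # canonical combining class of U+0308
    00C5=0041 030A    # two-way (round-trip) mapping
    212B>0041 030A    # one-way mapping
    0133>0069 006A    # one-way mapping to "ij"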

Not implemented yet

The following syntax elements might be useful but are not yet supported.

It might be useful to be able to set the override mode, and to start a new "phase" within one source file where it is allowed to override mappings from previous phases:

* override previous  # none|any|previous

It might be useful to add syntax for a "filter" set and a validity set (optional). The syntax needs to indicate type, start and end of the set and allow multiple lines. Line breaks will be removed before passing the set pattern into the UnicodeSet constructor. If a filter set is defined, then the generator tool would ignore any data for code points outside that set.

* filter [:age=3.2:]

* filter [:^

*+         Hani

*+        :]

Some syntax might be useful for the Unicode version (optional). Multiple version specifications are ok only if they have the same value. The Unicode version is otherwise settable via a command-line option.

* unicode 5.2

It might be useful to add syntax for whether Hangul syllables and conjoining Jamo behave as defined in the Unicode standard. By default, they do so, and data for them is not allowed.

* hangul off

New gennorm2 Tool

gennorm generates text files (nfc.txt and nfkc.txt) from UCD files, for standard Unicode normalization. The output text files are checked into icu/source/data/unidata/norm2. gennorm has been moved to the tools repository tree.

gennorm2 is a new, installable (ICU 4.4) end-user tool in the icu repository tree, parsing the .txt files and generating .nrm files.

It should be easy to include the standard Unicode normalization ccc and composition data. One way is to always include the standard ccc data and have syntax for a flag to request standard compositions. (Rejected.)

Another, simpler way is for gennorm2 to take a list of mapping table files, and to provide standard files like ccc.txt, compose.txt, nfd.txt, nfkd.txt and casefold.txt that could be combined (with or without additional custom tables) in various combinations into one binary data file. This would also allow for a character to have different mappings in different files, and the later mapping would override the earlier one. gennorm2 should be able to also output a .txt file with all of the combined data, except without recursively resolved mappings, to keep two-way mappings in the file valid for input. (Done in ICU 4.4. Modification: The NFKC mappings cannot simply add to the NFC mappings because some characters with two-way NFC mappings have one-way NFKC mappings. Therefore, there are separate files that specify each normalization form's mappings.)
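Once gennorm2 has produced a binary file (say, a hypothetical mytable.nrm that ICU's data loading can find, e.g. in the ICU data directory or an application data package), it is used through the same API:

    #include "unicode/normalizer2.h"

    UErrorCode errorCode = U_ZERO_ERROR;
    // "mytable" names the custom mytable.nrm data file (hypothetical name).
    const icu::Normalizer2 *custom =
        icu::Normalizer2::getInstance(NULL, "mytable", UNORM2_COMPOSE, errorCode);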

We should make it easy to move StringPrep mappings from the .spp files into normalization .txt/.nrm files. (Not done [yet].)