Core Breakup

Here is a sketch of what we could do:

Refactor core (ICU4J) into 3 pieces. The motivation is to separate out everything that is involved in formatting (and parsing) into a separate project. So we'd have:

Format - Collate - Translit // higher level modules



    • Format contains formatting (and parsing) classes (from the old Core)

    • Core contains all the stuff that doesn't depend on any (substantive) language-specific format data. We make a CoreLocale that has all the functionality of ULocale other than formatting-dependent stuff. We widen out all the input ULocale parameters in ICU to use CoreLocale instead.

    • FormatCore contains stuff from the old Core needed by more than one of Format/Collate/Translit, with as little formatting info as we can - it is a bit of a hack, but necessary for compatibility. ULocale is the main item here, and really only needed for getAvailableLocales.

So the remaining Core has stuff like:

    • UCharacter/UProperty/UScript

    • Math stuff

    • CoreLocale (no display name, no available locales, etc)

    • LocaleData

    • PluralRules

    • UnicodeSet

    • Properties

    • ResourceBundles

    • BreakIterator

    • Calendar

    • BOCU/SCSU/PunyCode/IDNA/StringPrep

    • Bidi/Shaping

    • Charset detection

    • Character iterators

    • DigitList

    • Normalizer

    • Utilties (Freezable, Measure, MeasureUnit, VersionInfo,...)

    • UniversalTimeScale

    • MessageFormatBase


    • We also split UnicodeSet into two parts: CoreUnicodeSet and UnicodeSet, with a separate UnicodeSetFormat class for format/parsing.

    • UnicodeSet would inherit from CoreUnicodeSet, and have-a UnicodeSetFormat.

    • We'd widen our input parameters where possible to take a CoreUnicodeSet, and replace all the Core usages.

    • CoreUnicodeSet has no parsing, and only very simple formatting (lists of ranges). CoreUnicodeSet adds a couple of extra methods that use UCharacter properties directly, eg addAllIntProperty(int propEnum, int valueEnum). For constants like UnicodeSet foo = new UnicodeSet("[[:cn:][:co:]]"), someone (and all our Core code) could use these methods, eg

      • UnicodeSet foo = new UnicodeSet("[[:cn:][:co:]]")

      • =>

      • UnicodeSet foo = new UnicodeSet()

      • .addAllIntProperty(UProperty.GENERAL_CATEGORY, UCharacter.UNASSIGNED)

      • .addAllIntProperty(UProperty.GENERAL_CATEGORY, UCharacter.PRIVATE_USE);

From Doug:

The data chunks I've thought of splitting out are locale display name

data, time zone name data, and currency format data. Formatting

numbers ought to be pretty cheap, but the locale display names are

heavy. Once they're split out, the currency display names are next.

Formatting dates and times ought to be relatively cheap as long as you

don't need long time zone display names. Currently, to do either of

these, you drag along lots of stuff.

I'm not sure formatters even really need core property data.

UnicodeSet drags in a log of it, when really all that's needed is a

quick way to build a character set for some scanners. This is almost

always just a property list that gets turned into a set. These sets

could be generated at build time instead.

Besides this, there are some APIs that sit at a high level and call

down into multiple packages. GlobalizationPreferences and

StringSearch were the initial examples I hit. I put both into

Collation when I did the original refactoring but they should be

pulled into some 'client' module that depends on multiple other


I did some more work on MessageFormat, and it looks promising. I separated out an interface, FormatRegistry, and refactored an implementation that just does the old stuff (SimpleFormatRegistry). Later on it can be a more real version, where we can actually register new stuff, but for now it is equivalent in functionality. If this looks good, I'd like to go ahead with it.

In MessageFormat, I have:

static final FormatRegistry formatRegistry = (FormatRegistry) newInstanceOrNull("...SimpleFormatRegistry", null);

In Utility, I have


* Utility for refactoring

* @param className

* @return A new instance of that class, or null if none available.


public static Object newInstanceOrNull(String className, Object fallback) {

try {

return (Class.forName(className)).newInstance();

} catch (Exception e) {

return fallback;



Here is the new interface.

// internal for now

public interface FormatRegistry {


* Return a default Format for a given type of object.


* @param obj the input Object. For example, for an object of type Number, a NumberFormat would be appropriate to return.

* @param ulocale

* @return default format, or null if none available for that type of object.


public abstract Format getFormatForObject(Class classType, ULocale ulocale);


* Return a key, like "number", or "number,currency", or "number,#0,0#". If

* that key were passed into getFormat (with the same uLocale), then a

* format would be generated that would be equal to this one.


* @param format The format to generate a key for.

* @param ulocale

* @return


public abstract String getKey(Format format, ULocale ulocale);


* From a key of the form mainType, subType, return a format.


* @param mainType

* Guaranteed to be non-empty.

* @param subType

* May be empty or not. An empty subtype always works (if the mainType is valid).

* @param ulocale

* @exception IllegalArgumentException

* thrown if the mainType is not valid, or or the subType

* invalid for the mainType.

* @return


public abstract Format getFormat(String mainType, String subType, ULocale ulocale);