ICU 4.4


    • Unicode 5.2

    • CLDR 1.8

    • More compact data formats

    • ICU4J modularization


Tier 1

  • Regex using abstract text access APIs (UText), roll in work by Jordan Rose: #4521 (ensure perf OK, ensure UTF8 support)

  • Ensure that there are APIs providing access to all CLDR data: e.g. #4836, #5478, etc. (Google also interested) (Peter has CLDR task to enumerate the data that is missing; based on that we can file additional bugs and divide up the work)

  • Improved search capabilities (Peter to generate design doc), mainly asymmetric search: typing e matches e, é, è; typing é matches é but probably not e and certainly not è (#7093) (Google also interested). Other possibilities (lower priority) include:

    • Position-dependent matching? (e.g. Arabic HEH and TEH MARBUTA should match for a search when both are at the end of a word)

    • Use of search object distinct from collator? (Possible optimization, may not be necessary, not of interest to others)

  • Reduce ICU4C dynamically-allocated memory, especially for time zone data (more compact data formats may help with this): #6873, #6879 (Google also interested) (Peter will look at porting Yoshito's ICU4J work to C; this requires interpreting const in a "logical" way, so lazy loading is acceptable as long as it is thread-safe. This interpretation should be documented. Peter to coordinate with Andy on this)
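
The asymmetric search item above (#7093) can be illustrated with a minimal sketch. It is not the ICU design (that is pending Peter's design doc) and uses only the JDK's java.text.Normalizer. The class name AsymmetricMatchSketch and the method matches are hypothetical. The assumption is that an unaccented typed letter should accept any accents on the candidate, while an accented typed letter should require exactly the same accents. It compares whole strings rather than searching within text.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not an ICU class and not the design for #7093.
final class AsymmetricMatchSketch {

    // True for combining marks (general categories Mn, Mc, Me).
    private static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Whole-string asymmetric comparison of what the user typed against a candidate.
    static boolean matches(String typed, String candidate) {
        // NFD splits accented letters into a base letter followed by combining marks.
        int[] p = Normalizer.normalize(typed, Normalizer.Form.NFD).codePoints().toArray();
        int[] c = Normalizer.normalize(candidate, Normalizer.Form.NFD).codePoints().toArray();
        int i = 0;
        int j = 0;
        while (i < p.length && j < c.length) {
            // Base characters must always be identical.
            if (p[i] != c[j]) {
                return false;
            }
            i++;
            j++;
            // Collect the marks that follow each base character.
            int pStart = i;
            while (i < p.length && isMark(p[i])) {
                i++;
            }
            int cStart = j;
            while (j < c.length && isMark(c[j])) {
                j++;
            }
            int pMarks = i - pStart;
            if (pMarks == 0) {
                continue; // Unaccented typed letter: accept any marks on the candidate.
            }
            // Accented typed letter: the candidate must carry exactly the same marks.
            if (pMarks != j - cStart) {
                return false;
            }
            for (int k = 0; k < pMarks; k++) {
                if (p[pStart + k] != c[cStart + k]) {
                    return false;
                }
            }
        }
        return i == p.length && j == c.length;
    }
}
```

With this sketch, matches("e", "\u00e9") and matches("e", "\u00e8") are true, while matches("\u00e9", "e") and matches("\u00e9", "\u00e8") are false. A real implementation would work at the collation-element level and handle the "probably not" case as a configurable option.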

Tier 2

    • Number spellout format & parse support for CJK numbers, including in dates. Note: CLDR 1.7 added relevant capability per cldrbug:1927; is there anything else that needs to be done in ICU? (It may work if appropriate patterns are used; Peter will do some experiments.)

    • Support >2GB text length for search, regex, text break, encoding conversion, perhaps transliteration. Use of UText will provide appropriate interfaces for regex and RBBI with additional internal changes. #5451 is for the RBBI changes.

    • Encoding detection for a wider range of encodings, with some finer distinctions. For example, pure ShiftJIS text should return both ShiftJIS and cp932 with 100% confidence; text including cp932 extensions should also return both, but with lower confidence for ShiftJIS.

    • Additional conversion tables (not necessarily in default build). Don't need a ticket for this yet.

Already implemented on branch

    • Lenient parsing for numbers (#4942, #6109), date/time (#3579)


Tier 1

    • General

      • Spoof detection in ICU4J (#7094) (IBM has also expressed interest)

      • Complete BCP47 support (#6177, #6916, #6791, #6860)

      • Switch data formats to use UTrie2; requires UTrie2 in ICU4J as well (#7077)

      • (Apple) Provide APIs to access all CLDR data: #4836, #5478

      • Fully customizable data for objects, formatters, and other services; there are currently holes

    • Normalization/IDNA

      • Support custom normalization data (#7273)

      • Support UTS #46 (#7144) [via custom normalization data]

    • Formatting

      • Able to select numbering systems (#7160)

      • BigDecimal format/parse in ICU4C (#5193)

    • Footprint

      • Split locale display names, time zone names, currency names out of locale data (into separate data) (#7163)

      • Introduce Locale base class — only id, no display names (#7164)

      • Introduce MessageFormat base class — only string substitution (#7165)

      • Smaller "core" set (without formatting data, only manipulation/algorithms) (#7161)

      • Generalized cache management for ICU4C (#2863, #3035, #3118, #6029, #6030, #6031, #6099, #6708)

      • More compact collation (rules/binary) data

        • Compact collation tailoring syntax for lists of characters with same level difference (#7015)

        • Add import rule to collation tailoring syntax (#7023, CldrBug:2268)

      • Locale data filtering: display names for fewer codes

      • (Apple) Reduce C heap memory usage, especially for time zone data: #6873, #6879

Tier 2

    • General

      • Collation script reordering (#3984, CldrBug:2267)

      • Best match for locale IDs: #4712

      • Better C++ implementation (scoped_ptr, byte string class; see design/C++ page) (#7162)

      • Cast reduction — move methods into base

    • Formatting

      • Day divisions: more than just am/pm, e.g. Chinese 'noon', 'midnight' (#7150, CLDR#713) (Apple also interested in this; Peter has a CLDR bug)

      • More kinds of duration formatting (Apple also interested in this; Peter has a CLDR bug)

    • Footprint

      • Filter out locales with insufficient data
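
To make the "Best match for locale IDs" item (#4712) in the General group above concrete, here is a small sketch. It uses the JDK's RFC 4647 lookup (Locale.lookup, Java 8 and later) as a stand-in, not ICU's own matching algorithm, which may behave differently. The class LocaleBestMatchSketch, the method bestMatch, and the fallback parameter are hypothetical names for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; relies on the JDK's RFC 4647 lookup, not on ICU.
final class LocaleBestMatchSketch {

    // Returns the supported locale that best matches the requested priority list,
    // or the given fallback when nothing matches.
    static Locale bestMatch(String requested, List<Locale> supported, Locale fallback) {
        // Parses a priority list such as "fr-CA,en;q=0.5", ordered by weight.
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(requested);
        // Lookup progressively truncates each range (fr-CA, then fr) until a tag matches.
        Locale match = Locale.lookup(ranges, supported);
        return match != null ? match : fallback;
    }
}
```

For example, with supported locales en, fr and de, a request for "fr-CA,en;q=0.5" resolves to fr, and a request for "ja" falls through to the supplied fallback.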


  • General focus: Usability, Maintainability and Performance

    • Code and data maintainability improvements, e.g. separating time zone data from code.

    • Overriding/updating locale information in an ICU installation: #4597, #6633

    • Collation and string search service code cleanup: #4562

    • Misc layout bug fixes: #5589, #6625, #5431, #6113, #6182

    • Improved ICU performance and regression testing for selected service areas only, e.g. collation

    • Extended IETF BCP47 support: language/locale specification for HTTP/XML/OpenJDK

    • Lenient parsing, e.g. DateFormat. (Already implemented by Apple on branch)

    • Locale service SPI

    • JSR-310 Date and Time APIs

    • @provider multiple version support: calling old ICU service code through the new ICU API

    • Java 5 migration (ICU4J)

    • Supporting generics to match JDK APIs

    • ICU 4.4 will no longer support Java 1.4 or older versions

    • Java Logging support (ICU4J)

    • ICU Resource Bundle footprint optimization
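
As a rough illustration of what the "Extended IETF BCP47 support" item above means for a locale API, the sketch below parses BCP 47 language tags into their subtags. It uses the JDK's own Locale.forLanguageTag (Java 7 and later) purely as an example of the idea; it is not ICU4J's ULocale API. The class Bcp47Sketch and the method describe are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration using the JDK's BCP 47 support, not ICU4J's ULocale.
final class Bcp47Sketch {

    // Breaks a BCP 47 tag into language, script, region and the "co" (collation) keyword.
    static String describe(String tag) {
        Locale loc = Locale.forLanguageTag(tag);
        return "language=" + loc.getLanguage()
                + " script=" + loc.getScript()
                + " region=" + loc.getCountry()
                + " collation=" + loc.getUnicodeLocaleType("co");
    }

    // Parses a tag and regenerates it from the resulting locale.
    static String roundTrip(String tag) {
        return Locale.forLanguageTag(tag).toLanguageTag();
    }
}
```

For the tag "zh-Hant-TW" the language is zh, the script is Hant and the region is TW; for "de-DE-u-co-phonebk" the Unicode locale extension carries the collation type phonebk, and the tag survives a round trip unchanged.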