ICU - International Components for Unicode

Download ICU 59

ICU is the premier library for software internationalization, used by a wide array of companies and organizations.

Release Overview

ICU 59 upgrades to emoji 5.0 data, together with segmentation and bidi updates from Unicode 10 beta. The Java code for number formatting has been completely rewritten for reliability and performance. There is also a new case mapping API for styled text, and a technology preview of enhanced language matching.

The source code repository has been reorganized, creating a combined trunk with icu4c and icu4j (and tools) folders. (#12800)

There are major changes for ICU4C that require changes in projects using ICU. See below for details.

Please use the icu-support mailing list and/or ICU Trac for error reports.

List of tickets fixed in ICU 59

Common Changes

- Emoji 5.0 data (tickets #12900 & #13058)
  - Includes bidi data files from Unicode 10 beta.
  - Includes segmentation data files and rules from Unicode 10 beta and CLDR 31.0.1.
  - Does not yet include the Emoji_Component property.
  - Otherwise ICU 59 continues to use Unicode 9 data.

- CLDR 31.0.1
  - Including updates for emoji 5.0, for example local names for England, Scotland, and Wales.
  - GMT and UTC are no longer unified, and CLDR provides distinct UTC display names, avoiding confusion with standard (winter) time in Britain.
  - See the CLDR download page for other CLDR features and migration issues in CLDR v31.

- New case mapping API (C++ & Java classes CaseMap) supports styled text (#12410 & #12988)

ICU4C Specific Changes

- ICU4C now uses and requires C++11 language features and libraries.
- ICU4C has also moved to char16_t as the type for UTF-16. This is a breaking change. Please see the detail section below.
- ICU4C source code files are now in UTF-8 and use non-ASCII characters, like ICU4J.
  - For Microsoft Visual Studio, the /utf-8 option is set in ICU's .vcxproj files.
  - For most other platforms, the build-time environment must be set to a UTF-8 locale.
  - For compilers that cannot handle UTF-8 source code, an escaper replaces non-ASCII characters with \uhhhh and \U00hhhhhh sequences.
  - As before, the runtime charset need not be UTF-8.
  - On platforms where the native charset is not UTF-8, including Windows, check the settings of your editor before working with the source.

ICU4J Specific Changes

- The Java code for number formatting (DecimalFormat etc.) has been rewritten to fix many bugs, improve performance, add new capabilities (still in technology preview), and make the code more structured and maintainable. (#7467)
- There is a technology preview of enhanced language matching (ticket #12812).

Known Issues

- ICU4C
  - ICU does not build with UCONFIG_NO_NORMALIZATION turned on. With UCONFIG_NO_FILE_IO turned on, it builds but the tests do not pass (#13069).
  - Platform Issues:
    - IBM z: Test failures (#13095)
    - Solaris:
      - Unicode issues with some test files (#13096) (tests are not able to build)
      - Time zone detection errors (#13097)
    - Windows:
      - When using "@compat=host", 6 locales have date and number formatting issues (#13119).
      - The UWP version of ICU will always fall back to the "en_US" locale (#13217).
      - The function uprv_convertToPosix has a pointer to a stack local from a destroyed scope (#13263).
      - Time zone detection issues on Windows 7 with non-English UI (#13826). Fixed in the "maint-59" branch.
    - Windows using the ICC compiler:
      - Compilation issues. A work-around is known and reported to succeed. (#13190)
      - Source File Encoding. The ICC compiler does not recognize the /utf-8 option. (#13251)
    - MinGW: Compilation and test issues. Fixed in trunk and a patch exists for version 59 (#13164).
- The .zip file is larger than it needs to be. This is because the new makedata_uwp target does not build using prebuilt data. (#13126)
- The .tgz file cannot be used to build with Cygwin/MSVC (#13139)

Migration Issues

Number Formatting (ICU4J)

The changes to number formatting can cause changes in behavior for some edge cases, which may affect "golden data" for some tests.

"#include what you use" (ICU4C)

Please "#include what you use" if possible. Unnecessary #includes are sometimes removed from ICU headers. This can break compilation of code that relies on indirect #includes. See https://include-what-you-use.org/

ICU4C char16_t

Issues listed below.

ICU4C char16_t

With the move to C++11, ICU4C has also moved to char16_t as the type for UTF-16 code units and string pointers.

This is a breaking change.

Why are we breaking your code?

- With C++11 finally supported by all ICU platforms, this is the first time that a C++ standard type is available for UTF-16, together with u"syntax for string literals" which have type const char16_t *.
- ICU's mission is to be the premier Unicode and i18n library, and we want ICU to “play nice” with the new standard C++ UTF-16 type and strings.

UChar typedef

ICU4C used to use the UChar typedef throughout. It is an unsigned 16-bit integer type.

The UChar typedef was compile-time-configurable, and its default definition depended on the platform. For example, it was usually defined to be uint16_t on Linux and macOS, but wchar_t=WCHAR on Windows (for ease of use with Windows APIs and libraries).

In other words, portable code could not rely on a fixed definition of UChar.

ICU4C library and C++ test code now always uses UChar=char16_t.

For callers of ICU, UChar is now a typedef for char16_t by default on all platforms, but it continues to be compile-time-configurable.

For convenience during the transition, there is also a new typedef OldUChar with the same default, platform-dependent type definition as ICU 58 UChar. OldUChar is not compile-time-configurable. (For that, continue to configure and use UChar.)

For details see the documentation for UChar and OldUChar in unicode/umachine.h.
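
The selection of a compile-time-configurable typedef like this can be sketched as follows. This is a simplified illustration, not the contents of unicode/umachine.h: the macro name UCHAR_TYPE is given from memory and should be checked against the header, and MyUChar and usesDefaultUChar() are hypothetical names chosen so the sketch does not clash with the real typedef.

```cpp
#include <cstdint>
#include <type_traits>

// Assumption: callers configure the type with a macro such as
//   -DUCHAR_TYPE=uint16_t
// on the compiler command line. Verify the macro name in unicode/umachine.h.
#if defined(UCHAR_TYPE)
typedef UCHAR_TYPE MyUChar;   // caller-configured type (hypothetical typedef name)
#else
typedef char16_t MyUChar;     // default for callers on all platforms
#endif

// Whatever type is chosen must stay bit-compatible with UTF-16 code units.
static_assert(sizeof(MyUChar) == 2, "UTF-16 code units must be 16 bits wide");

// True when no configuration macro overrides the default.
constexpr bool usesDefaultUChar() {
    return std::is_same<MyUChar, char16_t>::value;
}
```

Compiled without any configuration macro, usesDefaultUChar() is true. Compiled with a different 16-bit type configured, it becomes false while the size check still holds.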

char16_t in C

In C, char16_t and uint16_t are identical types. wchar_t is a distinct type even if it is a 16-bit type (and thus bit-compatible). No type conversion is needed between char16_t * and uint16_t *, but it is needed between either of these and 16-bit wchar_t *.

Binary compatibility of C APIs is preserved because char16_t, uint16_t, and 16-bit wchar_t are bit-compatible, and the precise types do not affect the exported linker symbols. (Unlike C++ function name mangling.)

ICU C APIs continue to be declared using UChar. If necessary, code calling ICU C API can be compiled with UChar=wchar_t, for example for Windows.

char16_t in C++

In C++, the three types char16_t, uint16_t, and wchar_t (if 16 bits wide) are bit-compatible but “distinct”. Their pointers do not convert implicitly to each other.
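
A minimal sketch of what "bit-compatible but distinct" means in practice, using only the standard library. The helper names sameBits() and roundTrips() are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// The types are distinct as far as the C++ type system is concerned...
static_assert(!std::is_same<char16_t, uint16_t>::value, "distinct types");
static_assert(!std::is_same<char16_t, wchar_t>::value, "distinct types");
// ...but char16_t and uint16_t have the same width.
static_assert(sizeof(char16_t) == sizeof(uint16_t), "same width");

// Copies the bits of a char16_t into a uint16_t and compares the value.
inline bool sameBits(char16_t c, uint16_t u) {
    uint16_t copy;
    std::memcpy(&copy, &c, sizeof copy);
    return copy == u;
}

// Pointers do not convert implicitly; an explicit cast is required.
inline bool roundTrips(const char16_t *p) {
    // const uint16_t *q = p;  // error: no implicit pointer conversion
    const uint16_t *q = reinterpret_cast<const uint16_t *>(p);
    return reinterpret_cast<const char16_t *>(q) == p;
}
```

The sketch deliberately avoids reading char16_t data through the uint16_t pointer, since that is where aliasing rules come into play.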

ICU C++ API has never been binary compatible from release to release. We strive to keep C++ API source-compatible, but for this change this is not possible in all cases.

Most ICU C++ API functions take and return UnicodeString values. No changes there.

UnicodeString constructors that used to take [const] UChar * now have overloads for char16_t *, uint16_t *, and 16-bit wchar_t *.

In some C++ functions (UnicodeString and elsewhere), UChar pointers are replaced with values of new pointer-wrapper classes Char16Ptr or ConstChar16Ptr which have implicit conversions from the bit-compatible raw pointer types and are trivially copyable/movable.
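
The idea behind such a pointer wrapper can be sketched with a simplified stand-in. ConstU16Ptr below is a hypothetical class written for this illustration; it is not ICU's ConstChar16Ptr, and the helper functions address() and isNull() are likewise invented.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical, simplified stand-in for a const pointer wrapper.
class ConstU16Ptr {
public:
    // Implicit conversions from bit-compatible raw pointer types.
    ConstU16Ptr(const char16_t *p) : p_(p) {}
    ConstU16Ptr(const uint16_t *p)
        : p_(reinterpret_cast<const char16_t *>(p)) {}
    ConstU16Ptr(std::nullptr_t) : p_(nullptr) {}

    const char16_t *get() const { return p_; }

private:
    const char16_t *p_;
};

// The wrapper holds a single pointer and can be copied like one.
static_assert(std::is_trivially_copyable<ConstU16Ptr>::value,
              "wrapper should be trivially copyable");

// Hypothetical API functions: one signature accepts either raw pointer type.
inline const void *address(ConstU16Ptr s) { return s.get(); }
inline bool isNull(ConstU16Ptr s) { return s.get() == nullptr; }
```

A caller can pass a const char16_t * or a const uint16_t * to address() without a cast, which is why many existing call sites keep compiling after a raw pointer parameter is replaced by a wrapper.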

UChar pointers could not be changed to [Const]Char16Ptr in some cases.

- Virtual functions were not changed so as not to disrupt subclassing.
- In some functions that were already overloaded, changing a raw pointer to a wrapper class would have made call sites ambiguous and thus break compilation anyway.

All remaining occurrences of UChar in public ICU C++ headers are replaced with char16_t.

- This ensures that the C++ API will link with the library code even if UChar is configured to some other type.
- The ICU libraries never need to be recompiled due to UChar configuration.

The effect of the overloads and pointer-wrapper classes is that a lot of C++ source code calling ICU C++ functions should continue to compile and work without change.

However, there will be cases where call sites need to be adjusted.

Pointer conversion

Explicit conversion between char16_t * and its sibling types will be necessary between ICU C and C++ APIs if UChar is configured to something different from char16_t, and between ICU APIs and ICU-using code until the latter is also migrated to char16_t.

The following classes and functions are defined in the new header file unicode/char16ptr.h.

It might be convenient to use a backport-to-ICU-58 version of the new unicode/char16ptr.h header file in order to make ICU-calling code work with both ICU 58 and ICU 59.

For conversion to [const] char16_t * use temporary instances of ICU's new pointer-wrapper classes ConstChar16Ptr or Char16Ptr:

    UnicodeString s;
    const UChar *reorderStart = ...; // or const uint16_t * etc.
    const UChar *limit = ...;
    s.setTo(ConstChar16Ptr(reorderStart), (int32_t)(limit-reorderStart));

For conversion to [const] UChar *, call the toUCharPtr() pointer conversion functions (needed only if you configure UChar≠char16_t):

    const char16_t *srcChars = ...;
    int32_t srcLength = u_strlen(toUCharPtr(srcChars));

If you use your own typedef that is compatible with ICU 58 UChar, call toOldUCharPtr():

    UnicodeString s;
    const char16 *p = toOldUCharPtr(s.getBuffer()); // char16 defined like OldUChar = ICU 58 UChar

For example, on Windows:

    UnicodeString filename;
    const UChar *p = filename.getBuffer(); // now by default UChar=char16_t
    HANDLE file = CreateFile2(p, // pointer type mismatch
                              GENERIC_READ, FILE_SHARE_READ, OPEN_EXISTING, NULL);

→

    UnicodeString filename;
    const WCHAR *p = toOldUCharPtr(filename.getBuffer()); // explicit conversion to wchar_t *
    HANDLE file = CreateFile2(p,
                              GENERIC_READ, FILE_SHARE_READ, OPEN_EXISTING, NULL);

Fixing call sites

We expect more and more C++ code in general to move to C++11 and its new UTF-16 type and literals.

Use class UnicodeString if possible, in particular its read-only-aliasing constructor, writable-aliasing constructor, etc. Use getBuffer(), getTerminatedBuffer() etc. with toUCharPtr() or toOldUCharPtr() as necessary.

Where compilation fails because conversion of NULL is now ambiguous, change it to nullptr: The NULL macro is compiler-dependent. It can be nullptr, or simply 0 (an int constant), or a zero integer with the same number of bits as a pointer (as in clang/gcc NULL=__null). Some of these become ambiguous when functions are newly overloaded.
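
The ambiguity can be reproduced with a small overload set. This is a sketch under an assumption: the functions named which() are hypothetical, and the set is assumed to include a std::nullptr_t overload next to the pointer overloads, which is what lets nullptr resolve cleanly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical overload set, shaped like a function that gained
// overloads for several bit-compatible pointer types.
inline int which(const char16_t *) { return 1; }
inline int which(const uint16_t *) { return 2; }
inline int which(std::nullptr_t)   { return 3; }  // assumed to exist

inline int callWithNullptr() {
    // which(NULL);  // typically fails to compile: when NULL is 0 or __null,
    //               // it converts equally well to every overload above
    return which(nullptr);  // exact match for the std::nullptr_t overload
}
```

Without the std::nullptr_t overload, nullptr itself would be ambiguous between the two pointer overloads, so the fix depends on the overload set providing one.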

Code that relies on a particular definition of UChar≠char16_t for its own use can configure UChar to that type. This will not affect ICU C++ API which now explicitly uses char16_t. However, passing pointers between ICU C and C++ APIs then requires explicit pointer conversion.

Code that relies on a particular definition of UChar≠char16_t for its own use can replace UChar with that type and add explicit conversions as necessary.

As an example, see the Chromium source changes to "Prepare Chromium and Blink for ICU 59", making the code work with both ICU 58 and ICU 59.

ICU4C Platform Support

- All: Compiler support for C++11 is now required for building the ICU libraries.
  - GCC: version 4.8 and later has been tested.
- AIX: use xlC C/C++ for AIX 13.1.3 or later.
- macOS: Xcode 8.3 (LLVM clang 8.1.0) has been tested.
- Solaris: use Oracle Developer Studio 12.5 or later.
- Windows:
  - The minimum supported version is Windows 7. Windows XP and Windows Vista are no longer supported.
  - Building the Visual Studio UWP projects requires Visual Studio 2015 Update 3 with a version of the Windows 10 SDK installed.
- IBM z: use xlC C/C++ for IBM z/OS Version 2 Release 2 or later.

Updates in ICU 59.2

- New Japanese era Reiwa (令和) support
- IANA tzdata2019a

ICU4C Download

Latest ICU4C 59 Release

Version: 59.2

Release Date: 2019-04-11

Source and binary downloads are available on the git/GitHub tag page: https://github.com/unicode-org/icu/releases/tag/release-59-2

Previous ICU4C 59 Releases

Version: 59.1

Release Date: 2017-04-14

- API Changes since ICU4C 58
  - Changes of UChar→char16_t and [const] UChar→[Const]Char16Ptr are not shown.
- Readme

ICU4J Download

Latest ICU4J 59 Release

Version: 59.2

Release Date: 2019-04-11

Source and binary downloads are available on the git/GitHub tag page: https://github.com/unicode-org/icu/releases/tag/release-59-2

Maven dependency:

    <dependency>
        <groupId>com.ibm.icu</groupId>
        <artifactId>icu4j</artifactId>
        <version>59.2</version>
    </dependency>

Previous ICU4J 59 Releases

Version: 59.1

Release Date: 2017-04-14
