Resource Bundle Issues
Yoshito Umaoka sent the following to the ICU team on 2008-nov-05:
ICU resource bundle issues
Hi folks,
I was trying to fix some resource bundle issues and did some
investigation. I now think there are several design issues in the ICU
resource bundle implementation. I do not have any concrete plan for
making fixes/changes. For now I just want to bring some topics to the
table and check with you whether there are any misunderstandings or
historical background I am missing.
1. Top level table vs. nested table
Locale bundles are organized based on the locale inheritance model. Any
resource associated with a key is folded into the locale inheritance
hierarchy. A table is a collection of resource items associated with
keys. To access a resource item, you're supposed to use ures_getByKey to
locate the container resource. However, ures_getByKey behaves
differently for a top level table and a nested table. For a top level
table, locale inheritance is involved - that is, if a resource associated
with the given key is missing in the table, the implementation tries to
find the resource in the parent locale bundle. On the other hand, if
ures_getByKey is called against a nested table and a resource associated
with the given key is missing, the error code U_MISSING_RESOURCE_ERROR is
returned. To access a resource in a nested table based on locale
inheritance, ures_getByKeyWithFallback must be used, but this is not a
public API.
For example -
----------------------
en {
    month1 { "January" }
    dow {
        day1 { "Sunday" }
    }
}
----------------------
en_US {
    dow {
        day2 { "Monday" }
    }
}
----------------------
UErrorCode status = U_ZERO_ERROR;
UResourceBundle *bundle = ures_open(MYBUNDLE, "en_US", &status);
bundle = ures_getByKey(bundle, "month1", bundle, &status);
This code will return a UResourceBundle whose type is URES_STRING with
"January" (picked from bundle "en").
UErrorCode status = U_ZERO_ERROR;
UResourceBundle *bundle = ures_open(MYBUNDLE, "en_US", &status);
bundle = ures_getByKey(bundle, "dow", bundle, &status);
bundle = ures_getByKey(bundle, "day1", bundle, &status);
However, the code above fails with U_MISSING_RESOURCE_ERROR, because
"dow" is available in "en_US" but does not contain "day1".
I think the public API makes no distinction between a UResourceBundle
returned by ures_open (top level) and one returned by ures_getByKey
(nested), although isTopLevel is set in the implementation. Given that,
the difference in behavior is not a good idea.
Steven told me that there was a clear distinction between a top level
bundle and a nested bundle when the ICU resource bundle was originally
developed. He also pointed out that some ICU consumers do not want any
locale fallback in nested resource access. However, we agreed that the
default behavior for nested items should be "with fallback". From the
design point of view, the fallback behavior should be governed by the
mode in which a bundle is opened - i.e. ures_open (with fallback) vs.
ures_openDirect (with NO fallback).
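The proposed behavior can be illustrated with a small toy model. The
sketch below is not ICU code: it represents the "en" and "en_US" bundles
from the example as nested maps, and the class FallbackSketch, its
methods, and the boolean flag standing in for the open mode (ures_open
vs. ures_openDirect) are all hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the "en" / "en_US" example as nested maps; this is not ICU code. */
final class FallbackSketch {

    /** Builds the example bundles, most specific locale first. */
    static List<Map<String, Object>> exampleChain() {
        Map<String, Object> enDow = new LinkedHashMap<>();
        enDow.put("day1", "Sunday");
        Map<String, Object> en = new LinkedHashMap<>();
        en.put("month1", "January");
        en.put("dow", enDow);

        Map<String, Object> enUsDow = new LinkedHashMap<>();
        enUsDow.put("day2", "Monday");
        Map<String, Object> enUs = new LinkedHashMap<>();
        enUs.put("dow", enUsDow);

        return Arrays.asList(enUs, en);
    }

    /**
     * Looks up a path of keys. With fallback enabled, the whole path is
     * retried in each parent bundle; without it, only the first bundle in
     * the chain is consulted.
     */
    static Object lookup(List<Map<String, Object>> chain, boolean fallback, String... path) {
        int limit = fallback ? chain.size() : 1;
        for (int i = 0; i < limit; i++) {
            Object current = chain.get(i);
            for (String key : path) {
                current = (current instanceof Map) ? ((Map<?, ?>) current).get(key) : null;
                if (current == null) {
                    break;
                }
            }
            if (current != null) {
                return current;
            }
        }
        return null; // stands in for U_MISSING_RESOURCE_ERROR
    }
}
```

In this model, lookup(chain, true, "dow", "day1") finds "Sunday" in "en"
even though "en_US" has its own "dow" table, while the same call with
fallback disabled finds nothing.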
2. Resource iteration
As I mentioned above, top level item lookup is done with fallback.
However, the iteration API, ures_getNextResource, does not fall back.
From a resource bundle consumer's point of view, fallback should be
transparent. In the example above, "month1" resides in bundle "en". You
can get the resource by ures_getByKey with "month1" even if you actually
opened the bundle "en_US". However, with ures_getNextResource, you cannot
get there. That would make sense if top level item lookup did not fall
back, but it does.
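One way to picture iteration that is consistent with lookup is to
enumerate the union of keys across the inheritance chain. The sketch
below is a toy model over plain maps, not ICU code; IterationSketch and
visibleKeys are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model: iterate top level keys as they would be visible through fallback. */
final class IterationSketch {

    /**
     * Returns every top level key reachable from the first bundle in the
     * chain, where the chain is ordered from the most specific locale to
     * its parents. Keys already seen in a child are not repeated.
     */
    static List<String> visibleKeys(List<Map<String, Object>> chain) {
        Set<String> seen = new LinkedHashSet<>();
        for (Map<String, Object> bundle : chain) {
            seen.addAll(bundle.keySet());
        }
        return new ArrayList<>(seen);
    }
}
```

For the "en_US" and "en" bundles of the example, this yields both "dow"
and "month1", matching what key lookup with fallback can reach.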
3. Hierarchical key access
In our locale bundles, we organize the contents in a hierarchical manner.
All items related to calendars are under "calendar". I think ICU
consumers also want to organize their own resources this way. With the
public APIs, you have to navigate through the hierarchy node by node to
reach the actual resource object - that is, you have to call ures_open
first, then call ures_getByKey repeatedly to get down to the desired
leaf. I think we should have an API that locates a leaf object in one
call, instead of repeated ures_getByKey calls. For example, given the
hierarchical key "calendar/gregorian/dayNames/format/abbreviated", it
should get to the desired resource instead of calling ures_getByKey 5
times.
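The idea of a single-call lookup can be sketched as a loop over the
segments of a "/"-separated key. This is a toy model over nested maps,
not a proposed ICU signature; PathLookupSketch and findByPath are
hypothetical names, and locale fallback is left out to keep it short.

```java
import java.util.Map;

/** Toy model: resolve a "/"-separated key against nested tables in one call. */
final class PathLookupSketch {

    /** Returns the resource at the given path, or null if any segment is missing. */
    static Object findByPath(Map<String, Object> table, String path) {
        Object current = table;
        for (String key : path.split("/")) {
            if (!(current instanceof Map)) {
                return null; // reached a leaf before the path was consumed
            }
            current = ((Map<?, ?>) current).get(key);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```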
4. Alias resolution problem (ICU4J only?)
I mentioned this potential issue a while ago. Assume we have locale
bundles below -
----------------------
root {
    calendar {
        buddhist {
            AmPmMarkers {"AM", "PM"}
            dayNames:alias {"/LOCALE/calendar/gregorian/dayNames"}
        }
    }
}
----------------------
xx {
    calendar {
        gregorian {
            dayNames {"SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"}
        }
        buddhist {
            AmPmMarkers {"AA", "PP"}
        }
    }
}
----------------------
xx_YY {
    calendar {
        gregorian {
            dayNames {"NUS", "NOM", "EUT", "DEW", "UHT", "IRF", "TAS"}
        }
    }
}
----------------------
There are two ways to get to calendar/buddhist/dayNames.
a) getWithFallback with a hierarchical key
ICUResourceBundle b = ICUResourceBundle.getBundleInstance(MYLOCALEBUNDLE,
new ULocale("xx_YY"));
b = b.getWithFallback("calendar/buddhist/dayNames");
b) getWithFallback by navigating resource hierarchy
ICUResourceBundle b = ICUResourceBundle.getBundleInstance(MYLOCALEBUNDLE,
new ULocale("xx_YY"));
b = b.getWithFallback("calendar");
b = b.getWithFallback("buddhist");
b = b.getWithFallback("dayNames");
The major difference is this: if you navigate down the hierarchy one
level at a time (case b), "/LOCALE" in dayNames:alias (root) is
interpreted as the locale name "xx", while it is resolved as "xx_YY" with
the single hierarchical key "calendar/buddhist/dayNames" (case a). This
is because the original context for opening a bundle is not carried along
when you open bundles one by one.
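The effect can be reproduced with a toy model of the three bundles above.
This is only a sketch of the mechanism as described here, not the ICU4J
implementation: AliasSketch, Node, open and child are hypothetical names,
and the assumption is that each step-by-step lookup knows only the locale
of the bundle it landed in, not the locale originally requested.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the alias example; locale names and data mirror the bundles above. */
final class AliasSketch {
    static final String ALIAS_PREFIX = "/LOCALE/";
    static final Map<String, Map<String, Object>> BUNDLES = new HashMap<>();

    static {
        BUNDLES.put("root", table("calendar", table("buddhist", table(
                "AmPmMarkers", Arrays.asList("AM", "PM"),
                "dayNames", ALIAS_PREFIX + "calendar/gregorian/dayNames"))));
        BUNDLES.put("xx", table("calendar", table(
                "gregorian", table("dayNames",
                        Arrays.asList("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT")),
                "buddhist", table("AmPmMarkers", Arrays.asList("AA", "PP")))));
        BUNDLES.put("xx_YY", table("calendar", table(
                "gregorian", table("dayNames",
                        Arrays.asList("NUS", "NOM", "EUT", "DEW", "UHT", "IRF", "TAS")))));
    }

    /** A located resource plus the locale of the bundle it was found in. */
    static final class Node {
        final String locale;
        final String path;
        final Object value;

        Node(String locale, String path, Object value) {
            this.locale = locale;
            this.path = path;
            this.value = value;
        }
    }

    static Map<String, Object> table(Object... keysAndValues) {
        Map<String, Object> t = new HashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            t.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return t;
    }

    static String parentOf(String locale) {
        if (locale.equals("root")) {
            return null;
        }
        int cut = locale.lastIndexOf('_');
        return cut < 0 ? "root" : locale.substring(0, cut);
    }

    static Object walk(Map<String, Object> bundle, String path) {
        Object current = bundle;
        for (String key : path.split("/")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    /** Looks up path with locale fallback; "/LOCALE" aliases restart at aliasContext. */
    static Node find(String startLocale, String path, String aliasContext) {
        for (String loc = startLocale; loc != null; loc = parentOf(loc)) {
            Object value = walk(BUNDLES.get(loc), path);
            if (value instanceof String && ((String) value).startsWith(ALIAS_PREFIX)) {
                String target = ((String) value).substring(ALIAS_PREFIX.length());
                return find(aliasContext, target, aliasContext);
            }
            if (value != null) {
                return new Node(loc, path, value);
            }
        }
        return null;
    }

    /** Case a: one call with a hierarchical key; the requested locale is the context. */
    static Node open(String locale, String path) {
        return find(locale, path, locale);
    }

    /** Case b: one level at a time; only the locale of the current node is known. */
    static Node child(Node node, String key) {
        return find(node.locale, node.path + "/" + key, node.locale);
    }
}
```

In this model, open("xx_YY", "calendar/buddhist/dayNames") yields the
"NUS", ... list from "xx_YY", whereas opening "calendar" and then calling
child for "buddhist" and "dayNames" yields the "SUN", ... list from "xx",
because the second step has already fallen back to "xx".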
5. ICU4J UResourceBundle extending java.util.ResourceBundle??
com.ibm.icu.util.UResourceBundle extends java.util.ResourceBundle. There
is one major design difference between these two classes - Java
ResourceBundle does not support hierarchical keys. On the other hand,
UResourceBundle supports a key hierarchy. "get" in Java ResourceBundle
simply returns a resource object, while "get" in UResourceBundle returns
a UResourceBundle.
By contract, a subclass of Java ResourceBundle must implement Object
handleGetObject(String key). The resource bundle lookup framework is part
of Java ResourceBundle. So a subclass of Java ResourceBundle should be
just a container of resources (supporting a string key map).
ICU4J UResourceBundle comes with its own bundle lookup logic, which is
independent of Java ResourceBundle. By the design of Java
ResourceBundle, a resource lookup is done by following the "parent"
chain. But with the current UResourceBundle implementation, "parent"
always points to the parent bundle's top level table. If a .res file has
a flat structure (all keys belong to the top level table), it should be
equivalent to Java ResourceBundle. However, by the nature of
UResourceBundle, it usually has nested tables. So the basic contract of
Java ResourceBundle is broken in UResourceBundle.
In my opinion, there is no benefit in implementing UResourceBundle as a
subclass of java.util.ResourceBundle, for the reasons above.
UResourceBundle duplicates and extends the resource lookup framework
defined by Java ResourceBundle. Even if someone wants to use .res via
java.util.ResourceBundle (for example, Java 6 allows you to implement your
own resource format using ResourceBundle.Control), it does not work well
unless the .res files contain only top level keys.
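For comparison, a subclass that follows the java.util.ResourceBundle
contract is just a keyed container, and the base class walks the "parent"
chain whenever handleGetObject returns null. The sketch below uses only
standard java.util APIs; the class name FlatMapBundle is hypothetical and
is not part of ICU4J.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Map;
import java.util.ResourceBundle;
import java.util.Set;

/** A flat, map-backed bundle that relies on ResourceBundle's own parent lookup. */
final class FlatMapBundle extends ResourceBundle {
    private final Map<String, Object> table;

    FlatMapBundle(Map<String, Object> table, ResourceBundle parentBundle) {
        this.table = table;
        if (parentBundle != null) {
            setParent(parentBundle);
        }
    }

    @Override
    protected Object handleGetObject(String key) {
        // Returning null makes ResourceBundle.getObject consult the parent.
        return table.get(key);
    }

    @Override
    public Enumeration<String> getKeys() {
        Set<String> keys = new HashSet<>(table.keySet());
        if (parent != null) {
            keys.addAll(Collections.list(parent.getKeys()));
        }
        return Collections.enumeration(keys);
    }
}
```

With flat keys, a child bundle transparently inherits entries from its
parent. If a value is itself a table, the parent chain only applies to
the top level key, so entries inside a parent's nested table are not
merged into the child's table of the same name.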
6. ICU4J UResourceBundle (more specifically, ICUResourceBundle)
performance issues
Actually, this is the reason why I started looking into the ICU resource
bundle architecture closely. I think there are several implementation
issues.
- Contents of .res is copied into byte[] when a bundle object is created.
- A new Java Object is created when a resource is requested.
- Keys are represented by ASCII bytes in byte[] and resource lookup is
done by binary search. (compact, but much slower than HashMap)
On the other hand, java.util.ListResourceBundle does the following -
- Java objects for keys/values are created when a bundle class is loaded.
- A reference of a Java Object is returned when a resource is requested.
- Keys are interned and HashMap is used for resource look up.
In ICU locale bundles, there are many string resources. When a string
resource is requested, the implementation eventually copies bytes to
create a String object. Before ICU 4.0, UResourceBundle created a new
String object every time the same resource was requested. In 4.0 we
introduced a bundle cache for tables at every hierarchical level using
HashMap, so it no longer creates a new Java Object if the type is
immutable. However, with this implementation, UResourceBundle objects
themselves are cached, including various fields other than the actual
resource data. Some of these fields depend on the loading context.
Therefore an ICUResourceBundle instance loaded with non-default options
is either intentionally excluded from the cache, or cached incorrectly
because of an implementation bug, which causes problems. Also, even when
all of the resources are loaded into the caches, the byte[] holding the
entire .res contents is never released. So technically, the
implementation keeps duplicated data (one copy in the bundle caches,
another in byte[]) in the Java heap.
To reduce the memory footprint, I was thinking about using a memory
mapped file like ICU4C does. But the contents of .res are loaded as a
resource stream via the Java class loader - that is, it is not a regular
file. For this reason, we cannot use java.nio.MappedByteBuffer.
I think the right thing to do is (if we continue to use .res) -
- Contents of .res should be transformed into Java Objects (for example,
text data -> String, int vector -> int[]...) when a bundle is created.
Also, String objects for keys are created/interned at the same time.
- Release the byte[] holding the raw .res data. (Ideally, we want to read
.res as a stream and transform the contents into a native Java Object
structure sequentially. But the .res format is not designed to support
stream parsing and requires loading the entire contents into byte[]
once...)
- UResourceBundle/ICUResourceBundle should work like a simple cursor for
navigating the Java resource objects loaded above.
- A HashMap is created when a resource is requested from a table
resource.
With the implementation above, we have to pay a one-time cost at load
time. However, it should not be much different from the standard Java
resource types (loading a ResourceBundle class also creates Java Objects
from bytecode.)
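The table part of this proposal might look roughly like the sketch below.
It is an illustration only: DecodedTable is a hypothetical class, the
keys are assumed to arrive as one ASCII byte[] per key rather than in the
real .res layout, and building the HashMap on the first lookup is one
reading of the last bullet above.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy model of a table decoded into Java objects once, at load time. */
final class DecodedTable {
    private final String[] keys;       // interned when the table is created
    private final Object[] values;     // String, int[], nested DecodedTable, ...
    private Map<String, Object> index; // built on the first lookup

    /** Decodes the ASCII key bytes; afterwards the caller can drop the raw bytes. */
    DecodedTable(byte[][] asciiKeys, Object[] values) {
        if (asciiKeys.length != values.length) {
            throw new IllegalArgumentException("keys and values differ in length");
        }
        this.keys = new String[asciiKeys.length];
        for (int i = 0; i < asciiKeys.length; i++) {
            this.keys[i] = new String(asciiKeys[i], StandardCharsets.US_ASCII).intern();
        }
        this.values = values.clone();
    }

    /** Returns the already decoded object; nothing is copied per request. */
    synchronized Object get(String key) {
        if (index == null) {
            index = new HashMap<>();
            for (int i = 0; i < keys.length; i++) {
                index.put(keys[i], values[i]);
            }
        }
        return index.get(key);
    }
}
```

Because values are decoded once, repeated requests for the same key
return the same object reference, which is the behavior described for
java.util.ListResourceBundle above.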
7. Resource accessor method for mutable resource types in ICU4J
UResourceBundle
UResourceBundle has some methods returning mutable Java types below -
public ByteBuffer getBinary()
public byte[] getBinary(byte[] ba)
public int[] getIntVector()
public String[] getStringArray()
Technically, resources are read-only and should not be modifiable. So,
with these method signatures, the implementation must create a new array
every time. Even if we want to cache a resource object, it must be cloned
whenever the resource is accessed via these methods. This is actually a
limitation of the Java language (no const array type). I think we should
add a read-only array wrapper type (and use a read-only ByteBuffer for
the first one) and use these types.
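A minimal sketch of what such wrappers could look like follows.
ByteBuffer.asReadOnlyBuffer() is a standard java.nio method; the classes
ReadOnlyIntArray and ReadOnlyViews are hypothetical names for
illustration and do not exist in ICU4J.

```java
import java.nio.ByteBuffer;

/** Read-only view over an int vector; the backing array is never exposed. */
final class ReadOnlyIntArray {
    private final int[] data;

    private ReadOnlyIntArray(int[] data) {
        this.data = data;
    }

    /** Copies the source once, so later changes by the caller are not visible. */
    static ReadOnlyIntArray copyOf(int[] source) {
        return new ReadOnlyIntArray(source.clone());
    }

    int length() {
        return data.length;
    }

    int get(int index) {
        return data[index];
    }

    /** Callers that really need a mutable array pay for the copy explicitly. */
    int[] toArray() {
        return data.clone();
    }
}

final class ReadOnlyViews {
    /** Wraps cached bytes without copying; writes through the view are rejected. */
    static ByteBuffer readOnly(byte[] cached) {
        return ByteBuffer.wrap(cached).asReadOnlyBuffer();
    }
}
```

With types like these, a cached resource can be handed out repeatedly
without cloning, since callers cannot modify the shared data.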
In addition to this, I think "public byte[] getBinary(byte[] ba)" does not
make sense. Even if you provide the fill-in arg, byte[] ba, it cannot be
used if the size of the data does not match the given byte[]. I think it
was ported from C, where the function takes a length as an input/output
arg.
8. Bad APIs in ICU4J UResourceBundle
This is actually my fault. Just before Ram left the project, he was
working on moving some internal APIs (ICUResourceBundle) to public
(UResourceBundle). He forgot to assign some API status comments and I
added them without understanding the entire picture. Now I think I should
not have made the following APIs public (I mean public APIs in the sense
of API status; the signature is actually "protected"). These APIs take a
HashMap as an argument to avoid cyclic references caused by an invalid
ALIAS, which is really our internal implementation detail.
protected UResourceBundle handleGet(String aKey,
HashMap table,
UResourceBundle requested)
protected UResourceBundle handleGet(int index,
HashMap table,
UResourceBundle requested)