General-Purpose Compression
We have been thinking about using general-purpose compression for some time.
Pro:
Better compression ratio than custom compact encodings in many cases.
Handles arbitrary data, not just strings or integers.
Low maintenance when using well-tested open source libraries.
Con:
Too much overhead (time and memory) for very small amounts of data, such as short strings.
ICU4C data loading memory-maps data (strings, binary collation blobs, etc.). Decompression would require heap-allocating the uncompressed result, normally per process. Up-front decompression would waste time and space for unused data. Incremental decompression requires synchronization. See also "Runtime Issues for Compression" on the "Resource Bundle Strings" page.
Requirements:
Must be compatible with the ICU license.
Must be supported in Java, ideally via built-in java.util.zip: zlib, gzip or zip.
Recommendation:
It looks like zlib might be the best choice: It is open source with a notice-style license like ICU's, is free of patents, documented in RFCs (1950 & 1951) supported in java.util.zip, used in .jar and .png files, used in Apache, ...
Granularity
Compress whole .dat package. Pro: Might be best compression because it can optimize across pieces. Con: Need to decompress all of the data before accessing any of it.
Compress per data item/file, e.g. per resource bundle, except for the header. Pro: Could be used with any data item/file without modification to its internal format. The general data writing and loading code could handle compression and decompression. Con: If it is common for only parts of a data item/file to be accessed, such as in resource bundles, then the decompression may read and decompress much more data than necessary, and the per-data-item/file load time increases.
Compress per piece of data, e.g. per string in a resource bundle. Pro: Does not decompress what's not used, faster bundle loading. Con: Probably inefficient for short strings (which are most common). Need per-string synchronization for multithreading.