New Syntax
Introduce new syntax, as follows. By default, auto-detect from the message pattern string whether it uses the old syntax or the new one.
Use digraphs to enclose both arguments and sub-messages: {|...|} instead of {...}.
ASCII digraphs for easy input, and easy string literals in non-Unicode source code.
Unusual combinations of characters that are rare in normal text.
Use the same digraphs for both arguments and sub-messages. Otherwise we believe that users will be confused. (And we would have to come up with a second set of low-collision digraphs.)
Make the plural/selectordinal remainder placeholder look like an argument: {|#|}
Allow white space just like with an argument, for consistency: {| # |}
In a choice argument style, also enclose sub-messages in {|...|} rather than delimiting them with the preceding comparison operator and a trailing single |
Allow white space before the {| and after the |} as in plural/select/selectordinal.
We retain the # or ≤ in the choice format so that the syntax and the MessagePattern.Part stream remain mostly the same.
Add two more MessagePattern.ApostropheMode values: (The enum name does not really fit the choices well.)
Existing DOUBLE_OPTIONAL (ICU 4.8 default behavior)
Existing DOUBLE_REQUIRED (JDK compatibility mode)
New HEURISTIC (proposed to be the default behavior in ICU 50)
Detects the new syntax simply via pattern.contains("{|") or equivalent.
Note: The new syntax looks to an old parser like a syntax error, unless it is immediately preceded by a quote-starting ASCII apostrophe that does not solely enclose the curly brace: "... '{| ...' ..." -- this seems extremely unusual.
New SINGLE_AND_DIGRAPH_BRACES: no mechanism for quoting/escaping in the new syntax.
Digraphs can be disrupted with space or U+200B ZWSP or similar.
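A minimal sketch of how literal text could be protected by disrupting the digraphs with U+200B. The class and method names are hypothetical and not part of any ICU API; the sketch also assumes that leaving an invisible ZWSP in the formatted output is acceptable.

```java
final class DigraphBreaker {
    // U+200B ZERO WIDTH SPACE: invisible, but enough to keep "{" and "|" from forming a digraph.
    private static final String ZWSP = "\u200B";

    /** Returns the text with every literal "{|" and "|}" split by a ZWSP. */
    static String breakDigraphs(String literalText) {
        return literalText
                .replace("{|", "{" + ZWSP + "|")
                .replace("|}", "|" + ZWSP + "}");
    }

    public static void main(String[] args) {
        String safe = breakDigraphs("set notation: {|x| < 1}");
        System.out.println(safe.contains("{|")); // false
    }
}
```

The breaker character remains in the output, so this only suits text where an invisible character is harmless.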
New method ApostropheMode getResolvedApostropheMode()
If getApostropheMode()==HEURISTIC, then getResolvedApostropheMode() will return
HEURISTIC if no pattern has been parsed yet.
DOUBLE_OPTIONAL or SINGLE_AND_DIGRAPH_BRACES after a pattern has been parsed.
When parsing another pattern, the heuristic is applied again. (Hence the new method.)
Otherwise getResolvedApostropheMode() will return the same value as getApostropheMode()
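The resolution rules above could be sketched as follows. This is a stand-in class, not the real MessagePattern: the class name is hypothetical, the enum only mirrors the constant names used in this proposal, and actual pattern parsing is omitted.

```java
final class MessagePatternSketch {
    enum ApostropheMode { DOUBLE_OPTIONAL, DOUBLE_REQUIRED, SINGLE_AND_DIGRAPH_BRACES, HEURISTIC }

    private final ApostropheMode mode;
    private ApostropheMode resolved;

    MessagePatternSketch(ApostropheMode mode) {
        this.mode = mode;
        this.resolved = mode; // HEURISTIC stays HEURISTIC until a pattern is parsed
    }

    /** Applies the heuristic (if requested) each time a pattern is parsed. */
    MessagePatternSketch parse(String pattern) {
        if (mode == ApostropheMode.HEURISTIC) {
            resolved = pattern.contains("{|")
                    ? ApostropheMode.SINGLE_AND_DIGRAPH_BRACES
                    : ApostropheMode.DOUBLE_OPTIONAL;
        }
        // Real parsing of the pattern would happen here.
        return this;
    }

    ApostropheMode getApostropheMode() { return mode; }

    ApostropheMode getResolvedApostropheMode() { return resolved; }

    public static void main(String[] args) {
        MessagePatternSketch p = new MessagePatternSketch(ApostropheMode.HEURISTIC);
        System.out.println(p.getResolvedApostropheMode());                   // HEURISTIC
        System.out.println(p.parse("{|name|}").getResolvedApostropheMode()); // SINGLE_AND_DIGRAPH_BRACES
        System.out.println(p.parse("{name}").getResolvedApostropheMode());   // DOUBLE_OPTIONAL
    }
}
```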
C++ API signatures:
New UMessagePatternApostropheMode constants
New MessagePattern method
UMessagePatternApostropheMode getResolvedApostropheMode () const
ICU ticket: "more robust MessageFormat syntax"
With ApostropheMode.DOUBLE_OPTIONAL (ICU 4.8 default behavior):
In folder {folder_name} on {when, date, yyyy-MM-dd} there are
{num1, choice, 1<very few|4#at least four} files including
{num2, plural, offset:1 =0 {no} =1 {one} =2 {two}
other {one and # from group '#'3}} text files from
{gender, select, female {her} male{his}other{their}} collection.
With ApostropheMode.SINGLE_AND_DIGRAPH_BRACES (proposed ICU 50 behavior):
In folder {|folder_name|} on {|when, date, yyyy-MM-dd|} there are
{|num1, choice, 1< {|very few|} 4# {|at least four|}|} files including
{|num2, plural, offset:1 =0 {|no|} =1 {|one|} =2 {|two|}
other {|one and {|#|} from group #3|}|} text files from
{|gender, select, female {|her|} male{|his|}other{|their|}|} collection.
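Both samples keep the choice limits 1< (exclusive) and 4# (inclusive). As an illustration of those limit semantics only, the following uses the JDK's standalone java.text.ChoiceFormat with the old-syntax choice style from the first sample; it does not exercise ICU or the proposed digraph syntax, and the class name is hypothetical.

```java
import java.text.ChoiceFormat;

final class ChoiceLimitsDemo {
    /** Selects a sub-message with the old-style choice pattern from the sample. */
    static String choose(double value) {
        ChoiceFormat cf = new ChoiceFormat("1<very few|4#at least four");
        return cf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(choose(2));   // very few
        System.out.println(choose(3.5)); // very few
        System.out.println(choose(4));   // at least four
        System.out.println(choose(10));  // at least four
    }
}
```

In the JDK, a value below the first limit falls back to the first sub-message.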
Problems with old syntax
Uses single-ASCII-character syntax which is prone to collide with normal text and requires quoting/escaping.
Unfortunately, we made this worse in plural arguments by using a single # sign for the number.
Quoting is done with ASCII apostrophes, which definitely collide with normal text, although this got improved significantly in ICU 4.8.
... {num, plural, ... other{# fans call them their #1 band}} ... (with num=5 this becomes "5 fans call them their 51 band")
... {gender, select, male{He said 'No way!'} ...} ... (fails with syntax error: ASCII apostrophe before variant-ending } starts quoted literal text)
... {num, plural, =1{Seleccioneu un màxim d'# widget més} ...} ... (fails with syntax error: ASCII apostrophe before plural number replacement # starts quoted literal text)
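The apostrophe collision can be reproduced with the JDK's java.text.MessageFormat, whose quoting rules correspond to the DOUBLE_REQUIRED mode. This is only an illustration with simple arguments (the JDK class has no plural or select arguments), and the wrapper class name is hypothetical.

```java
import java.text.MessageFormat;

final class ApostropheCollisionDemo {
    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // The apostrophes are consumed as quoting syntax and vanish from the output.
        System.out.println(format("He said 'No way!' to {0}", "Bob"));
        // -> He said No way! to Bob

        // The apostrophe starts quoted literal text, so {0} is never replaced.
        System.out.println(format("un m\u00E0xim d'{0} widgets", 5));
        // -> un màxim d{0} widgets

        // Doubling the apostrophe is the required workaround.
        System.out.println(format("un m\u00E0xim d''{0} widgets", 5));
        // -> un màxim d'5 widgets
    }
}
```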
Minor Issue
In some translation tools, it has been awkward to have sub-messages enclosed in {pairs of curly braces}. It would be slightly easier for those tools to instead terminate sub-messages at the next syntax token. On the other hand, the {pairs} are useful for avoiding white space characters when a message is broken into lines without enclosing each line in "quotes" or similar. (White space (line feeds, indentation) is not an issue in C++ or Java when messages are written in "string literals", but can be awkward if messages are in XML contents or similar.)
Use digraphs (pairs of characters) for syntax, to make syntax tokens very unlikely to collide with normal text.
Each digraph should be a very unusual (in human-readable text) combination of unusual characters.
For parsing, each of these characters should be Pattern_Syntax.
For typing, each of these characters should be an ASCII Pattern_Syntax character [\- , ; \: ! ? . ' " ( ) \[ \] \{ \} @ * / \\ \& # % ` \^ + < = > | ~ \$], although this conflicts with the goal of choosing unusual characters.
Maybe we could use Latin-1 Pattern_Syntax characters [¡ ¿ « » § ¶ © ® ° ± ÷ × ¬ ¦ ¤ ¢ £ ¥] as a compromise between developer input and avoiding "normal text characters".
See German Wikipedia about keyboard input of quotation marks.
Most syntax has begin-end tokens with further specs in between. At least in some of the simpler cases, we could make it so that only a completely syntactically correct begin-spec-end sequence is recognized as syntax, while any other sequence is just text (and does not cause a syntax error).
This would strongly reduce possible collisions with normal text.
However, it might complicate the parser, and it would turn actual syntax errors into visible text rather than reporting an error.
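A sketch of the "only a complete begin-spec-end sequence is syntax" idea, restricted to simple arguments and using {|...|} as the tokens. All names here are hypothetical, \s stands in for Pattern_White_Space, and argument names are assumed to be plain identifiers. It also shows the drawback noted above: a malformed argument silently becomes visible text.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictArgumentScanner {
    // Begin token, optional white space, an identifier, optional white space, end token.
    private static final Pattern SIMPLE_ARG =
            Pattern.compile("\\{\\|\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*\\|\\}");

    /** Replaces well-formed simple arguments; everything else is kept as literal text. */
    static String format(String pattern, Map<String, ?> args) {
        Matcher m = SIMPLE_ARG.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Object value = args.get(m.group(1));
            // Unknown argument names are left in place rather than reported.
            String replacement = value == null ? m.group() : value.toString();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Collections.singletonMap("folder", "Docs");
        System.out.println(format("In {| folder |}: {| is just text, and so is {|a b|}", values));
        // -> In Docs: {| is just text, and so is {|a b|}
    }
}
```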
Arguments should be terminated with a digraph as well, so that the last sub-message in a complex argument is terminated with that rather than with a single character.
Maybe add an optional token like {.} for terminating a sub-message, or maybe {+} for generally ignoring following Pattern_White_Space? ({+} reminiscent of Java String concatenation.)
These would be useful if we chose to bracket selectors rather than sub-messages.
This seems to indicate that bracketing selectors is as problematic as bracketing sub-messages.
Maybe no quoting/escaping at all.
Or something sufficiently obscure like {'x'} where x is a single escaped Pattern_Syntax character. If someone really needs to write "$}" they could escape the $ and write "{'$'}}".
Or something that harmlessly breaks the digraph rather than quoting a character. For example, reserve some character that results in no output at all, and insert it between the two characters that we want to prevent from forming a digraph. Or insert a special token that has no output, such as {°} which could also serve as a "this is new syntax" indicator.
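A minimal sketch of the {'x'} escape from the list above. The class name is hypothetical, and ASCII punctuation (\p{Punct}) is used as an approximation of "a single Pattern_Syntax character". A real parser would resolve the escape during tokenization so that the resulting character is not scanned again as syntax; this sketch only shows the textual effect.

```java
import java.util.regex.Pattern;

final class SyntaxCharEscapes {
    // {'x'} where x is exactly one ASCII punctuation character.
    private static final Pattern ESCAPE = Pattern.compile("\\{'(\\p{Punct})'\\}");

    /** Replaces each {'x'} escape with the escaped character itself. */
    static String unescape(String text) {
        return ESCAPE.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(unescape("{'$'}}"));  // $}
        System.out.println(unescape("{'#'}3"));  // #3
        System.out.println(unescape("{'ab'}"));  // {'ab'} (not a single character, left alone)
    }
}
```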
ICU transition:
We probably want an attribute on MessageFormat & MessagePattern.
It would be good if we could auto-detect the new syntax, so that users need not switch APIs.
For auto-detection, we might need a special token that just indicates the new syntax if we do not otherwise need arguments/placeholders, to suppress old-syntax quoting/escaping. Or else we ignore argument-less message patterns?
Sample: All-ASCII syntax, bracketing selectors
When selectors are bracketed, then a sub-message is terminated either by another selector or by the end-of-argument token.
Choice format selectors: {:1<:} for an exclusive limit (sub-message for values greater than 1) vs. {:4:} for an inclusive limit (sub-message for values greater than or equal to 4).
Plural format selectors: We could distinguish explicit values vs. keywords by looking for a leading digit (although ICU 4.8 allows any non-syntax, non-whitespace character in keywords), adding syntax inside the selector, or using different selector begin-end tokens. This sample adds an equals sign inside the explicit-value selector, similar to the ICU 4.8 syntax.
In folder {@folder_name@} on {@when, date, yyyy-MM-dd@} there are
{@num1, choice, {:1<:}very few{:4:}at least four@} files including
{@num2, plural, offset:1 {:=0:}no{:=1:}one{:=2:}two{+}
{:other:}one and {#} from group {{'#'}3}@} text files from
{@gender, select, {:female:}her{:male:}his{:other:}their {'@'}{'}'}@} collection.
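With bracketed selectors, splitting an argument style into selector/sub-message pairs is a flat scan: each sub-message runs to the next selector or to the end of the style. A rough sketch under simplifying assumptions: the class name is hypothetical, the input is the style text already extracted from between the argument tokens, there are no nested arguments, and the {+} token and escapes are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BracketedSelectorSplitter {
    /** Maps each {:selector:} to the sub-message that follows it, in pattern order. */
    static Map<String, String> split(String style) {
        Map<String, String> result = new LinkedHashMap<>();
        int pos = style.indexOf("{:");
        while (pos >= 0) {
            int selectorEnd = style.indexOf(":}", pos + 2);
            if (selectorEnd < 0) {
                throw new IllegalArgumentException("unterminated selector at index " + pos);
            }
            String selector = style.substring(pos + 2, selectorEnd);
            int messageStart = selectorEnd + 2;
            // The sub-message ends at the next selector or at the end of the style.
            int next = style.indexOf("{:", messageStart);
            result.put(selector, style.substring(messageStart, next < 0 ? style.length() : next));
            pos = next;
        }
        return result;
    }

    /** Explicit-value selectors carry a leading equals sign, as in the sample. */
    static boolean isExplicitValue(String selector) {
        return selector.startsWith("=");
    }

    public static void main(String[] args) {
        System.out.println(split("offset:1 {:=0:}no{:=1:}one{:other:}many"));
        // -> {=0=no, =1=one, other=many}
    }
}
```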
Sample: Syntax with Latin-1, bracketing sub-messages
Sub-messages are {§bracketed like this§}. No syntax for "end sub-message" or "ignore following white space".
Choice format selectors: 1< for an exclusive limit (sub-message for values greater than 1) vs. 4 for an inclusive limit (sub-message for values greater than or equal to 4).
In folder {°folder_name°} on {°when, date, yyyy-MM-dd°} there are
{°num1, choice, 1< {«very few»} 4 {«at least four»}°} files including
{°num2, plural, offset:1 =0 {«no»} =1 {«one»} =2 {«two»}
other {«one and {¤} from group {{¦¤¦}3}»}°} text files from
{°gender, select, female {«her»} male{«his»}other{«their {¦§¦}{¦}¦}»}°} collection.
Same idea, different characters, no quoting/escaping syntax:
In folder «%folder_name%» on «%when, date, yyyy-MM-dd%» there are
«%num1, choice, 1< «{very few}» 4 «{at least four}»%» files including
«%num2, plural, offset:1 =0 «{no}» =1 «{one}» =2 «{two}»
other «{one and «#» from group #3}»%» text files from
«%gender, select, female «{her}» male«{his}»other«{their %#}»%» collection.
Sample: Syntax with trigraphs, bracketing sub-messages
Sub-messages are ~+{bracketed like this}+~. No syntax for "ignore following white space".
No syntax for quoting/escaping in this sample. Needed? Maybe ~%{}%~ as a new-syntax marker and trigraph breaker with no output by itself?
Choice format selectors: 1< for an exclusive limit (sub-message for values greater than 1) vs. 4 for an inclusive limit (sub-message for values greater than or equal to 4).
In folder ~%{folder_name}%~ on ~%{when, date, yyyy-MM-dd}%~ there are
~%{num1, choice, 1< ~+{very few}+~ 4 ~+{at least four}+~}%~ files including
~%{num2, plural, offset:1 =0 ~+{no}+~ =1 ~+{one}+~ =2 ~+{two}+~
other ~+{one and ~%{#}%~ from group #3}+~}%~ text files from
~%{gender, select, female ~+{her}+~ male~+{his}+~other~+{their xyz}+~}%~ collection.
Sample: Somewhat XML-ish
Argument is one of
"{arg" PWS+ arg-name PWS+ "/}"
"{arg" PWS+ arg-name PWS+ ":" PWS+ arg-type PWS+ "/}"
"{arg" PWS+ arg-name PWS+ ":" PWS+ arg-type PWS+ "}" arg-style "{/arg}".
where PWS=Pattern_White_Space.
Sub-message begins with "{msg" PWS+ selector "}" and ends with "{/msg}".
No syntax for quoting/escaping; needed?
Maybe optional {msg}...{/msg} around the whole message for syntax version disambiguation? Or just scan for presence of "{/arg}" and similar? Or use square brackets in new syntax? Or use "{$arg" ... "{/arg$}"?
In folder {arg folder_name/} on {arg when:date}yyyy-MM-dd{/arg} there are
{arg num1:choice}{msg 1<}very few{/msg}{msg 4<=}at least four{/msg}{/arg} files including
{arg num2:plural}{offset:1}{msg =0}no{/msg}{msg =1}one{/msg}{msg =2}two{/msg}
{msg other}one and {plural_number/} from group #3{/msg}{/arg} text files from
{arg gender:select}{msg female}her{/msg}{msg male}his{/msg}{msg other}their xyz{/msg}{/arg} collection.
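The two self-closing argument forms could be recognized with a single regular expression. This is a sketch with hypothetical names: \s approximates Pattern_White_Space, names and types are assumed to be word characters, and white space is treated as optional around the colon and before "/}", since the sample writes {arg folder_name/} without it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlishArgumentSketch {
    // "{arg" PWS+ name [ ":" type ] "/}", with optional white space around ":" and before "/}".
    private static final Pattern SELF_CLOSING_ARG =
            Pattern.compile("\\{arg\\s+(\\w+)\\s*(?::\\s*(\\w+)\\s*)?/\\}");

    /** Describes a token, or returns "text" if it is not a self-closing argument. */
    static String describe(String token) {
        Matcher m = SELF_CLOSING_ARG.matcher(token);
        if (!m.matches()) {
            return "text";
        }
        return m.group(2) == null
                ? "arg " + m.group(1)
                : "arg " + m.group(1) + " of type " + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("{arg folder_name/}"));  // arg folder_name
        System.out.println(describe("{arg when : date /}")); // arg when of type date
        System.out.println(describe("{arg when:date}"));     // text (this form needs a style and {/arg})
    }
}
```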
Original Sample
In folder {$folder_name$} on {$when, date, yyyy-MM-dd$} there are
{$num1, choice, {<1}very few{<=4}at least four$} files including
{$num2, plural, offset:1 {=0}no{=1}one{=2}two{+}
{:other}one and {#} from group {{'#'}3}$} text files from
{$gender, select, {:female}her{:male}his{:other}their {'$'}{'$'}$} collection.
Mark's comments on the original version & sample
Some thoughts on alternatives.
{%folder_name%} or {!folder_name!} or {$folder_name$}
{'<any char>'}
Also, we could use {% or to trigger new vs old syntax, and then have the same method....