
Monday September 18, 2006
JSR 269 in proposed final draft
I'm happy to announce that JSR 269 has progressed to the proposed final draft stage of the JCP process. This draft corresponds to the version of JSR 269 implemented in build 98 of JDK 6. Please send comments to jsr-269-comments@jcp.org.
Changes from the public review draft include:
- Added originating-elements var-args parameters to the Filer create methods to enable better management of dependencies.
- Changed the return type of the getFooName methods to a separate Name interface that extends CharSequence.
- Revised the Type visitor structure.
- Included a description of the annotation processing discovery process.
- Various specification clarifications:
  - More explicit null pointer defaults.
  - The modeling API is meant to be used for multiple purposes, including but not limited to annotation processing.
  - More notes on anticipated evolution of the API.
(2006-09-18 17:44:45.0)
Iterating over the codepoints of a String
Recently I wanted to iterate over the code points of a String instead of its char values. Unicode 3.1 added supplementary characters, bringing the total number of characters to more than the 2^16 (65,536) characters that can be distinguished by a single 16-bit char value. Therefore, a char value no longer has a one-to-one mapping to the fundamental semantic unit in Unicode. JDK 5 was updated to support the larger set of character values. Instead of changing the definition of the char type, the new supplementary characters are represented by a surrogate pair of two char values. To reduce naming confusion, the term "code point" will be used to refer to the number that represents a particular Unicode character, including supplementary ones.
Therefore, a sequence of char values can be thought of as a variable-length encoding of a sequence of code points; the older characters (in the Basic Multilingual Plane) are represented by a single char value while the newer supplementary characters take two char values. The definition of language concepts, like identifiers, was rephrased in terms of code points instead of chars. The existing isFoo(char) methods in the Character class were augmented with isFoo(int) overload siblings using an int to store a code point value.
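As a minimal sketch of this variable-length encoding, the example below builds a String containing one Basic Multilingual Plane character and one supplementary character, then compares the char-based and code-point-based views. The class name SurrogatePairDemo and the helper fromCodePoint are illustrative only, and U+10400 (DESERET CAPITAL LETTER LONG I) is just one convenient supplementary letter; the Character and String methods used are the ones added in JDK 5.

```java
final class SurrogatePairDemo {
    // U+10400 lies outside the Basic Multilingual Plane, so it needs two chars.
    static final int DESERET_LONG_I = 0x10400;

    // Illustrative helper: encode a single code point as a String.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = "a" + fromCodePoint(DESERET_LONG_I);

        // Three char values, but only two code points.
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));

        // The char at index 1 is only half of a surrogate pair, so the
        // char-based predicate does not see a letter there.
        char half = s.charAt(1);
        System.out.println("high surrogate? " + Character.isHighSurrogate(half));
        System.out.println("isLetter(char): " + Character.isLetter(half));

        // The int overload looks at the whole code point.
        int cp = s.codePointAt(1);
        System.out.println("isLetter(int):  " + Character.isLetter(cp));
    }
}
```

The isLetter(char) call returns false for the lone surrogate while isLetter(int) returns true for the full code point, which is why the int overloads were needed.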
Previously, one way to iterate through the character values of a String was to look at each char value in turn:
String s = ...
for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    // Process c...
}
Now, the chars need to be considered as possible members of a surrogate pair representing a single code point. Currently the canonical loop for this operation is:
String s = ...
for (int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
    cp = s.codePointAt(i);
    // Process cp...
}
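For a version that can be compiled and run as-is, the sketch below wraps the same loop in a method that collects the code points into a list. The class name CodePointLoop and the method codePointsOf are made up for this illustration; the loop itself is unchanged.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointLoop {
    // Collects the code points of s, in order, using the loop shown above.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
            cp = s.codePointAt(i);
            result.add(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+10400 written as its surrogate pair, then 'b'.
        String s = "a\uD801\uDC00b";
        for (int cp : codePointsOf(s)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Note that codePointAt returns the surrogate value itself when it meets an unpaired surrogate, and charCount then returns 1, so the loop still advances and terminates on malformed input.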
At present, there is no direct API support for getting an iterator of the code point values from a String or CharSequence; perhaps one will be added in the future.
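In the meantime, a small wrapper can fill the gap. The following is only a sketch of one possible shape for such an iterator, not a proposed API: the class name CodePoints and its factory method of are hypothetical, and the only library calls used are Character.codePointAt(CharSequence, int) and Character.charCount(int). (Readers on Java 8 or later can use CharSequence.codePoints(), which returns an IntStream, instead.)

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper: exposes the code points of a CharSequence as an Iterable.
final class CodePoints implements Iterable<Integer> {
    private final CharSequence seq;

    private CodePoints(CharSequence seq) {
        this.seq = seq;
    }

    static CodePoints of(CharSequence seq) {
        if (seq == null) {
            throw new NullPointerException("seq");
        }
        return new CodePoints(seq);
    }

    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int index = 0; // position in char units, not code points

            public boolean hasNext() {
                return index < seq.length();
            }

            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int cp = Character.codePointAt(seq, index);
                index += Character.charCount(cp); // advance by 1 or 2 chars
                return cp;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

With this in place, the loop becomes for (int cp : CodePoints.of(s)) { ... }. The trade-off is that each code point is boxed into an Integer, which the explicit index-based loop avoids.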
Glossary of Unicode terms
(2006-09-18 16:21:22.0)