Monday September 18, 2006
Joseph D. Darcy's Sun Weblog
Iterating over the codepoints of a String

Recently I wanted to iterate over the code points of a String. Therefore, a sequence of [...] Previously, one way to iterate through the character values of a String was to look at each char in turn:
String s = ...
for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    // Process c...
}
Now, the chars need to be considered as possible members of a surrogate pair representing a single code point. Currently the canonical loop for this operation is:
String s = ...
for (int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
    cp = s.codePointAt(i);
    // Process cp...
}
At present, there is no direct API support for getting an iterator of the code point values from a String or CharSequence; perhaps one will be added in the future.
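As a small illustration of the difference between the two loops, here is a sketch that runs both over a string containing one supplementary character. The class name CodePointLoops, its method names, and the sample string are made up for this example. The third method assumes Java 8 or later, where CharSequence.codePoints() is available and returns an IntStream of code point values.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointLoops {
    // char-by-char loop: a surrogate pair shows up as two separate values.
    static List<Integer> charValues(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add((int) s.charAt(i));
        }
        return out;
    }

    // Code point loop: the index advances by one or two chars per step.
    static List<Integer> codePointValues(String s) {
        List<Integer> out = new ArrayList<>();
        for (int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
            cp = s.codePointAt(i);
            out.add(cp);
        }
        return out;
    }

    // Assumes Java 8+: the stream yields primitive ints, so nothing is boxed.
    static int[] codePointsViaStream(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E encoded as a surrogate pair, then 'b'
        String sample = "a\uD834\uDD1Eb";
        System.out.println("chars:       " + charValues(sample).size());
        System.out.println("code points: " + codePointValues(sample).size());
        System.out.println("via stream:  " + codePointsViaStream(sample).length);
    }
}
```

The char loop reports four values for this sample, while both code point variants report three, with the middle value being 0x1D11E.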
Glossary of Unicode terms
(2006-09-18 16:21:22.0)

Comments [2]
import static java.lang.Character.charCount;
import static java.lang.Character.codePointAt;

import java.util.Iterator;
import java.util.NoSuchElementException;

public class CodePointIterator implements Iterator<Integer> {
    private final char[] chars;
    private int next;

    public CodePointIterator(CharSequence cs) {
        // Copy the chars so later mutation of cs cannot affect the iteration.
        chars = cs instanceof String ? ((String) cs).toCharArray() : cs.toString().toCharArray();
    }

    public boolean hasNext() {
        return next < chars.length;
    }

    public Integer next() {
        // >= rather than >: next == chars.length means the sequence is exhausted.
        if (next >= chars.length) throw new NoSuchElementException();
        int nc = codePointAt(chars, next);
        next += charCount(nc);
        return nc;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}

or something similar. But there's a pretty heavy autoboxing overhead if your strings are large.

Posted by Ian Phillips on September 19, 2006 at 01:00 PM PDT #

Extracting a String from the CharSequence is a good way to avoid issues with the argument CharSequence being mutated as it is iterated over. However, the instanceof check for String is unnecessary: String.toString() just returns this, so the call is very fast.
If the characters of the character sequence are mostly in the ASCII range, then the autoboxing semantics will require cached Integer objects to be returned.
Posted by Joe Darcy on September 20, 2006 at 12:20 PM PDT #
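
The caching remark in the comment above can be checked with a short sketch. The class BoxingCacheDemo and the method sameBox are hypothetical names used only for illustration. The Java Language Specification guarantees that boxing an int in the range -128 to 127 yields the same Integer object each time, which covers the whole ASCII range. For larger values, such as supplementary code points, no such guarantee exists and a fresh object may be allocated on each boxing.

```java
final class BoxingCacheDemo {
    // Boxes the same int twice and reports whether both boxes are one object.
    static boolean sameBox(int codePoint) {
        Integer first = codePoint;   // autoboxing, equivalent to Integer.valueOf(codePoint)
        Integer second = codePoint;
        return first == second;      // reference comparison, on purpose
    }

    public static void main(String[] args) {
        // ASCII values fall inside the guaranteed cache range.
        System.out.println("'A' shares one box: " + sameBox('A'));
        // Outside the range the result depends on the JVM and its settings.
        System.out.println("U+1D11E shares one box: " + sameBox(0x1D11E));
    }
}
```

So for mostly-ASCII input an Iterator<Integer> hands back cached objects, while text rich in non-ASCII characters may pay for an allocation per code point.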