Joseph D. Darcy's Sun Weblog

Joseph D. Darcy's Sun Weblog


20100205 Friday February 05, 2010

Recognizing all valid integral strings with regular expressions

For a fun Friday hack, this blog entry describes a program to generate a regular expression to recognize all the valid strings of integers in a given base, that is, a regular expression (regex) which accepts all in-range strings but rejects strings that would cause an overflow if converted.

This sort of regular expression could be used to verify a string won't cause a NumberFormatException when processed by one of the many text → number methods in the Java platform, such as Integer.parseInt(String s, int base). However, using a regular expression may be more expensive than attempting the conversion and catching the exception on invalid input; the regular expression discussed is possibly of more academic interest than practical significance!

An analogous regular expression for floating-point strings has been published in the javadoc for several releases.

First, is the set of strings in a given base that will convert to a 32-bit or 64-bit integer a regular language, a language that can be recognized by a regular expression? Yes, because ignoring sequences of leading zeros for a moment, there are only a finite number of strings that can be converted to integer values, one string for each of the 232 or 264 values in the int or long types, respectively. All finite languages are regular, strings of zero of more leading zeros are regular, and concatenations of regular languages are regular. Putting those facts together, the entire set of convertible strings forms a regular language. This reasoning is not dependent on the base so it holds for all integer bases, including bases 2 through 36 supported by Java's libraries.

A core regular expression of billions of strings offered as alternatives ("1|2|3|...|2147483647") while fine from a mathematical perspective, would be too slow and awkward to use. Fortunately, the pattern of valid strings has sufficient structure to yield reasonably short and manageable regular expressions.

To start we will only consider decimal strings; additionally, we will assume only ASCII digits need to be recognized. Integer values range from

-2147483648

to

 2147483647

These strings both have 10 digits. Putting aside leading zeros for the moment, all strings of 9 or fewer digits are valid, with or without a leading minus sign. A regex to recognize strings with 9 or fewer digits is trivial; less obvious is how to specify the "ragged" 10-digit nonzero strings which are in range.

The core regular expression covering non-ragged inputs is

"(-)?"+			// optional leading minus sign
"0*"+			// leading zeros
"(\p{Digit}{1,9})"	// numbers with 1 to 9 digits

For the 10-digit strings, if the leading digit is less than the maximum digit, all other digits can have any value:

1(\p{Digit}{9})

If the first digit is at it's maximum, then if the second digit is less than its maximum, the third and subsequent digits can have any value:

20\p{Digit}{8}

Likewise, if the first and second digits are at their maximum, then if the third digit is less than its maximum, the fourth and subsequent digits can have any value. Continuing this pattern for all the digit positions:

1(\p{Digit}{9})|
20(\p{Digit}{8})|
21[0-3](\p{Digit}{7})|
214[0-6](\p{Digit}{6})|
2147[0-3](\p{Digit}{5})|
21474[0-7](\p{Digit}{4})|
214748[0-2](\p{Digit}{3})|
2147483[0-5](\p{Digit}{2})|
21474836[0-3](\p{Digit}{1})|
214748364[0-7]

This regular expression can be ORed with the regex for strings of 1 to 9 digits listed above. Finally, a separate case for the most negative value must be added:

-0*2147483648

All together, with some newlines for readability this yields:

((-)?0*((\p{Digit}){1,9})
|([0-1]\p{Digit}{9})
|(20\p{Digit}{8})
|(21[0-3]\p{Digit}{7})
|(214[0-6]\p{Digit}{6})
|(2147[0-3]\p{Digit}{5})
|(21474[0-7]\p{Digit}{4})
|(214748[0-2]\p{Digit}{3})
|(2147483[0-5]\p{Digit}{2})
|(21474836[0-3]\p{Digit}{1})
|(214748364[0-7])
)|
(-0*2147483648)

The ragged pattern for nonzero strings with maximum possible length will exist for bases that aren't powers of 2, for the Java libraries that includes all conversion radixes other than 2, 4, 8, 16, and 32. For powers of two, the pattern is more regular. For example, in base 16 the values range from

-80000000

to

7fffffff

so for base 16 the regular expression can simply be

((-)?0*(\p{XDigit}{1,7}|
	[0-7](\p{XDigit}{7}))|
(-0*80000000)

Generalizing the regular expression generation algorithm to any given base:

  1. Create a regex for an optional leading minus sign ("-") and zero or more leading zeros, "0*."

  2. For the base in question, generate the strings for MIN_VALUE and MAX_VALUE.

  3. Find n, the length of the MAX_VALUE string. Signed strings of that base with (n-1) or fewer characters are all valid inputs; create a regex recognizing (n-1) or fewer digits.

  4. Create a regex to recognize valid strings of length n.

  5. Combine the above regular expressions.

  6. To the above, concatenate a separate regex for the most negative value,

    -(0*)MIN_VALUE_STRING

Taken together, this regex will recognize exactly those strings convertible to the base in question.

Untested code implementing this algorithm is available for your viewing pleasure. Any further debugging, tuning, and enhancements are left as "an exercise for the reader," happy hacking!

(2010-02-05 08:34:00.0) Permalink Comments [1]

20100203 Wednesday February 03, 2010

java.util.Objects and friends

A small project I worked on during JDK 7 milestones 05 and 06 was the introduction of a java.util.Objects class to serve as a home for static utility methods operating on general objects (6797535, 6889858, 6891113). Those utilities include null-safe or null-tolerant methods for comparing two objects, computing the hash code of an object, and returning a string for an object, operations generally relating to the methods defined on java.lang.Object.

The code to implement each of these methods is very short, so short it is tempting to not write tests when adding such methods to a code base. But the methods they aren't so simple that mistakes cannot be made; replacing such helper methods with a common, tested version from the JDK would be a fine refactoring.

The current set of public methods in java.util.Objects is:

  • static boolean equals(Object a, Object b)

  • static boolean deepEquals(Object a, Object b)

  • static <T> int compare(T a, T b, Comparator<? super T> c)

  • static int hashCode(Object o)

  • static int hash(Object... values)

  • static String toString(Object o)

  • static String toString(Object o, String nullDefault)

  • static <T> T nonNull(T obj)

  • static <T> T nonNull(T obj, String message)

The first two methods define two equivalence relations over object references. Unlike the equals methods on Object, the equals(Object a, Object b) method handles null values. That is, true is returned if both arguments are null or if the first argument is non-null and a.equals(b) returns true. A method with this functionality is an especially common utility method to write, there are several versions of it in the JDK, so I expect the two-argument equals will be one of the most heavily used methods in the Objects class.

The second equivalence relation is defined by the deepEquals method. The equals and deepEquals relations can differ for object arrays; see for the javadoc for details. Equality implies deep equality, but the converse is not true. For example, in the program below arrays c and d are deep-equals but not equals.

public class Test {
   public static void main(String... args) {
       Object common = "A string in common.";
       Object[] a = {common};
       Object[] b = {common};
       Object[] c = {a};
       Object[] d = {b};
       // c and d are deepEquals, but not equals
   }
}

A third equivalence relation is the object identity relation defined by the == operator on references, but since that is already built into the language, no library support is needed. Identity equality implies equals equality and deepEquals equality.

Next, Objects includes a null-tolerant Comparator-style method which first compares for object identity using == before calling the provided Comparator. While Comparable classes aren't as widely available as the methods inherited from java.lang.Object, Comparable is a very useful and frequently implemented interface.

Objects has two hash-related methods. The first is a null-handling hash method which assigns null a zero hash code and the second is a utility method for implementing a reasonable hash function for a class just by passing in the right list of values.

The toString methods provide null handling support, in case of a null argument either returning "null" or the provided default string.

Finally, there are two methods to more conveniently handle null checks, intended to be useful when validating method and constructor parameters.

Taken together, the methods in Objects should lessen the pain and tedium of null handling until more systematic approaches are used.

The Objects API was shaped by discussion in various threads on core-libs-dev in September and October 2009. Several other bugs were also fixed as a result of those discussions, one adding a set of compare methods for primitive types (6582946) and another to consistently define the hash codes of the wrapper classes (4245470).

(2010-02-03 10:58:27.0) Permalink Comments [1]

20100202 Tuesday February 02, 2010

JDK 7: New Component Delivery Model Delivered

Thanks to Kelly, the new component delivery model for jaxp and jax-ws is now available in both JDK 7, as of build 72 of milestone 5, and OpenJDK 6, coming in build 18 (6856630).

As described previously, the JDK build no longer tracks a copy of the jaxp and jax-ws sources under version control. Instead source bundles from the upstream teams are used. The jaxp.properties file in the jaxp repository contains the default URL from which the source bundle is downloaded as well as the expected checksum for that file. The analogous setup is used for jax-ws in its repository. To avoid downloading another copy of a bundle or to try out an alternate bundle, several variables can be set in the ant build of one of the repositories. For jaxp,

jaxp-repo-directory$ ant -f build.xml \
-Ddrops.master.copy.base=path-to-drop-directory \
-Djaxp_src.bundle.name=name-of-source-zip-bundle-in-drop-directory \
-Djaxp_src.bundle.md5.checksum=md5-of-source-zip-bundle

If changes local to the JDK are needed, patches can be applied from the new patches directory in the two repositories. For example, patches are a mechanism that could be used to deploy security fixes until a new source bundle with those fixes was externally available.

With this new delivery model, I look forward to low-overhead and coordinated updates to jaxp and jax-ws in OpenJDK 6 and JDK 7.

A possible future consolidation would fold the build logic in the now vestigial jaxp and jax-ws repositories into the main jdk repository.

(2010-02-02 18:03:31.0) Permalink Comments [0]

20100201 Monday February 01, 2010

Purging LD_LIBRARY_PATH

Java developers are familiar with dynamic linking. Class files are a kind of intermediate format with symbolic references. At runtime, a class loader will load, link, and initialize new types as needed. Typically the full classpath a class loader uses for searching will have several logically distinct sub components, including the boot classpath, endorsed standards, extension directories, and the user-specified classpath. The manifest of a jar file can also contain Class-Path entries. Together, these paths delineate the boundaries of "jar hell."

For many years, modern Unix systems have also supported dynamic linking for C programs. Instead of a classpath, there is a runpath of locations to look to for resolving symbolic references. Like the classpath, the full runpath has multiple components, including a default component for system facilities (analogous to boot classpath), a component stored in a shared object (analogous to jar file Class-Path entries), as well as an end-user specified component (analogous to the -classpath command line option or CLASSPATH environment variable). The details of linking on Solaris are well explained in Sun's Linker and Libraries Guide. Other contemporary Unix platforms like Linux and MacOS have similar facilities, although the details of the various commands differ.

One of the tasks the JDK's launcher has handled is setting a suitable runpath for the JVM and platform libraries. Historically a runpath was needed to link in the desired JVM, such as client or server, and other system libraries. The client JVM and the server JVM are separate shared objects which support the same set of interfaces; by interpreting the command line flags the launcher selects which JVM to link in. Operationally, the linking is initiated by the Unix dlopen library call. So that the caller of the java command did not need to set LD_LIBRARY_PATH, after selecting the JVM to run the launcher would modify the LD_LIBRARY_PATH environment variable by prepending the path to the JVM shared object (and paths to other directories with JDK native system libraries). However, the runtime linker only reads the value of LD_LIBRARY_PATH when a process starts. Therefore, to have the new value take effect, the launcher would call an exec-family system call to start the process anew. Such re-execing to set LD_LIBRARY_PATH is not recommended practice on Unix systems.

The re-execing to set LD_LIBRARY_PATH had a number of unpleasant consequences in the launcher code. There is only a narrow path to pass information between the exec parent and the exec child, such as by modifying environment variables, which is generally discouraged. To decide whether or not an exec was needed, the launcher checked whether the prefix of LD_LIBRARY_PATH had the expected value; if it did, no exec was done for that purpose and infinite exec loops were avoided. Presetting LD_LIBRARY_PATH to the right value before calling java could thus be used to suppress the exec. There were also complications with correctly supporting multiple LD_LIBRARY_PATH variables on Solaris1 and handling suid java executions on Linux.2

The proper way to accommodate such dependencies is not to set LD_LIBRARY_PATH but rather to use the runtime linker facilities analogous to jar file Class-Path entries; the facility is the $ORIGIN dynamic string token for the runtime linker. As the name implies, $ORIGIN is expanded to the path directory of the object file in question; thus relative paths to other directories can be specified. Therefore, as long as the directory structure of the JDK and JRE are known, $ORIGIN can be used to record any necessary dependencies.

For some time, the JDK build has actually used $ORIGIN in creating its native libraries. Therefore, it may have been the case that LD_LIBRARY_PATH was not actually needed. However, verifying that LD_LIBRARY_PATH was not actually needed would require building an exec-free JDK on all supported Unix platforms and running tests that exercise the all libraries in the directories no longer added to LD_LIBRARY_PATH. The engineering for Kumar's purge of execing for LD_LIBRARY_PATH was generally straightforward: deleting the the LD_LIBRARY_PATH-related code in the Unix java_md.c file and doing builds on all platforms. Most of the effort of getting this fix back involved running tests to verify everything still worked. The testing revealed an unneeded, troublesome symlink that was removed at the same time LD_LIBRARY_PATH usage was purged.

While the launcher no longer execs to set the LD_LIBRARY_PATH, there are still cases where an exec will occur for other reasons. If the java command is requested to change data models using the -d32 or -d64 flag, that is, a 32-bit java command is asked to run a 64-bit JVM or vice versa, an exec is needed to effect the change. Also, multiple JRE support, where a different version is requested via the -version:foo flag, will also cause an exec if a different Java version needs to be run. However, before Kumar's fix the common case was that the launcher would exec once; now the common case is that the launcher will exec zero times.3

I'm very happy this messy use of LD_LIBRARY_PATH has finally been removed in JDK 7. The removal makes the launcher code both simpler and more maintainable. Unless your use of java relies on the number of execs that occur, the change should be largely transparent, other than startup being marginally faster. One situation to be aware of is launching a LD_LIBRARY_PATH-free JDK 7 java command from a JDK 6 or earlier java process. If the LD_LIBRARY_PATH variable of the older JDK is not cleared, it can affect the liking of the JDK 7 process.


1 Since Solaris 7, that OS line has supported three LD_LIBRARY_PATH variables:

  • LD_LIBRARY_PATH_32: if set, overrides LD_LIBRARY_PATH for 32-bit processes.

  • LD_LIBRARY_PATH_64: if set, overrides LD_LIBRARY_PATH for 64-bit processes.

  • LD_LIBRARY_PATH: used by both 32-bit and 64-bit processes is not overridden by a data model specific variable.

On Solaris, back in JDK 1.4.2 I fixed the launcher to properly take into account all three variables (4731671); on re-exec the data model specific environment variable is unset and LD_LIBRARY_PATH contains the old data model specific value prepended with the JDK system paths. Tests to verify all this used to live in and around test/tools/launcher/SolarisDataModel.sh, but they have thankfully been deleted as they are no longer relevant.

2 For suid or sgid binaries, LD_LIBRARY_PATH is handled differently to avoid security problems. While the Solaris runtime linker applies more scrutiny to LD_LIBRARY_PATH in this case, on Linux glibc sets LD_LIBRARY_PATH to the empty string. Since the empty string will not contain the expected JDK system directories, the prefix-checking logic detected this case to avoid an infinite exec loop (4745674). Running java suid or sgid isn't necessarily recommended, but it is possible. To actually resolve linking dependencies for such binaries, OS-specific configuration may be needed to add JDK directories to the set of trusted paths.

3 Before my batch of launcher fixes in JDK 1.4.2, the number of execs was even more varied. Specifying a different data model would exec twice, once to change the data model and again to set the LD_LIBRARY_PATH for that data model. From JDK 1.4.2 until the purge of LD_LIBRARY_PATH, the launcher used a single exec to set the LD_LIBRARY_PATH to the target data model (4492822).

(2010-02-01 11:24:15.0) Permalink Comments [3]

20100125 Monday January 25, 2010

What is the launcher?

One surprisingly tricky piece of the Java platform is the launcher, the set of C code that uses the JNI invocation API to get the JVM started and begin running the main class. While conceptually simple, the launcher is complicated by straddling the boundary between the host system and the JVM, often wrestling with native platform issues like thread configuration that need to be managed before starting the JVM. The launcher's tasks include selecting which VM to run (client, server, etc.) and running in the requested data model, 32-bit or 64-bit.

In the jdk Mercurial repository, the source code of the launcher is primarily composed of:

  • src/share/bin/java.{c, h}

  • Other files in the src/share/bin directory, including the Java launcher infrastructure utilties, jli_util.{c, h}.

  • src/solaris/bin/java_md.{c, h} (covers both Solaris and Linux using #defines)

  • src/windows/bin/java_md.{c, h} (covers various and sundry versions of MS Windows)

The launcher code is used to build the executables in the JDK bin directory; every invocation of java and commands like javac first executes through the launcher. Consequently mistakes in the launcher can cause severe build problems and cross-platform testing is especially important. On doing numerical work, Prof. Kahan has advised "The best you can hope for is that if you do your job very well, no one will notice" and working on the launcher has a similar flavor; you don't want your work to get noticed!

From JDK 1.4.2 through JDK 6 I was the lead launcher maintainer, amongst other responsibilities. When I first took over launcher maintenance, I introduced regression tests and performed several rounds of refactoring. Over time, my launcher activities shifting to reviewing and advising others on their launcher-related projects including:

  • In JDK 5, multiple JRE support (java -version:foo to invoke Java version foo from some other version) and ergonomics to provide better out-of-the-box tuned performance.

  • In JDK 6, the splashscreen functionality and classpath wildcards.

Since JDK 6 first shipped, I've happily handed over the reigns of primary launcher care and feeding to Kumar, while still being involved in code reviews and writing the occasional blog entry. Kumar moved common functinality of the launcher into a shared library to make the source more robust and speed up the builds.

In the near future I'll be writing about a long-standing launcher flaw concerning the Unix exec system call and LD_LIBRARY_PATH Kumar has fixed in JDK 7.

(2010-01-25 09:10:00.0) Permalink Comments [2]

20091130 Monday November 30, 2009

Projec Coin: Post-Devoxx Update, closures and exception handling

As has been announced recently at Devoxx and covered in various places, including threads on the coin-dev mailing list, Mark Reinhold made several announcements about JDK 7 at this year's Devoxx:

  1. JDK 7 will have a form of closures.

  2. The JDK 7 schedule is being extended to fall 2010.

On the first announcement, the coin-dev list is not the appropriate forum to discuss closures in Java. Closures are hereby decreed as off-topic for coin-dev.

Mark's blog entry "Closures for Java" invites those with an informed opinion to participate in the current discussion; watch Mark's blog for news about creation of a new list or project, etc., to host this closures effort.

On the second announcement, while the JDK 7 schedule has been extended, many of the current final five (or so) Project Coin features have not yet been fully implemented, specified, and tested. Therefore, there will not be a general reassessment of Project Coin feature selection or another call for proposals in JDK 7. The final five (or so) proposals remain selected for inclusion in JDK 7 and work will continue to complete those features. However, given its technical merit and the possibility of providing useful infrastructure for ARM, improved exception handling is now being reconsidered for inclusion in JDK 7. No other "for further consideration" proposal is under reconsideration.

(2009-11-30 10:09:22.0) Permalink Comments [8]

20091118 Wednesday November 18, 2009

Project Coin at Devoxx 2009

Wednesday evening local time I gave a talk about Project Coin at Devoxx 2009. The video of the talk will be available on Parleys in due course; in the mean time, my slides are online. Temping the wrath of the demo gods, I enjoyed showing the support Project Coin features have today in a developer build of NetBeans.

(2009-11-18 16:39:28.0) Permalink Comments [0]

Project Coin: Milestone 5 Language Features in NetBeans

To go along with the language changes available in JDK 7 milestone 5, the NetBeans team has created a developer build of NetBeans supporting the same set of language changes, including improved integer literals, the diamond operator, and strings in switch.

In addition to just accepting the new syntax, the NetBeans build has some deeper support too. For example, when auto-completing on a constructor with type arguments, the diamond operator is offered as a completion. To see what bounds were computed for a diamond instance, you can just hit Ctrl and click on the constructor; the bounds will appear in a pop-up. To encourage use of strings in switch, NetBeans will recognized the pattern of "if(s.equals("foo") {...} else if (s.equals("bar)..." and offer to transform that into strings in switch, just click on the "Convert to switch" lightbulb on the left hand side.

This NetBeans build with Coin support is based on NetBeans 6.8, but when 6.8 ships later this year it will not include support for Project Coin features. Project Coin support will be included in subsequent NetBeans releases.

(2009-11-18 16:08:07.0) Permalink Comments [0]

20091104 Wednesday November 04, 2009

Project Coin: Anatomy of adding strings in switch to javac

Earlier this week, I happily pushed an implementation of Project Coin's strings in switch language feature into a repository being used for JDK 7 milestone 5. JDK 7 binaries with strings in switch will be available for downloading in due course.

The javac compiler uses the standard compiler architecture of having successive phases and adding strings in switch required modifications to several links in the chain of phases.

Since JDK 1.4.x, javac has been multiple compilers in one since it has supported multiple source settings to specify what version of the language to use. (But when cross-compiling to an older language version, make sure to set the bootclasspath!) Information about which features are supported in a given version is encapsulated in javac's class com.sun.tools.javac.code.Source. When other parts of the compiler need to query if a feature is supported, they call a boolean method like Source.allowFeature(). Strings in switch added another method to this class; in diff notation:

+ public boolean allowStringsInSwitch() {
+ return compareTo(JDK1_7) >= 0;
+ }

The text of a source file being compiled is first turned into an abstract syntax tree (AST). The lexing and parsing processes which create the trees only verify the syntactic structure of the code. Semantic checks are done in subsequent phases of the compiler; many of those type checks are done during "attribution" by code in com.sun.tools.javac.comp.Attr. For switch statements, the constant-ness of case labels, having no repeated labels, and the labels' types matching the type of the expression being switched on are all handled in Attr. Attr also enforces the type restrictions on the switch expression; the expression can have an integer type, an enum type (as of JDK 5), and now a string type (as of JDK 7).

Because the trees in javac were sufficiently abstract, modifying Attr to support strings in switch was very straightforward. Most of the code was just for adding the allowStringsInSwitch check and generating an error message specific for a non-constant string. As shown in the patch below, no code implementing the actual semantic checks for non-repeated labels, etc. had to be adjusted to support strings in switch; all the existing checking logic was reused without modification.

--- a/src/share/classes/com/sun/tools/javac/comp/Attr.java	Thu Aug 27 13:40:48 2009 +0100
+++ b/src/share/classes/com/sun/tools/javac/comp/Attr.java	Mon Nov 02 21:36:59 2009 -0800
@@ -115,6 +115,8 @@ public class Attr extends JCTree.Visitor
         allowBoxing = source.allowBoxing();
         allowCovariantReturns = source.allowCovariantReturns();
         allowAnonOuterThis = source.allowAnonOuterThis();
+        allowStringsInSwitch = source.allowStringsInSwitch();
+        sourceName = source.name;
         relax = (options.get("-retrofit") != null ||
                  options.get("-relax") != null);
         useBeforeDeclarationWarning = options.get("useBeforeDeclarationWarning") != null;
@@ -166,6 +168,16 @@ public class Attr extends JCTree.Visitor
      * API warnings.
      */
     boolean enableSunApiLintControl;
+
+    /**
+     * Switch: allow strings in switch?
+     */
+    boolean allowStringsInSwitch;
+
+    /**
+     * Switch: name of source level; used for error reporting.
+     */
+    String sourceName;
 
     /** Check kind and type of given tree against protokind and prototype.
      *  If check succeeds, store type in tree and return it.
@@ -886,7 +898,15 @@ public class Attr extends JCTree.Visitor
         boolean enumSwitch =
             allowEnums &&
             (seltype.tsym.flags() & Flags.ENUM) != 0;
-        if (!enumSwitch)
+        boolean stringSwitch = false;
+        if (types.isSameType(seltype, syms.stringType)) {
+            if (allowStringsInSwitch) {
+                stringSwitch = true;
+            } else {
+                log.error(tree.selector.pos(), "string.switch.not.supported.in.source", sourceName);
+            }
+        }
+        if (!enumSwitch && !stringSwitch)
             seltype = chk.checkType(tree.selector.pos(), seltype, syms.intType);
 
         // Attribute all cases and
@@ -909,7 +929,8 @@ public class Attr extends JCTree.Visitor
                     Type pattype = attribExpr(c.pat, switchEnv, seltype);
                     if (pattype.tag != ERROR) {
                         if (pattype.constValue() == null) {
-                            log.error(c.pat.pos(), "const.expr.req");
+                            log.error(c.pat.pos(),
+                                      (stringSwitch ? "string.const.req" : "const.expr.req"));
                         } else if (labels.contains(pattype.constValue())) {
                             log.error(c.pos(), "duplicate.case.label");
                         } else {

Two new error messages were added for strings in switch. The first reports that a constant string expression is required rather than just a constant expression. The second reports that a string switch statement has been found in a source level where that feature is not allowed. The javac convention for this kind of error message is to report the source level being used in the current compile and the source level where the structure starts being supported:

+compiler.err.string.const.req=\
+    constant string expression required

+compiler.err.string.switch.not.supported.in.source=\
+    strings in switch are not supported in -source {0}\n\
+(use -source 7 or higher to enable strings in switch)

The bulk of the compiler changes to support strings in switch were made in com.sun.tools.javac.comp.Lower, the compiler phase which translates away syntactic sugar, lowering structures to trees implementing the formerly sugar-coated functionality. For example, Lower already had a method to translate enum switches into a switch on an integer value retrieved from an enum → int map. The initial strings in switch implementation uses a similar technique: a single string switch in source code is lowered into a series of two switches. The first switch is a new synthesized switch on the string's hash code, which gets mapped to a label's ordinal position on the list of case statements. The second switch is structurally identical to the original string switch from the source except that the string case labels are replaced by integer positions and the computed position from the synthesized switch is the expression being switched on.

Discussion of various design issues and implementation alternatives can be found in the full code changes in Lower; see the webrev of the changeset for details. The chained series of switch statements to implement strings is switch is a refinement of the implementation approaches outlined in the original Project Coin strings in switch proposal.

Various smaller implement points about the code in Lower are worth noting too.

A pair of maps are built to hold information about translating string case labels → positions and hash codes → case labels with that hash. In order to make the compiler's output predictable and repeatable, the maps and sets used in these data structures are LinkedHashMaps and LinkedHashSets rather than just HashMaps and HashSets. In terms of functional correctness of code generated during a given compile, using HashMap and HashSet would be fine; the iteration order does not matter. However, we find it beneficial to have javac's output not vary based on implementation details of system classes.

The code synthesis in Lower uses a compiler-internal tree maker API. The resulting trees are unprocessed; the various consistency properties and derived information calculated by earlier compiler phases are not directly enforced. The code in Lower is responsible for generating correct code and computing necessary derived information; if this is not done correctly, later phases like Gen, which translates trees into byte code, can fail. For example, trees have an associated source position; this is preserved in debugging information. The position for the initial synthesized switch is set to the source position of the start of the original switch statement. The tree node for a break statement stores information about the target of the break, which is another tree node. Therefore, break statements synthesized as part of the initial switch need this target information to be set and any break nodes in from original switch statement need to have their targets reset to the newly generated enclosing switch where integer labels have replaced the string ones. This target patching logic gets thoroughly exercised by the tests with nested strings in switch statements!

The new tests verify that the proper structural checks are enforced for string switches as well as verify that the proper execution paths are taken on different inputs for switches with a variety of control flow shapes, including multiple case labels, case labels with colliding hash codes, and nested switches.

Now that strings in switch are specified, implemented, and tested, I'm looking forward to working on the remaining Project Coin features on the final acceptance list.

(2009-11-04 14:21:36.0) Permalink Comments [3]

20090923 Wednesday September 23, 2009

Java Posse #279: A View from an Eggshell-Colored Clock Tower

Dick Wall and the rest of the Java Posse graciously offered to interview Alex Buckley and me one evening at the JVM Language Summit last week. Our discussion about general evolution of the Java programming language, including Project Coin and reactions to previous podcasts and Google group threads, is now available as episode #279 of the Java Posse podcast.

(2009-09-23 13:10:29.0) Permalink Comments [1]

20090914 Monday September 14, 2009

Java Posse #277 Feedback: Still not a view from an ivory tower

A follow-up entry to Dick Wall's Google Group post to my earlier reaction to Java language evolution and management concerns raised in the first twenty minutes of episode #277 of the Java Posse podcast.


Anyone can have an opinion. Having an informed opinion takes some effort. Implementing the conclusions of an informed opinion can take considerably more effort.

Project Coin != http://bugreport.sun.com/bugreport/

For many years people with ideas for language changes (and other matters) have been welcome to submit them to bugs.sun.com; there is no expectation that something other than a rough idea is required. These ideas are evaluated and there are well over 100 open "requests for enhancement" (rfes) related to language changes. I reviewed all of these open ideas before embarking on Project Coin. Many other submitted language rfes have been considered and subsequently closed.

Project Coin explicitly offered a different social contract than bugs.sun.com; beyond just a vague idea, contributors were invited to participate in the work of bringing a language change into the platform. To be clear, the abundance of open language rfes means an additional idea for language change in and of itself has essentially no value. Instead, the coin of the realm in Project Coin was the analysis of the impacts and benefits of a language change and code for a prototype implementation. Those are valuable because they are essential components of the work needed to bring a language change to the platform. The Project Coin Proposal form [1] guided the analysis and the OpenJDK langtools repository gave a starting point for a compiler prototype. People could collaborate on different aspects of a proposal and the Project Coin list explicitly made such requests for assistance on-topic. The recommendation for prototypes was not made for punitive purposes; rather it was made so that more accurate information could be gathered about the language change. Quoting a comment left by Mark Mahieu on my blog [2]:

"By the time I posted a link to a runnable prototype [of the Elvis operator] to the list, my own understanding was much, much clearer, and very different; after delving into the real detail I'd come to the conclusion that it's not a good enough fit (for Java) in the form proposed."

In contrast to this productive discourse, take the brouhaha over not including multi-catch in the Coin final five left in comments on my blog. [3] My message announcing the final five makes clear that this decision was made based on resourcing concerns rather than the merits of the idea itself. Not one of the people leaving comments full of wailing and gnashing of teeth about the omission offered to do anything to help implement the feature.

It is far easier to impose demands than to satisfy them. When there is no "cost connection" between those imposing demands and those satisfying them, ridiculous expectations can result, such as this individual [4] whose series of requests to jdk7-dev in June I will paraphrase:

"Hi. I'm some random Java developer with admittedly little technical expertise [5] and no money. I read one blog entry written by Neal Gafter several years ago [6]; despite my lack of money and admitted lack of technical expertise, I think reading that single blog entry written by someone else should imbue me with enough authority to dictate [7] how other people should allocate their resources to work on the cool Java language changes I personally want to see."

I have exactly zero respect for this line of thinking and see no reason to tolerate it.

If someone says he doesn't know what he is talking about, I believe him. I also take the next logical step of not giving much credence to his conclusions and demands.

No one is stopping this fellow or any other interested person from taking a compiler course, downloading the OpenJDK sources to javac, reading the considerable programming language literature of generics, and working on an implementation of reified generics or some other language variant. (The careful reader will note that Microsoft's papers describing reified generics in CLR emphasize how fast they were able to make List<int> go and do not focus on the performance penalty paid by List<Integer> because of the extra level of indirection between an object and its dispatch table.)

On the subject of listening to developers ideas for changes, I posted some related thoughts to the coin list back in July [8]:

"Design by committee" is often derided as an inappropriate way to manage technical projects. Simple polling about technical issues is design by committee where the committee can be arbitrarily large and any pretense of expertise in the area is removed. Therefore, polling would be a poor way to drive specific technical decisions about the Java Programming Language. One of the benefits of working in a technical field is that technical problems often have right answers, regardless of how many people agree or disagree with them.

This is not intended to be a slight against Java programmers who contributed suggestions informed by their daily experiences with the language to the Coin alias, to Sun's bug database, or elsewhere. Rather, it is a recognition that, just as being a player and being a coach are district roles requiring distinct skills, using the language and evolving it and district tasks with related but separate skills sets. Polling can provide a good indication about what pain points people are experiencing; that is an important input, but only one kind of input, to how the language should change.

Most Java programmers do not need to be language experts. This is very much a feature and not a bug of the Java ecosystem. Not having to be a language expert means Java programmers have time to be experts about other things :-)

Moreover, the responsibilities of stewardship including preserving the conceptual integrity of the platform, which does not necessarily follow from point decisions.

I don't understand who is being accused of preventing or impeding use of, say, Scala. For its part, Sun has encouraged experimentation with the Java language changes and has funded work to improve the support of non-Java language on top of the JVM too. Lack of corporate sponsorship of particular other languages should certainly not be equated to impeding their use. Conversely, Java developers are in no way obliged to participate in Project Coin or OpenJDK activities. However, if the extent of a developer's interaction with those working on the language is leaving childish comments on blogs, don't expect to have much influence over the results.

[1] Project Coin: Small Language Change Proposal Form Available
[2] http://blogs.sun.com/darcy/entry/project_coin_final_five#comment-1252023525000
[3] http://blogs.sun.com/darcy/entry/project_coin_final_five#comments
[4] http://mail.openjdk.java.net/pipermail/jdk7-dev/2009-June/000666.html
[5] http://mail.openjdk.java.net/pipermail/jdk7-dev/2009-June/000704.html
[6] http://mail.openjdk.java.net/pipermail/jdk7-dev/2009-June/000675.html
[7] http://mail.openjdk.java.net/pipermail/jdk7-dev/2009-June/000686.html
[8] http://mail.openjdk.java.net/pipermail/coin-dev/2009-July/002120.html

(2009-09-14 16:33:35.0) Permalink Comments [14]

20090910 Thursday September 10, 2009

Unhashing a String

My working on the strings in switch implementation has gotten swapped back in. The implementation is following the translation strategy outlined in the strings in switch proposal: find a perfect hash function for the input strings and semantically replace case "Foo": with case hash("Foo"): (along with some additional checks and logic) where hash("Foo") is computed as a constant by the compiler.

With this approach, to write the regression tests it is helpful to be able to construct a string with a given hash value to force collisions and test the alternate logic, which led me to write the "unhash" method below to create a string with a given hash code:

/**
  * Returns a string with a hash equal to the argument.
  * @return string with a hash equal to the argument.
  */
 public static String unhash(int target) {
    StringBuilder answer = new StringBuilder();
    if (target < 0) {
        // String with hash of Integer.MIN_VALUE, 0x80000000
        answer.append("\u0915\u0009\u001e\u000c\u0002");

        if (target == Integer.MIN_VALUE)
            return answer.toString();
        // Find target without sign bit set
        target = target & Integer.MAX_VALUE;
    }
        
    unhash0(answer, target);
    return answer.toString();
}

private static void unhash0(StringBuilder partial, int target) {
    int div = target / 31;
    int rem = target % 31;

    if (div <= Character.MAX_VALUE) {
        if (div != 0)
            partial.append((char)div);
        partial.append((char)rem);
    } else {
        unhash0(partial, div);
        partial.append((char)rem);
    }
}

The algorithm for hashing strings multiplies the old hash value by 31 and adds in the integer value of the next character:

h = 0;
for (int i = 0; i < len; i++) {
    h = 31*h + val[off++];
}

The unhash method works in reverse; for the non-negative values handled by unhash0, divide the target value by 31:

  • if the quotient is less than Character.MAX_VALUE, the quotient and remainder can be fully captured using at most two characters.

  • if the quotient is greater than or equal to Character.MAX_VALUE, unhash the quotient and append to that string a character for the remainder.

With some additional care, negative values can reuse the same process. The key observation is that if a string hashes to Integer.MIN_VALUE (0x80000000), subsequent multiples by 31 (0x1f) do not change the result since
0x1f × 0x80000000 = 0xf80000000
which is again 0x80000000 when limited to int range. Therefore, the sign bit can be set and then the remaining bits handled as if the target were positive.

Using a few minutes of computer time, I tested the unhash method on all integer values and it always returned a correct string. When available, exhaustive testing is pleasantly simple and reassuring! The generated strings are relatively short; as shown in the table below, for non-negative values the average length is slightly over four characters.

Distribution of String Lengths of Unhashed Non-negative Values
Length Frequency
1 31
2 2,031,585
3 60,948,480
4 1,889,402,880
5 195,100,672
Total 2,147,483,648

The unhash of 0 could be special-cased to return the empty string of length zero rather than a length-one string of the \u0000 character, but this was not necessary for the purposes at hand. Likewise, generating somewhat shorter strings for negative hash values may be possible, but further investigation is not needed just to be able to generate collisions. As is stands, unhash will return a string whose length is at most ten; five characters for a negative sign bit and at most another five characters for the non-sign bits.

(2009-09-10 11:50:30.0) Permalink Comments [7]

20090904 Friday September 04, 2009

Project Coin: Solidarity with C++ Evolution

Recently I read with interest Bjarne Stroustrup's HOPL III paper Evolving a language in and for the real world: C++ 1991-2006. Despite the numerous technical differences between Java and C++, I was struck by some of the similarities in community involvement and expectations in the evolution of both languages. Selected excerpts from the paper are in block quotes below.

In particular, this very open process [in the C++ committee] is vulnerable to disruption by individuals whose technical or personal level of maturity doesn’t encourage them to understand or respect the views of others. Part of the consideration of a proposal is a process of education of the committee members. Some members have claimed — only partly in jest — that they attend to get that education.

The Project Coin mailing list is a world-readable and world-writable list. While this approach does let anyone join in, the traffic can be very high and at times the signal to noise ratio was quite low. In the future, I'll be inclined to impose temporary moderation on the list to quell unproductive email storms.

The answer to “Why didn’t we provide a much more useful library?” is simpler: We didn’t have the resources (time and people) to do significantly more than we did.


The most common reaction to these extensions among developers is “that was about time; why did it take you so long?” and “I want much more right now”. That’s understandable (I too want much more right now — I just know that I can’t get it), but such statements reflect a lack of understanding what an ISO committee is and can do.


As ever, there are far more proposals than the committee could handle or the language could absorb. As ever, even accepting all the good proposals is infeasible. As ever, there seems to be as many people claiming that the committee is spoiling the language by gratuitous complicated features as there are people who complain that the committee is killing the language by refusing to accept essential features. If you take away consistent overstatement of arguments, both sides have a fair degree of reason behind them. The balancing act facing the committee is distinctly nontrivial.

Viewed over the long term, one goal to evolving a platform is trying maximize value delivered over time. This is analogous to a net present value-style consideration from economics. A feature delivered in the future is less valuable than having the feature today, but the value of choosing to do a feature needs to be weighed against the opportunity costs of doing something else instead. Developers are chronically optimistic and eager to deliver something sooner rather than later, especially when the next release vehicle may be in the relatively distant future. As previously indicated, I too would prefer to see additional language changes as part of Project Coin in JDK 7. However, given the available resources, overcommitting to a large set of features is not responsible; either the large set won't get done in the end, it won't get done well, or the schedule would slip — all of which lead to reduced value too.

Much of the best standards work is invisible to the average programmer and appears quite esoteric and often boring when presented. The reason is that a lot of effort is expended in finding ways of expressing clearly and completely “what everyone already knows, but just happens not to be spelled out in the manual” and in resolving obscure issues that—at least in theory—don’t affect most programmers. The maintenance is mostly such “boring and esoteric” issues.

Some attendees of my JavaOne talk this year were not happy with the length of time spent relating complications with adding enum types in JDK 5. However, I included such a large section on the apparent simplicity of enums still leading to many surprising complexities to help convey the disproportionate efforts that adding even modest features to the language can take.

I fully expect to be surprised in the future with novel interactions and issues as experience is gained with the Project Coin features and prudent planning anticipates the need to deal with such surprises.

(2009-09-04 17:39:48.0) Permalink Comments [8]

20090903 Thursday September 03, 2009

Java Posse #277 Feedback: Not a view from an ivory tower

The entry below is a slightly edited copy of a message I used to start a new thread on the Java Posse's Google Group, largely in response to comments make by Dick Wall in the first twenty minutes of episode #277 of the Java Posse podcast.


After listening to episode 277, I'm led to conclude I'm thought of by some as one of the "ivory tower guys" who "just says no" to ideas about changing the Java programming language.

I have a rather different perspective.

In November 2006, Sun published javac and related code under the familiar GPLv2 with Classpath exception. [1]

Shortly thereafter in January 2007, no less a Java luminary than James Gosling endorsed the Kitchen Sink Language (KSL) project. [2] In James' words KSL is "A place where people could throw language features, no matter how absurd, just so that folks could play around" since he has "... never been real happy with debates about language features, I'd much rather implement them and try them out." [3]

KSL received no significant community response.

Later in 2007, after the remaining core components of the platform were published as open source software as part of OpenJDK during JavaOne, in November Kijaro was created. [4] Kijaro is similar in spirit to KSL, but does note require contributors to sign the Sun Contributor Agreement (SCA). Before Project Coin, Kijaro saw a modest number of features developed, fewer than ten, which is also not a particular vigorous community response given the millions of Java developers in the world.

The earliest posts on what would become Project Coin mentioned the utility of prototypes, the Project Coin proposal form included a section to provide a link to an optional prototype, and I repeated stated throughout Project Coin the helpfulness of providing a prototype along with a proposal.

Despite the availability of the OpenJDK sources for javac and the repeated suggestions to produce prototypes, only a handful of prototypes were developed for the 70 proposals sent into Project Coin.

Dick asked rhetorically during the podcast whether alternative projects exploring language changes were inevitable as the "only approach given strict control exercised over the JVM [by Sun]."

IMO, such approaches are inevitable only if Sun's repeated efforts to collaborate continue to be ignored.

Classes on compilers are a core component of many undergraduate compiler science curricula. All the Java sources in the JDK 7 langtools repository adds up to about 160,000 lines of code and javac itself is a bit over 70,000 lines currently. These are far from trivial code bases and some portions of them are quite tricky, but implementing certain changes isn't that hard. Really. Try it out.

Dick says toward the end of the opening segment "If people do want to do this stuff, right now they are being told they can't."

I certainly do not have the authority to tell others what they can and cannot do. Indeed, I have advocated producing prototypes of language changes as a much more productive outlet than whining and pouting that other people aren't busy implementing the language changes you want to little avail. Others have already noted in previous messages to this group the irony of Lombok using the annotation processing facility I added to javac in JDK 6 as an alternate way to explore language changes (together with an agent API to rewrite javac internal classes!) . However, way back before JDK 5 shipped in 2004, we at Sun recognized that annotation processors by themselves would be a possible way to implement certain kinds of de facto language changes. The apt tool and later javac were always designed to be general meta- programming frameworks not directly tied to annotations; for example, an annotation processor can process a type containing no annotations to, say, enforce a chosen extra-linguistic check based on the structure of the program. [Such as the naming convention checker shipped as a sample annotation processor in JDK 6.]

As an example of what can be done just using annotation processing, long-time annotation processing hacker Bruce Chapman implemented "multi-line strings" as part of his rapt project [5]; the value of the string is populated from a multi-line comment. After repeatedly outlining how it would be possible to do so on the annotation processing forum [6], I've gotten around to hacking up a little proof- of-concept annotation processor based implementation of Properties. [7] The user writes code like

public class TestProperty extends TestPropertyParent {
   protected TestProperty() {};

   @ProofOfConceptProperty
   protected int property1;

   @ProofOfConceptProperty(readOnly = false)
   protected long property2;

   @ProofOfConceptProperty
   protected double property3;

   public static TestProperty newInstance(int property1,
                      long property2,
                      double property3) {
       return new TestPropertyChild(property1, property2, property3);
   }
}

and the annotation processor generates the superclass and subclass to implement the desired semantics, including the getter and setter methods, etc. Using annotation processors in this way is a bit clunky compared to native language support, but if people want to experiment, the capabilities have been standardized as part of the platform since JDK 6.

It is technically possible to take the OpenJDK sources and construct a version of javac that accepts language extensions; after all, this is how we generally evolve the language and also how the JSR 308 changes were developed before going back into the JDK 7 code base. Additionally, the IcedTea project and the shipping of OpenJDK 6 in Linux distributions has provided an existence proof that folks other than Sun can take the entire OpenJDK code base, add various patches and additional components to it, and ship it as a product.

Given the OpenJDK sources Sun has published, subject to the license and trademark terms and conditions, anyone is free to implement and use language changes, as long as they assume the costs and responsibilities for doing so. Experimentation has long been encouraged and experiences from experiments using language changes on real code bases would certainly better inform language evolution decisions. Unfortunately, others have generally not done these experiments, or if the experiments have been done, the results have not be shared.

I also do not have the power to prevent others from using non-Java languages on the JVM or to force others to run anything on the JVM, nor would I want to exercise such powers even if I had them. Indeed, for some time Sun has endorsed having additional languages for the platform and the main beneficiary of John Rose's JSR 292 work will not be the Java language, but all the non-Java languages hosted on top of the JVM.

I do have the authority to speak on what Sun will and will not spend our resources on in relation to Project Coin, certainly a right any organization reserves to do with its resources.

If there are frustrations waiting for Java language changes, I assure you there are also frustrations working on Java language changes. For example, I find it frustrating (and self-inconsistent) that people state "I don't have technical expertise in this area" while simultaneously expecting their preferences to be selected without any contribution on their part. [8]

Finally, going back to a white paper from 1996, the design of Java quite intentionally said "No!" to various widely-used features from the C/C++ world including a preprocessor and multiple inheritance. Again since the beginning, Java admittedly borrowed features from many other established languages. [9] Given the vast number of potential ways to change the language that have been proposed, many language changes will continue to be called and few will continue to be chosen. In any endeavor there is a tension to balance stability and progress. For the Java language, given the vast numbers of programmers and existing code bases, we try to err on the side of being conservative (saying "No." by default) first to do no harm, but also to preserve the value of existing Java sources, class files, and programmer skills.

There are many other fine languages which run on the JVM platform and I expect the Java language to continue to adopt changes, big and small, informed both by direct experiments with prototypes and by experiences with other languages.

[1] http://blogs.sun.com/ahe/entry/javac_open_sourced
[2] https://ksl.dev.java.net
[3] http://blogs.sun.com/jag/entry/compiler_fun
[4] https://kijaro.dev.java.net
[5] https://rapt.dev.java.net; see also Bruce's https://hickory.dev.java.net
[6] http://forums.sun.com/forum.jspa?forumID=514
[7] http://blogs.sun.com/darcy/entry/properties_via_annotation_processing
[8] http://blogs.sun.com/darcy/entry/project_coin_final_five#comments
[9] http://java.sun.com/docs/white/langenv

(2009-09-03 15:04:18.0) Permalink Comments [43]

20090828 Friday August 28, 2009

Project Coin: The Final Five (Or So)

First, thanks to all those who submitted interesting proposals and thoughtful comments to Project Coin; a demonstrably vibrant community wants to evolve the Java programming language!

Without further ado, the final Project Coin changes accepted for inclusion in JDK 7 are:

The specification, implementation, and testing of these changes are not final and will continue to evolve as interactions are explored and issues cleared. Two of the listed items are combinations of multiple submitted proposals. The omnibus proposal for improved integer literals includes at least binary literals and underscores in numbers; a way to better handle unsigned literals is desired too. Language support for Collections covers collection literals and allows for developing indexing access for Lists and Maps, assuming the technical problems are resolved.

That leaves a few proposals that went through further consideration, but were not accepted on the final list:

Improved exception handling would be a fine change for the language, but it ventures near the type system and we do not estimate we have resources to fully develop the feature within JDK 7. I would like to see improved exception handling reconsidered in subsequent efforts to evolve the language. While the Elvis operator and related operators are helpful in Groovy, differences between Groovy and Java, such as the presence of primitive types and interactions with boxing/unboxing render the operators less useful in Java. JDK 7 will support other ways to ease the pain of null-handling, such as the null checking enabled by JSR 308. Aggregates natively supporting more than 32-bits worth of entries will no doubt be increasingly needed over time. Language support for collections may allow such facilities to be developed as libraries until the platform directly handles large data structures.

The coin-dev list will remain available for the continued refinement of the accepted changes. Further discussion of proposals not selected is off-topic for the list.

The majority of the implementation of the Project Coin changes should be in JDK 7's development Mercurial repositories by the end of October 2009.

In due course, we intend to file a JSR covering the Project Coin changes.

(2009-08-28 17:45:48.0) Permalink Comments [67]

20090819 Wednesday August 19, 2009

Generics and the Mandelbrot set

Elaborating on some slides from a JavaOne talk, the Mandelbrot set is defined recursively as the set of values in the complex plane C where the iterations zn+1 = zn2 + c remain bounded, giving rise to a familiar and complex shape.

The Mandelbrot Set

Determining whether a particular point is inside or outside the boundary of the Mandelbrot set can be difficult because of the fractal nature of the curve; however, good approximations are possible. First at a coarse level, if the absolute value of a point is greater than 1, it is definitely not part of the Mandelbrot set so all of the Mandelbrot set is contained within a circle of radius 2 centered at the origin. Second, there are two primary curves within the set:

  • A circle of radius ¼ centered at (-1, 0)

  • A heart-shaped cardiord whose boundary is c = eit/2 - (eit/2)2

The overall area of the Mandelbrot set is a bit over 1.5; the circle has area ≈0.1963, 13.0% of the total, and the cardiord has area ≈1.178, 78.1% of the total. Therefore, together the circle and cardiord contain over 90% of the area of the whole set and it is comparatively easy to determine if a point is inside or outside the union of the circle and the cardiord.

The Mandelbrot Set Approximated

Using generics in Java has some similarities to the Mandelbrot set. Generics can be recursive, such as in the f-bound in the declaration of java.lang.Enum: public abstract class Enum<E extends Enum<E>>..., and it can be trickly to determine if a use of generics is reasonable. Fortunately, another similarity is that there are two primary use-cases for generics that cover the vast majority of sensible scenarios:

  • Generic aggregates, List<T>, Set<T>, etc.

  • Type tokens, Class<T> (or even super type tokens)

Aggregates like subtypes of java.util.Collection are the heart of generics usage and using generic collections is usually straightforward; Effective Java's PECS mnemonic (producer-extends, consumer super) provides guidance for some of the trickier cases. The second most common use of generics is for type tokens, Class<T>, which embody type information both at compile-time and at runtime. For example, type tokens are used to retrieve annotations.

Be wary of other uses of generics in Java. Java's generics have significantly technical differences from templates in C++; Java generics are by design not a Turing-complete meta-language! Attempting to use Java generics to simulate features in another languages, like Haskell's pattern matching, is unlikely to lead to pleasant or idiomatic Java code. In a Java program, using pervasive type parameters to pass along other information throughout a program, such as to address code evolution issues, is also not a pleasant fit. A warning sign in API design is a Java class having more than two type parameters; this likely signals generics are being used in an awkward way.

(2009-08-19 09:30:00.0) Permalink

20090818 Tuesday August 18, 2009

Project Coin: Elephants, Knapsacks, and Language Features

Paraphrasing some thoughts already sent to the Project Coin list and discussed at JavaOne this year, there continues to be traffic on the list and elsewhere about the criteria for proposal selection (and non-selection) and those criteria are worth elaborating ahead of the final proposal list being determined in the near future.

First, a reminder from some earlier blog entries describing the context for Project Coin:

"Especially with the maturity of the Java platform, the onus is on the proposer to convince that a language change should go in; the onus is not to prove the change should stay out."
Criteria for desirable small language changes, December 23, 2008

"Given the rough timeline for JDK 7 and other on-going efforts to change the language, such as modules and annotations on types, only a limited number of small changes can be considered for JDK 7."
Guidance on measuring the size of a language change, December 11, 2008

With nearly 70 proposals submitted to the mailing list and the Sun bug database having well over 100 open "requests for enhancements" (rfe's) for the language, the large majority of those proposals and rfe's will not be included in JDK 7 as part of Project Coin or any other effort.

Project Coin will be limited to around 5 proposals total. That's it.

Therefore for Project Coin, in addition to determining whether a proposal to change the language is in and of itself appropriate, a determination also has to be made as to whether the change is more compelling than all but four or so other proposals. In economic terms, there an an opportunity cost in the proposal selection; that is, because of finite resources, choosing to have a particular proposal in the platform removes the opportunity to do other proposals. There will be good, compelling proposals that would improve the language not selected for Project Coin because there are a full set of better, more compelling proposals that are more useful to include instead. Having available prototypes for proposals, running the existing tests, and writing new tests can only better inform the continuing proposal evaluation and selection.

Part of evaluating proposals is judging their utility to the Java platform as a whole. In this way, I've long thought the Java platform is like the elephant in the parable about the blind men and the elephant:

Six Blind Dukes and an Elephant

While each Duke and each of us may know and understand our own usage of Java quite well and have valid ideas about how Java could be changed to improve programming in our own areas of interest (properties! operator overloading! design by contract! your favorite feature!), putting together all that accurate local information might just result in a patchwork elephant:

Patchwork Elephant

Rather than just a collection of local views, a broad perspective is needed to have an accurate, unified picture of the platform so Java can keep evolving and improving as a general-purpose programming language. This approach favors features with broader applicability. For example, a usability improvement to generics, like the diamond operator which allows type parameters to be elided during constructor calls, is usable more often than, say, one of the various proposals to allow extensible or abstract enum types, a change that would only helpful in a small minority of enum declarations.

Even with a broad perspective, there are complexities in feature selection because choosing a set of language proposals is a kind of knapsack problem. That is, each feature has some discrete size and complexity to implement and confers some improvement to the language. There is a bounded size and complexity budget and the goal is maximizing the value held in the knapsack, the value of the set of improvements shipped in a release. Of note is that implementing a language change has much more of a discrete size (or a small selection of possible sizes) rather than a continuous range of possible sizes. In other words, because of the coordinated set of deliverables associated with a language change, it may be reasonable to implement 0%, 50,% or 100% of a possible feature but no other fraction. And doing 50% of the feature might take on quarter of the effort of doing the whole thing or three fourths of that effort.

Even when precise costs and benefits can be quantified, because of these discrete sizes the "greedy" algorithm of putting the highest value / cost item in the knapsack first can lead to globally poor results. If nothing else, having a pre-pass to reduce the number of proposals being considered for further review greatly reduces the combinatorial possibilities of subsets of features that could be included in a release.

(2009-08-18 01:26:27.0) Permalink Comments [2]

20090817 Monday August 17, 2009

JDK Release Types and Compatibility Regions

There are three primary kinds of compatibility of concern when evolving the JDK, source, binary, and behavioral. These can be visualized as defining a three dimensional space:

Compatibility Axes

The farther away a point is from the origin, the more incompatible a change is; the origin itself represents perfect compatibility (no changes). A more nuanced diagram would separate positive and negative compatibility, that is distinguish between keeping things that work working versus keeping things that don't work not working, but in this article the diagrams will just represent the magnitude of allowable compatibility change.

In turn, there are three main kinds of JDK releases:

  • platform

  • maintenance

  • update

JDK 7 is a platform release since it is a new version of platform; there are many new APIs added and thousands of bug fixes and enhancements. The JDK 6 update releases are representative of update releases; the same platform specification is implemented, Java SE 6 in this case, and there are typically dozens to a few hundred bug fixes and enhancements in a release (6 update train release notes). Like update releases, maintenance releases implement the same base specification as a previous platform release, such as JDK 1.4.1 and JDK 1.4.2 both being additional implementations of J2SE 1.4, but they have more bug fixes than an update release, on the order of one thousand to two thousand changes (JDK 1.4.2 release notes). While maintenance releases have not been formally issued since JDK 1.4.2, the changes in 6u10 were more on par with a maintenance release rather than a regular update release.

The general evolution policy for APIs in the JDK is:

  1. Don't break binary compatibility (as defined in the Java Language Specification)

  2. Avoid introducing source incompatibilities

  3. Manage behavioral compatibility change

While these policies hold for all three kinds of releases, the allowable compatibility regions differ for kind of release. For update and maintenance releases, the reference point to measure compatibility against is an earlier implementation of the same platform specification, such as the initial reference implementation of the platform or an earlier update release.

Maintenance and Update Release Compatibility

Since binary incompatible changes are not allowed, the acceptable compatibility region for update and maintenance releases is confined to the (Behavioral × Source) plane, with more latitude on the behavioral axis. For update releases, a limited amount of behavioral change is acceptable, where behavioral change is broadly considered to be any observable aspect of the platform. While programs should only rely on specified interfaces, they can often accidentally rely on implementation details of the release's behavior so update releases limit the overall change in behavior. Some minor changes affecting source compatibly can occur in an update release, for example, the version of an endorsed standard or standalone technology included in the release can be upgraded. The JAX-WS component was upgraded from 2.0 to 2.1 in 6u4 (6535162). Such upgrades should generally preserve the meaning of existing programs that compile and possibly allow new programs to compile. The main compatibility effect should be that the negative compatibility region may get smaller; programs that "don't work" or "don't compile" can become programs that "work" or "compile." Maintenance release are generally similar, but more behavioral change is allowed and expected since there are a greater number of bug fixes and enhancements.

The compatibility reference point for a platform release is an implementation of the previous platform specification. Compared to the previous platform specification, a platform release can add APIs and language feature that impact source compatibility (new keywords, etc.) and the implementation can have many changes in behavior (such as changing the iteration order of HashMap). In exceptional circumstances, there is the possibility of a sliver of binary incompatibility, such as to address a security issues in a rarely-used corner of the platform, but the central policy of preserving binary compatibility holds for platform releases as well.

Platform Release Compatibility

Comparing one build of an in-progress platform release to another, there may be large changes in binary compatibility before a new API is finalized. As a matter of policy, certain kinds of source incompatibility will not be introduced into the platform anymore. For example, the Java language in JDK 7 will have no new keywords that invalidate existing sources; instead of a full new keyword, JSR 294 is making "module" a restricted keyword whose use as a keyword versus an identifier will be disambiguated by the compiler.

Previously, there was a sharp jump down in the behavioral change allowed in a platform release compared to the first update of that new platform. A more helpful policy may be to allow greater behavioral change to new features in the first few updates of a new platform so that the implementation can be improved before widespread adoption justifies greater caution in managing behavioral change.

(2009-08-17 13:16:04.0) Permalink Comments [2]

20090813 Thursday August 13, 2009

JDK 7: New Component Delivery Model in the Works

The JDK includes many logically distinct sets of APIs. Some of the APIs naturally live in the JDK and evolve at the pace of the JDK; other APIs are effectively maintained externally, but are also shipped as part of the public API provided by the JDK. Two APIs in the latter camp are jaxp and jax-ws, both of which natively live in the GlassFish project.

Currently, those components are maintained under separate version control as part of OpenJDK in the jaxp and jax-ws repositories, respectively. Code in these components is periodically synced with changes from the upstream masters, with some nontrivial overhead.

To reduce the overhead of updating the components and thereby make it possible to update them more frequently, we're in the process of changing the delivery model of these externally maintained components into the JDK. First for JDK 7 and later for OpenJDK 6, instead of tracking these code bases under independent version control in the JDK, the JDK build will logically get the source for those components from a source bundle. The upstream teams will be responsible for providing source bundles and the JDK build will be configurable to use a particular source bundle.

Kelly has been working in implementing this new model for jaxp in the JDK 7 build, including working out the detailed logistics with the upstream teams. An initial version of the change is out for review.

This new approach is the same basic model the IcedTea project uses to provide changes and functionality on top of OpenJDK so there is lots of evidence large code bases can be handled using this model.

(2009-08-13 09:00:00.0) Permalink Comments [2]

20090811 Tuesday August 11, 2009

JDK 7 Build Prepped for Language Changes

After build hacking by Jon and Kelly, the JDK 7 build now uses -source 7 and -target 7 to compile the Java code, meaning the build is prepped allow use of new language features as they become available and to take advantage of version 51.0 class file features (6854244, 6827026).

(For bootstrapping purposes, the langtools repository housing javac and friends will remain buildable with JDK 6.)

(2009-08-11 17:29:00.0) Permalink Comments [1]

Calendar

« February 2010
SunMonTueWedThuFriSat
 
4
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
      
       
Today

RSS Feeds

XML
All
/Annotation Processing
/General
/Java
/JavaOne
/Numerics
/OpenJDK

Search

Links

    Blogroll
  • Download the JRE

    News

Navigation



Referers

Today's Page Hits: 66