The Analytic Tradition
On the design and specification of Java

Friday Jul 31, 2009
Versioning in the Java platform

The best-versioned artifact in the Java world today is the ClassFile structure. Two numbers that evolve with the Java platform (as documented in the draft Java VM Specification, Third Edition) are found in every .class file, governing its content. But what determines the version of a particular .class file, and how is the version really used? The answer turns out to be tricky because there are many interesting versionable artifacts in the Java platform.
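As a rough illustration of where those two numbers live, here is a minimal sketch that reads them from the first eight bytes of a class file: a 4-byte magic number, then minor_version and major_version as unsigned 16-bit values. The class and method names (ClassFileVersion, describe) are made up for this example and are not part of any JDK API.

```java
import java.nio.ByteBuffer;

class ClassFileVersion {
    /** Returns "major.minor" (for example "49.0") from the header of a class file. */
    static String describe(byte[] classFileBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classFileBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = buf.getShort() & 0xFFFF; // minor_version is stored first
        int major = buf.getShort() & 0xFFFF;
        return major + "." + minor;
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49};
        System.out.println(describe(header)); // prints 49.0
    }
}
```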

The source language is the most obvious. A compiler doesn't have to accept multiple versions of a source language, though javac does, via the -source flag. (-source works on a global basis; it is also conceivable to work on a local basis, accepting different versions of the source language for different compilation units.) Less obvious versioned artifacts are hidden in plain sight: character sets and compilation strategies. And .class files themselves sometimes have their versions used in surprising ways. Let's see how javac handles all these versions, and make some claims about how an "ideal" compiler might work.

In the remainder, X and Y are versions. "source language X" means "version X of a source language". "Java X" means "version X of Java SE". "javac X" means "the javac that ships in version X of the JDK".

Character set

Happily, the Java platform has used the Unicode character set from day one. Unhappily, when javac for source language X is configured to accept an earlier source language Y, it uses the Unicode version specified for source language X rather than Y. For example, javac 1.4 -source 1.3 uses Unicode 3.0, since that was the Unicode specified for Java 1.4. It should use Unicode 2.1 as specified for Java 1.3.

Claim: A compiler configured to accept source language X should use the Unicode version specified for source language X.

It is difficult for javac to use multiple Unicode versions since the standard library (notably java.lang.Character) effectively controls the version of Unicode available, and only one version of the standard library is usually available. We will return to the issue of multiple standard libraries later.
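To make the dependency concrete, here is a small sketch of an identifier check built on java.lang.Character. Its answers depend entirely on the Unicode tables of the library it runs against, which is exactly the coupling described above. IdentifierCheck is a hypothetical helper, not javac's scanner, and it ignores keywords.

```java
class IdentifierCheck {
    /** True if s has the shape of a Java identifier under the running library's Unicode tables. */
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x1")); // prints true
        System.out.println(isIdentifier("1x")); // prints false
    }
}
```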

Sidebar: You may be surprised to discover that some other languages don't use Unicode by default. A factoid from 2008's JVM Language Summit was the existence of a performance bottleneck in converting 8-bit ASCII strings (used by dynamic languages' libraries) to and from UTF-16 strings (used by canonical JVM libraries). Who knows what the 2009 JVM Language Summit will reveal?

Compilation strategy

A compilation strategy is the translation of source language constructs to idiomatic bytecode, flags, and attributes in a ClassFile. As the Java platform evolves by changing the source language and ClassFile features, a compilation strategy can evolve too. For example, javac 1.4 may compile an inner class one way when accepting the Java 1.3 source language and another way when accepting the Java 1.4 source language.

Claim: A compiler may use a different compilation strategy for each source language.

The javac flag '-target' selects the compilation strategy associated with a particular source language. This mainly has the effect of setting the version of the emitted ClassFile: 46.0 for Java 1.2, 47.0 for Java 1.3, 48.0 for Java 1.4, 49.0 for Java 1.5, 50.0 for Java 1.6. For example, javac 1.4 could compile an inner class the same way when configured with a Java 1.3 target versus a Java 1.4 target *, but emit 47.0 and 48.0 ClassFiles respectively:

javac 1.4 -source 1.3 -target 1.3 -> 47.0

javac 1.4 -source 1.3 -target 1.4 -> 48.0

* It doesn't, as per Neal's comment, but suppose for sake of argument it does.

However, ClassFile version should be orthogonal to compilation strategy. For example, javac 1.4 could conceivably compile an inner class to a 48.0 ClassFile in two ways, one when configured to accept the Java 1.3 source language and another when configured to accept the Java 1.4 source language:

javac 1.4 -source 1.3 -target 1.4 -> 48.0
javac 1.4 -source 1.4 -target 1.4 -> 48.0

You would have to inspect the ClassFiles carefully to see the difference, since their versions wouldn't - don't - reveal the compilation strategy. Of course, the ClassFile version "dominates" a compilation strategy, since a strategy can only use artifacts legal in a given ClassFile version, even though the concepts are different. Joe has written more about the history of -source and -target.

The combination missing above is:

javac 1.4 -source 1.4 -target 1.3 -> 47.0

or, given that the target could refer strictly to compilation strategy and not ClassFile version:

javac 1.4 -source 1.4 -target 1.3 -> 48.0

javac does not accept a target (or compilation strategy) lower than the source language it is configured to accept. Each new version of the source language is generally accompanied by a new ClassFile version that allows the ClassFile to give meaning to new bytecode instructions, flags, and attributes. Encoding new source language constructs in older ClassFile versions is likely to be difficult. How would javac encode annotations from the Java 1.5 source language without the Runtime[In]Visible[Parameter]Annotations attributes that appeared in the 49.0 ClassFile?

Claim: A compiler configured to accept source language X should not support a compilation strategy corresponding to a source language lower than X.
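The claim can be modelled in a few lines. The sketch below assumes only the release-to-ClassFile mapping listed earlier (46.0 for Java 1.2 through 50.0 for Java 1.6); SourceTargetPolicy and its methods are illustrative names and do not reproduce javac's actual option handling.

```java
import java.util.List;

class SourceTargetPolicy {
    // Releases in order; the release at index i uses ClassFile major version 46 + i.
    static final List<String> RELEASES = List.of("1.2", "1.3", "1.4", "1.5", "1.6");

    static int classFileMajor(String release) {
        int i = RELEASES.indexOf(release);
        if (i < 0) {
            throw new IllegalArgumentException("unknown release: " + release);
        }
        return 46 + i;
    }

    /** The claim: a target lower than the configured source language is rejected. */
    static boolean accepts(String source, String target) {
        return classFileMajor(target) >= classFileMajor(source);
    }

    public static void main(String[] args) {
        System.out.println(accepts("1.3", "1.4")); // prints true
        System.out.println(accepts("1.5", "1.4")); // prints false
    }
}
```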

This policy can be rather restrictive. There were no changes ** between the Java 1.5 and 1.6 source languages, and only minor changes in the 49.0 and 50.0 ClassFiles that accompany those languages (really, platforms). Nevertheless, javac 1.6 does not accept -source 1.6 -target 1.5.

** Except for a minor change in the definition of @Override to do what we meant, not what we said. Unfortunately, the definition changed in javac 1.6 but not in the JDK6 javadoc. Happily, javac 1.7 and the JDK7 javadoc are consistent.

The famous example of the restriction is that javac 1.5 does not accept -source 1.5 -target 1.4, so source code using generics cannot be compiled for pre-Java 1.5 VMs even though the generics are erased. This is partly because the compilation strategy for class literals changed between Java 1.4 and 1.5, to use the upgraded ldc instruction in the 49.0 ClassFile rather than call Class.forName. If javac's compilation strategy were more configurable, it would be conceivable to produce a 48.0 ClassFile from generic source code. There is, however, another reason why -source 1.5 -target 1.4 is disallowed ... read on.
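For readers who have not seen the two class-literal strategies side by side, here is an approximation written as ordinary source code. The helper names (lookup, legacyStyle, modernStyle) are invented for the sketch; the synthetic members javac actually generated for pre-1.5 targets are named differently.

```java
class ClassLiteralStrategies {
    // Roughly the pre-1.5 strategy for "String.class": a cached lookup through Class.forName.
    private static Class<?> cachedStringClass;

    static Class<?> lookup(String name) {
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new NoClassDefFoundError(e.getMessage());
        }
    }

    static Class<?> legacyStyle() {
        if (cachedStringClass == null) {
            cachedStringClass = lookup("java.lang.String");
        }
        return cachedStringClass;
    }

    // In a 49.0 or later ClassFile, the literal becomes a single ldc of a class constant.
    static Class<?> modernStyle() {
        return String.class;
    }

    public static void main(String[] args) {
        System.out.println(legacyStyle() == modernStyle()); // prints true
    }
}
```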

Environment

Prior to JDK7, if javac for source language X was configured to accept an earlier source language Y, it used the ClassFile definition associated with source language X. For example, if javac 1.5 -source 1.2 reads a 46.0 ClassFile, it treats the ClassFile as a 49.0 ClassFile. This is unfortunate because user-defined attributes in the 46.0 ClassFile could share the names of attributes defined in the 49.0 ClassFile spec, and interpreting them as authentic 49.0 attributes is unlikely to succeed.

Even if javac 1.5 -source 1.2 reads a 49.0 ClassFile, there is little point in reading 49.0-defined attributes since they had no semantics in the Java 1.2 platform. This holds for non-attribute artifacts such as bridge methods too; if physically present in a 49.0 ClassFile, they should be logically invisible from a Java 1.2 point of view. In summary:

javac 1.5 -source 1.2 reading a Java 1.5 ClassFile -> should interpret as Java 1.2
javac 1.5 -source 1.5 reading a Java 1.2 ClassFile -> should interpret as Java 1.2

Claim: A compiler configured to accept source language X should interpret a ClassFile read during compilation as if the ClassFile's version is the smaller of a) the ClassFile version associated with source language X, and b) the actual ClassFile version.

In JDK7, javac behaves as per the claim. First, it interprets a ClassFile according to the ClassFile's actual version, regardless of the configured source language. For example, a 46.0 ClassFile is interpreted as it would have been in Java 1.2, ignoring attributes corresponding to a newer source language. Second, when the configured source language is older than a ClassFile, javac ignores ClassFile features newer than the source language it is configured to accept.
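A sketch of the "smaller of" rule follows. The attribute table is a small assumed subset chosen for illustration, and ClassFileView is a hypothetical helper rather than anything inside javac.

```java
import java.util.Map;

class ClassFileView {
    // Major version in which a few attributes were first defined (illustrative subset).
    static final Map<String, Integer> INTRODUCED_IN = Map.of(
            "Code", 45,
            "Signature", 49,
            "RuntimeVisibleAnnotations", 49,
            "StackMapTable", 50);

    /** The version under which a ClassFile should be interpreted during compilation. */
    static int effectiveMajor(int majorForSourceLanguage, int actualMajor) {
        return Math.min(majorForSourceLanguage, actualMajor);
    }

    static boolean isVisible(String attribute, int majorForSourceLanguage, int actualMajor) {
        Integer introduced = INTRODUCED_IN.get(attribute);
        // Names missing from the table are treated as user-defined attributes and ignored.
        return introduced != null
                && introduced <= effectiveMajor(majorForSourceLanguage, actualMajor);
    }

    public static void main(String[] args) {
        // -source 1.2 (46.0) reading a 49.0 ClassFile: Signature should be invisible.
        System.out.println(isVisible("Signature", 46, 49)); // prints false
    }
}
```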

An important part of a compiler's environment is the standard library it is configured to use. The standard library used by javac can be configured by setting the bootclasspath. In future, a module system shipped with the JDK will allow a dependency on a particular standard library to be expressed directly.

Note that running against standard library X is deeply different from compiling against standard library X. Consider the Unicode issue raised earlier: javac implicitly uses the java.lang.Character from the standard library against which it runs, but should use the class in the standard library for the configured source language. For example, javac 1.6 -source 1.2 should use the Unicode in effect for Java 1.2, not Java 1.6. In this case, suitable versioning can only be achieved at the application level, by javac either reflecting over the appropriate java.lang.Character class or using overloaded java.lang.Character.isJavaIdentifierStart/Part methods that each take a version parameter.

Things also get tricky when compiling an older source language to a newer target ClassFile version (and hence a later JVM with a newer standard library). For example, should javac 1.6 -source 1.2 -target 1.5 compile against the Java 1.2 or 1.5 standard library? Both answers have merit, which suggests further concepts are needed to disambiguate.

Using the right libraries matters at runtime too. The introduction of a source language feature in Java 1.5 - enums - added constraints on the standard library against which ClassFiles produced from the Java 1.5 source language can run. The java.lang.Enum class must be present, and you can read the code of ObjectInputStream and ObjectOutputStream to see for yourself the mechanism for serializing enum constants. The simple way to guarantee that a suitable standard library is available for enum-using code at runtime is to ensure that only 49.0 ClassFiles are produced from the Java 1.5 source language. Such ClassFiles will not run on a Java 1.4 VM since it only accepts <=48.0 ClassFiles.

In a nutshell, the compilation strategy for enums is erasure++: an enum type compiles to an ordinary ClassFile with ordinary static members for the enum constants and ordinary static methods to list and compare constants. With a few changes in that strategy (to not extend java.lang.Enum) and a serious amount of magic in the Java 1.5 VM (to track reflection and serialization of objects of enum type), the ClassFiles emitted by a compiler for the Java 1.5 source language could run safely enough on a Java 1.4 VM. But the drawbacks to such hackery are enormous, so erasure++ it was.
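To give a feel for the "ordinary ClassFile with ordinary static members" idea, here is a hand-written class in the style of the variant that does not extend java.lang.Enum. It is only an approximation: javac's real output for an enum does extend java.lang.Enum, which source code cannot do directly, and the Color type is invented for the example.

```java
final class Color implements Comparable<Color> {
    // Ordinary static members stand in for the enum constants.
    static final Color RED = new Color("RED", 0);
    static final Color GREEN = new Color("GREEN", 1);
    static final Color BLUE = new Color("BLUE", 2);
    private static final Color[] VALUES = {RED, GREEN, BLUE};

    private final String name;
    private final int ordinal;

    private Color(String name, int ordinal) {
        this.name = name;
        this.ordinal = ordinal;
    }

    // Ordinary static methods to list and look up constants.
    static Color[] values() {
        return VALUES.clone();
    }

    static Color valueOf(String name) {
        for (Color c : VALUES) {
            if (c.name.equals(name)) {
                return c;
            }
        }
        throw new IllegalArgumentException("No constant " + name);
    }

    @Override
    public int compareTo(Color other) {
        return Integer.compare(ordinal, other.ordinal);
    }

    @Override
    public String toString() {
        return name;
    }
}
```

What this sketch cannot give you is safe serialization or reflection of the constants, which is where the VM magic mentioned above would have been needed.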

Thus, the reason one new language feature implemented by erasure - generics - cannot run on earlier JVMs is that another new language feature - enums - is implemented by erasure. Such is life at the foundation of the Java platform.

Thanks to "Mr javac" Jon Gibbons for feedback on this entry.

Posted at 10:27AM Jul 31, 2009 by Alexander Buckley in Java  |  Comments[3]

Tuesday May 12, 2009
Draft of the Java VM Specification, Third Edition

The Second Edition of the Java Virtual Machine Specification was published in 1999 and describes the Java SE platform circa JDK 1.2. Since then, numerous JSRs have updated the content, notably JSR 14 (generics) in Java SE 5.0 and JSR 202 (typechecking verification) in Java SE 6. Some of these updates are on the maintenance page for the Second Edition. However, no single document has been available that incorporated all these updates plus the smaller corrections and improvements that are made from time to time.

Certain JCP procedures are required to produce an official Third Edition of the Java Virtual Machine Specification. In the meantime, I am making available a draft of the Third Edition (ZIP, 1.9MB) to let the Java community observe the changing structure of the specification. There is an ongoing effort to identify and remove a) references to the Java Language Specification and b) assumptions about the compilation process that produced a ClassFile.

To emphasize the informal nature of the draft, I am not providing a change log or anything else that could be construed as starting a formal review. Nor are potential updates from JSR 292 and JSR 294 included; the draft pertains solely to Java SE 6 as defined by JSR 270 in 2006.

Posted at 06:43PM May 12, 2009 by Alexander Buckley in Java  | 

Sunday Feb 08, 2009
FOSDEM 2009

Back to Belgium for my first time at FOSDEM, where I presented on progress Towards a Universal VM in the Free Java track. (An updated form of my Devoxx talk with Brian Goetz.) It was good to meet Andrew Haley at last, chat with Martin Odersky about Scala and modularity, and of course catch up with Dalibor.

Posted at 06:06AM Feb 08, 2009 by Alexander Buckley in Java  |  Comments[2]

Monday Dec 22, 2008
Devoxx 2008

Devoxx is always a pleasure to attend for the energy and enthusiasm that 3000+ Java developers bring from all over Europe. I made two presentations this year: one on modularity in Java that dovetailed with Mark Reinhold's keynote, and one with Brian Goetz on the JVM's progress towards being a "universal" VM for all programming languages. Thanks to the many hundreds who attended and gave warm feedback.

Bear in mind that these presentations are targeted at a broad developer base, and that we cannot address every last detail of a topic in 45-50 minutes. Just because something is missing from the slides doesn't mean we don't care about it, or that it's not important, or that it didn't come up in Q&A or a BOF.

In other news, Stephen Colebourne continued his fine tradition of taking the pulse of the community regarding Java language changes. By asking people to rank features globally, we gained crucial information over a local yes/no vote on each feature. With yes/no voting, imagine if 86 people vote yes for properties and 53 vote no, while 51 people vote yes for multi-line strings and 9 vote no. All you really learn is that the properties "community" is more divided than the multi-line strings "community". This saps authority from the larger number of yes votes for properties. Plus, yes/no voting allows different communities to talk past each other forever, ignoring the fact that Java language designers can consider the needs of only one community: everybody. So while there is debate about which features to include in the rankings, overall I was very pleased with how informative and decisive the community's rankings were.

Posted at 12:52PM Dec 22, 2008 by Alexander Buckley in Java  |  Comments[2]

Friday Sep 05, 2008
Named parameters

In a method call, it can be convenient to label the actual parameters according to the method's formal parameter names. For example, the method void m(int x, int y) {} could be called as m(x:4, y:5). This is especially worthwhile in two cases:

Named parameters raise some interesting questions:

If parameter order is fixed, then names could potentially be omitted. Consider the method:

void m(int x, int y, int z) {}

It's not difficult to match actual parameters to formal parameters, even without names:

m(x:1, y:2, 3) or
m(1, 2, z:3)

But if parameter order is variable, then omitting names is a disaster in the making. The call:

m(z:3, 1, 2)

makes you work to realize that x binds to 1 and y to 2. In the worst case, you destroy the convenience of reordering because you must match all actual parameters to formal parameters to understand the bindings, as in:

m(y:2, 1, x:3)

So, allowing reordering is the crucial question. Some will say that the whole point of named parameters is to aid readability of the caller in the context of an obtuse or verbose callee, and that reordering can improve readability. Others will disagree. Personally, I think the dissonance is greater when some parameters are named and some are not, than when all parameters are named but given out of order. Therefore, I would like to allow reordering and disallow omission.
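Since Java has no named parameters, the preferred rule (reordering allowed, omission disallowed) can only be simulated. The sketch below binds a call's named actuals to a method's formal parameter order; NamedArgs and bind are hypothetical names, and formal parameter names are assumed to be distinct.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NamedArgs {
    /** Puts named actuals into formal-parameter order; every formal must be named. */
    static Object[] bind(List<String> formals, Map<String, Object> actuals) {
        if (actuals.size() != formals.size() || !actuals.keySet().containsAll(formals)) {
            throw new IllegalArgumentException("call must name exactly the formals " + formals);
        }
        Object[] ordered = new Object[formals.size()];
        for (int i = 0; i < ordered.length; i++) {
            ordered[i] = actuals.get(formals.get(i));
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Simulates m(z:3, x:1, y:2) against void m(int x, int y, int z).
        Map<String, Object> call = new LinkedHashMap<>();
        call.put("z", 3);
        call.put("x", 1);
        call.put("y", 2);
        System.out.println(java.util.Arrays.toString(bind(List.of("x", "y", "z"), call))); // prints [1, 2, 3]
    }
}
```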

Reordering actually has a profitable interaction with variable-arity methods. Consider:

void m(int x, int... y) {}

It would be nice to call it as:

m(x:1, y:2, y:3) or
m(x:1, y:{2,3})

Allowing reordering would allow the vararg parameter to come first:

m(y:2, y:3, x:1) or
m(y:{2,3}, x:1)

or even allow the vararg parameter to be distributed:

m(y:2, x:1, y:3)

This undoes the tradition of variable-arity parameters coming last, but then, they're only last because in any other position and without names, you can't differentiate a variable-arity actual parameter from a fixed-arity actual parameter. Named parameters with reordering make the last-position requirement unnecessary, and also the requirement for only one vararg per method. It would be quite reasonable to declare:

void m(int... x, int... y, int z) {}

and call:

m(x:1, y:100, x:2, y:200, x:3, y:300, z:1000)

if there is a natural association of x values with y values.
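One way to picture the distributed call is as a grouping step performed on the caller's name/value pairs before the method body runs. This is only a simulation in today's Java, with invented names (DistributedVarargs, group):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DistributedVarargs {
    /** Groups values by parameter name, keeping the order in which each name's values appear. */
    static Map<String, List<Integer>> group(String[] names, int[] values) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            grouped.computeIfAbsent(names[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Simulates m(x:1, y:100, x:2, y:200, x:3, y:300, z:1000).
        String[] names = {"x", "y", "x", "y", "x", "y", "z"};
        int[] values = {1, 100, 2, 200, 3, 300, 1000};
        System.out.println(group(names, values)); // prints {x=[1, 2, 3], y=[100, 200, 300], z=[1000]}
    }
}
```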

Clearly, reordering is a powerful concept. That usually means complexity. Let's look at how method call works, and how named parameters with or without reordering would affect it.

Into the heart of darkness: Overload resolution

Method resolution is the process of matching the method name and set of actual parameters supplied in a call to a method declaration in the receiver's class. If successful, the method call is resolved. Because the Java language and virtual machine support method overloading, a Java compiler uses the actual parameter types of a method call to select a single method declaration (obviously with the right name) with the "best" matching formal parameter types. This process is called overload resolution. It is rather complex because overloading is tricky in principle and because methods can be generic, of variable arity, and have formal parameters which require boxing/unboxing conversion of the actual parameters. On the bright side, the complexity is only at compile-time, because at run-time, the JVM's invokevirtual instruction simply calls a method with the exact name and formal parameter types chosen by the compiler. The important point is that the language and VM both use formal parameter types to resolve a method call. Static method resolution (i.e. overload resolution) and dynamic method resolution (i.e. invokevirtual's lookup) are aligned.
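A tiny example shows that the choice among overloads is made from static types at compile time; the run-time type of the argument plays no part in it. The class name Overloads is arbitrary.

```java
class Overloads {
    static String m(Object o) {
        return "m(Object)";
    }

    static String m(String s) {
        return "m(String)";
    }

    public static void main(String[] args) {
        Object o = "hi";             // static type Object, run-time type String
        System.out.println(m(o));    // prints m(Object)
        System.out.println(m("hi")); // prints m(String)
    }
}
```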

To involve parameter names as well as parameter types, there are two fundamental approaches. One is to use names as a simple static sanity check and leave static and dynamic resolution untouched. The other is to aggressively thread names through static and dynamic resolution. Let's compare the approaches.

Conservative

The conservative approach is to leave overload resolution unchanged, and add a final step to check that the name of each actual parameter matches the name of the corresponding formal parameter in the resolved method. Formal parameter names are not stored in classfiles today, but there is an RFE to make them available at runtime, and the first step would be to reify them in the classfile. Let's assume that's been done, and that a Java compiler can see the formal parameter names of any method declaration. Of course, if the resolved method is in a legacy classfile without formal parameter names, then the name-matching step must be skipped, but that's OK because it was only a sanity check anyway. The logic of this approach is simple, at the cost of not allowing reordering of actual parameters.
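The final name-matching step might look something like this sketch. It assumes the compiler can obtain the resolved method's formal names as a list, with null standing in for a legacy classfile that records none; NameSanityCheck is a hypothetical helper.

```java
import java.util.List;

class NameSanityCheck {
    /**
     * Runs after ordinary overload resolution has picked a method.
     * formalNames is null when the resolved method comes from a legacy
     * classfile without parameter names, in which case the check is skipped.
     */
    static boolean namesMatch(List<String> formalNames, List<String> actualNames) {
        if (formalNames == null) {
            return true;
        }
        // Positions must agree, so reordering is rejected.
        return formalNames.equals(actualNames);
    }

    public static void main(String[] args) {
        System.out.println(namesMatch(List.of("x", "y"), List.of("x", "y"))); // prints true
        System.out.println(namesMatch(List.of("x", "y"), List.of("y", "x"))); // prints false
    }
}
```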

Aggressive

The aggressive approach is to alter overload resolution when it identifies the set of potentially applicable methods. The set would consist of precisely those methods whose formal parameter names match those at the caller, up to ordering and varargs (see below). There would never be a compile-time error about a call using an actual parameter name which is not a formal parameter name of the resolved method, because the set of potentially applicable methods is correct by construction. Which potentially applicable method is resolved is up to the actual and formal parameter types as usual.

This approach is distinctly unfriendly to migration compatibility, because compiling against a legacy classfile, without formal parameter names, will mean there are no potentially applicable methods for the call. Dropping back to traditional overload resolution based on classfile version is ugly, and having to ignore actual parameter names when they were intended to play a central role in resolution is repugnant.

Perhaps being aggressive is worthwhile if it allows richer overloadings than today, based on names as well as types? Currently, overloadings "erase" formal parameter names and are legal up to formal parameter types:

void m(Object x, String y) {} // m(Object,String)
void m(String x, Object y) {} // m(String,Object)

The following is illegal because both signatures "erase" to m(Object,String) (they are override-equivalent in JLS terms) so a call which does not use named parameters cannot differentiate between them:

void m(Object x, String y) {} // m(Object,String)
void m(Object y, String x) {} // m(Object,String)

You might say this is a shame, and that override-equivalence should take formal parameter names into account, because a call using named parameters can differentiate:

m(x:new Object(), y:"hi") // resolves to m(Object x, String y)
m(x:"hi", y:new Object()) // resolves to m(Object y, String x)

However, migration compatibility dictates that we can't assume all calls will use named parameters, and that it is unreasonable for aggressive overloadings which assume such callers to break other callers. Therefore, the overloading must stay illegal and we may as well keep static method resolution based on types. This is just as well, because some legal type-based overloadings cause problems for the aggressive name-based approach. Consider these methods:

void m(Object x, String y) {} // m(Object,String)
void m(String y, Object x) {} // m(String,Object)

and this call:

m(x:new Object(), y:"hi")

Its set of potentially applicable methods contains duplicates:

void m(Object x, String y) {} // m(Object,String)
void m(Object x, String y) {} // m(String,Object) after shuffling formal parameters to match the actual parameter ordering

The call is thus ambiguous. It makes no sense to aggressively change overload resolution to use names when doing so inherently rules out the use of names to call some legacy methods.
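The shuffling step can be simulated to see the collision directly. In this sketch each method is modelled as a map from formal parameter name to type name, which is an assumption of the example and not how a compiler represents signatures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AggressiveAmbiguity {
    /** Returns a method's formal types, shuffled into the order in which the call names them. */
    static List<String> typesInCallOrder(Map<String, String> formalTypesByName, List<String> callNames) {
        List<String> types = new ArrayList<>();
        for (String name : callNames) {
            types.add(formalTypesByName.get(name));
        }
        return types;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();  // void m(Object x, String y)
        first.put("x", "Object");
        first.put("y", "String");
        Map<String, String> second = new LinkedHashMap<>(); // void m(String y, Object x)
        second.put("y", "String");
        second.put("x", "Object");

        List<String> call = List.of("x", "y");              // m(x:new Object(), y:"hi")
        // Both candidates present the same types once shuffled, so types cannot choose between them.
        System.out.println(typesInCallOrder(first, call).equals(typesInCallOrder(second, call))); // prints true
    }
}
```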

Since we say that no two methods can have formal parameters with the same types and different names, dynamic method resolution need not change. This is convenient because allowing name-and-type-based overloadings would mean significant VM changes. Today, a classfile cannot store duplicate methods (same name and formal parameter types) and the invokevirtual instruction could not differentiate between them in any case. A compiler would need to use invokedynamic to accurately resolve such methods.

As a matter of interest, is it possible to use names and types at static method resolution and then erase the names (akin to generics), so types alone are used at dynamic resolution? Not if you want to stay sane. Consider these methods:

void m(Number x) {}
void m(Integer y) {}

and the call:

m(x:5)

which statically resolves to m(Number). Changing the name of its formal parameter:

void m(Number z) {} // z not x
void m(Integer y) {}

means the types still match at runtime (i.e. it's a binary-compatible change) but recompiling the call would give a compile-time error because no methods are potentially applicable. This discrepancy is typical of features implemented by erasure. And not recompiling the call means invokevirtual now targets a less-specific method in terms of type (m(Number) rather than m(Integer)) for a reason (compile-time belief in the presence of an x formal parameter) which no longer holds. More discrepancy.

Finally, note the oddity that m(x:5) resolves to m(Number) while m(5) resolves to m(Integer). It seems that the aggressive approach is dead.
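The type-only half of that oddity can be checked in today's Java: with both overloads present, an int argument is boxed and the more specific method wins. The class name BoxedOverloads is arbitrary.

```java
class BoxedOverloads {
    static String m(Number x) {
        return "m(Number)";
    }

    static String m(Integer y) {
        return "m(Integer)";
    }

    public static void main(String[] args) {
        // Neither overload applies without boxing; after boxing both do,
        // and m(Integer) is the most specific.
        System.out.println(m(5)); // prints m(Integer)
    }
}
```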

In conclusion, named parameters are possible in Java, but reordering - a considerable benefit - is incompatible with a practical design that preserves high levels of compatibility and usability. Nevertheless, named parameters increase readability where it's needed most, and would be an interesting addition to the language.

Thanks to Jon Gibbons, Maurizio Cimadamore, and Keith McGuigan for feedback and assistance.

Posted at 05:40PM Sep 05, 2008 by Alexander Buckley in Java  |  Comments[5]

Monday Jul 28, 2008
A wrinkle with 'module'

We hoped very much that the 'module' restricted keyword could be disambiguated everywhere in the language with only a fixed syntactic lookahead. That is, a compiler could treat 'module' as an identifier everywhere except in certain productions, for which a simple algorithm would use the immediate context to determine if 'module' was a modifier or an identifier. Even edge cases seemed to support this hope:

module class C { ...
module module module;
module module module() { ...

However, consider this code:

class foo {
module foo() { ...

Is foo() a method with a return type of 'module', or a module-private constructor? The former is legal today, and though it's very bad practice to have a method take the name of the class, it must remain legal in JDK7. So how can we disambiguate it from a module-private constructor?

One option is to do semantic analysis of the subsequent method body. javac currently treats any 'return' statement with an expression in the body of a constructor, regardless of control flow, as an error, so finding such a return would be enough to conclude that the declaration is not a constructor. But this level of analysis is completely inappropriate in a parser.

Another option, which we prefer, is to recognize that method name == class name is bad practice and that it's defensible to parse the term:

  'module' <identifier> '('

as a module-private constructor if the identifier is equivalent to the class name. If you're currently using the term to declare a method which returns a 'module' object, the compiler will complain about the method's 'return' statement(s) having an expression - but fear not, you will be able to put the 'package' modifier on your dubious declaration to make the compiler realize that 'module' is a return type not a modifier. (Actually, any accessibility modifier will do.) Admittedly, this means that the 'module' restricted keyword is not 100% backward-compatible, but it's pretty close. We've thought for years about introducing 'package' as an explicit modifier, to increase consistency and to allow package-private interface members. It finally looks like we have a compelling reason to do it.
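Here is a toy version of that rule operating on a list of tokens. It is a sketch of the proposal only: the 'package' modifier does not exist in the language, and ModuleLookahead is not how any real parser is structured.

```java
import java.util.List;

class ModuleLookahead {
    // 'package' is the proposed explicit modifier; the others exist today.
    static final List<String> ACCESS = List.of("public", "protected", "private", "package");

    /** Classifies a member declaration that starts: [access modifier] 'module' identifier '(' */
    static String classify(String className, List<String> tokens) {
        boolean explicitAccess = ACCESS.contains(tokens.get(0));
        int i = explicitAccess ? 1 : 0;
        if (!tokens.get(i).equals("module") || !tokens.get(i + 2).equals("(")) {
            throw new IllegalArgumentException("not of the form 'module' <identifier> '('");
        }
        String identifier = tokens.get(i + 1);
        if (!explicitAccess && identifier.equals(className)) {
            return "module-private constructor";
        }
        return "method returning module";
    }

    public static void main(String[] args) {
        System.out.println(classify("foo", List.of("module", "foo", "(")));            // prints module-private constructor
        System.out.println(classify("foo", List.of("package", "module", "foo", "("))); // prints method returning module
    }
}
```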

If constructors were more strongly called out in the language, then no ambiguity would occur for 'module'. A 'constructor' modifier would suffice, or mandating a reserved name like 'init', or just defining any method-like declaration as a constructor if it has the name of the class.

Compared to class types, enum types are simple. Their constructors cannot be 'public' or 'protected' because creation of enum objects is heavily controlled. Making module-private constructors illegal is a no-brainer. Therefore, in a poorly named method (shares the enum's name), a 'module' identifier is automatically a return type. Interface types are also simple; with no constructors to worry about, the simple syntactic lookahead rules disambiguate 'module'-as-modifier from 'module'-as-identifier in any method, poorly named or not.

Edit: Clarified that a 'return' in an existing 'module <identifier> (' method would have an expression, which is illegal in a constructor.

Posted at 05:52PM Jul 28, 2008 by Alexander Buckley in Java  |  Comments[13]

Wednesday Jun 25, 2008
Bootstrapping modules into Java

We plan to modularize the source of the Java compiler in SE 7, i.e. group its packages into modules. As with any large piece of software, modularization brings more precise dependencies, a clearer API, and easier reuse.

A module-aware compiler will be needed to compile the modularized SE 7 compiler source. The compiler in SE 6 is not module-aware. This is a problem because Sun has a policy that the Java compiler source for SE n must be compilable by the compiler in SE n-1. (And that the resulting SE n compiler can execute on SE n-1.)

We can solve the problem via a two-step process:

1) Bootstrap. We will use the SE 6 compiler to compile the SE 7 compiler source, hiding the package-info.java files which associate packages in the SE 7 compiler source with modules. The result of the SE 6 compiler run will be a "bootstrap" SE 7 compiler which is module-aware (knows how to compile modularized code) but is not itself modularized.

2) Modularize. We run the bootstrap SE 7 compiler on SE 6 to compile the SE 7 compiler source again. Since the bootstrap compiler is module-aware, the package-info.java files can be visible. The result of the SE 7 compiler run will be a "real" SE 7 compiler which is module-aware and is itself modularized.

Now that we have a modularized SE 7 compiler, we need a module-aware JVM to run it on. The SE 7 JVM is written in C++, so it can be made module-aware independently of these compiler shenanigans.

What about core libraries? A modularized compiler should ship with modularized libraries. Compiling a modularized library requires a module-aware compiler, which happily we have after step 1. So in step 2, we can run the bootstrap compiler on SE 6 to compile not only the modularized SE 7 compiler but also the modularized SE 7 library source.

Ultimately, we have a fully modularized SE 7 reference implementation, containing a module-aware JVM, a module-aware and modularized compiler, and a module-aware and modularized set of libraries.

Posted at 06:01PM Jun 25, 2008 by Alexander Buckley in Java  |  Comments[5]

Wednesday Jun 04, 2008
Consistent module membership declarations

You can use the 'module' keyword at the start of a compilation unit to declare which module the unit's types belong to. This keeps important information about program organization close to the code. We require every compilation unit in a package to declare the same module membership, since no-one likes split packages. (I will not discuss package-info module declaration here.) What should a compiler do if it comes across inconsistent compilation units, or even inconsistent classfiles?

Consider these two compilation units, and that R is compiled first:

P/Q.java:
module M;
package P;
... new R(); ...

P/R.java:
module N;
package P;
public class R { ... }

Since R is public, it's not material to Q that R declares a different module than Q declares. We could let the inconsistency slide when compiling Q, only raising an error if Q tries to access a module-private member of R or some module-private type in P. But this would let a package become really split. There will be lots of public types joining modules and staying 'public', so the problem will be common.

Our plan is to require a compiler to give an error if any reference is made, from the current compilation unit, to a type claiming to be in the same package but different module. This catches potential split packages early. It won't matter whether the referenced type is in a compilation unit or a classfile, nor whether the referenced type or its referenced member is public. The error is logically the fault of the referenced type (P.R) though it should be reported in the context of compiling the referring type (P.Q).

Inspired by JLS 7.6, the rule is:

- The host system must enforce the restriction that it is a compile-time error if an observable compilation unit C belongs to a module which is not consistent with the module of any other compilation unit D in the same package as C to which code in C refers (directly or indirectly).

- The host system may choose to enforce the restriction that it is a compile-time error if an observable compilation unit belongs to a module which is not consistent with the module of any other observable compilation unit in the same package.
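The mandatory rule above can be sketched as a check over a reference graph. This is only an illustration, not how javac works: the `Unit` class, its fields and the `check` method are hypothetical names, and a "unit" here stands for either a compilation unit or a classfile.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ModuleConsistency {
    /** Hypothetical model of a compilation unit or classfile. */
    static final class Unit {
        final String type, pkg, module;
        final List<Unit> refs = new ArrayList<>(); // types this unit refers to directly

        Unit(String type, String pkg, String module) {
            this.type = type;
            this.pkg = pkg;
            this.module = module;
        }
    }

    /**
     * Walks everything c refers to (directly or indirectly) and reports each
     * type in c's package that declares a different module than c does.
     * Errors are reported in the context of the referring unit c.
     */
    static List<String> check(Unit c) {
        List<String> errors = new ArrayList<>();
        Set<Unit> seen = new HashSet<>();
        Deque<Unit> todo = new ArrayDeque<>(c.refs);
        while (!todo.isEmpty()) {
            Unit d = todo.pop();
            if (d == c || !seen.add(d)) {
                continue; // skip c itself and anything already visited
            }
            if (d.pkg.equals(c.pkg) && !d.module.equals(c.module)) {
                errors.add(c.pkg + "." + c.type + ": " + d.pkg + "." + d.type
                        + " declares module " + d.module + ", expected " + c.module);
            }
            todo.addAll(d.refs);
        }
        return errors;
    }

    public static void main(String[] args) {
        Unit q = new Unit("Q", "P", "M"); // P/Q.java from the example
        Unit r = new Unit("R", "P", "N"); // P/R.java from the example
        q.refs.add(r);                    // Q contains "new R()"
        System.out.println(check(q));     // one error, raised while compiling Q
        System.out.println(check(r));     // R refers to nothing, so no error
    }
}
```

Whether R or any of its members is public plays no part in the check, which matches the plan described above.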

Posted at 12:39PM Jun 04, 2008 by Alexander Buckley in Java  | 

Thursday May 29, 2008
Versioning in the Java Module System

Stanley Ho and I have published a two-part blog entry on how the Java Module System tries to strike a balance between flexibility and readability in its versioning scheme. The first part is about the format of version numbers and the lessons drawn from 12 years of JDK version policy. The second part is about version ranges. If you have comments, it's best to leave them on the two entries, rather than here.

Posted at 05:03PM May 29, 2008 by Alexander Buckley in Java  | 

Thursday May 22, 2008
The effect of modules on 'protected'

Adding modules to the Java language has an interesting interaction with protected accessibility.

First, consider how protected accessibility is described in JLS 6.6.2: "A protected member or constructor of an object may be accessed from outside the package in which it is declared only by code that is responsible for the implementation of that object." The "responsible for implementation" phrase is a piece of morality which translates to the accessing code being in a subtype of the type which declares the protected member. Then, ignoring private, there is a total ordering of accessibility levels:

public
  |
protected   // other pkg if subtype
  |
package

For module accessibility, the JLS will say something like: "A module-private member or constructor of an object can be accessed from outside the package in which it is declared by code that belongs to the same module." There will then be two accessibility levels greater than package and less than public:
                  public
        ____________|____________
       |                         |
   protected                   module
   // other pkg if subtype     // other pkg if same module
       |                         |
       |____________ ____________|
                    |
                 package

Is a total ordering possible? Consider the accessibility of a protected member:

When protected was invented, the types "responsible for implementation" of an object could be as far away as a different package. With modules, they could be as far away as a different module. This gives two options for which subtypes can access a protected member:

Suppose we decided, on a moral basis, that a single module is the unit "responsible for implementation". Then, protected members should be accessible to subtypes only in the same module. This induces a total ordering:

public
  |
module      // other pkg if same module
  |
protected   // other pkg if subtype and in same module
  |
package

But this decision is not very pragmatic. It means two packages where one has a protected member and one has an accessing subtype cannot be factored into two modules. We have no statistics on the popularity of putting related packages in different JAR files, but it seems a reasonable thing to do. Making protected strongly respect a module boundary seems likely to cause pain when moving types into modules. Elsewhere, we have gone to some lengths to ensure that publicly accessible types remain publicly accessible if they're thrown into modules. I conclude that multiple modules can jointly be "responsible for implementation", and that the appropriate interaction of protected and modules is for protected members to be accessible to subtypes in different packages even in different modules.

What about the accessibility ordering now?

So we're stuck with a partial ordering, albeit with a better understanding of what protected means w.r.t. modules:

                  public
        ____________|____________
       |                         |
   protected                   module
   // other pkg if subtype     // other pkg if same module
       |                         |
       |____________ ____________|
                    |
                 package
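One way to see why this ordering is only partial is to model each level by the set of places, outside the declaring package, from which a member can be accessed. The sketch below is a toy model, not anything from the JLS or javac; the `Origin` and `Level` enums and their members are names invented for illustration. One level is at least as accessible as another when its set contains the other's.

```java
import java.util.EnumSet;
import java.util.Set;

class AccessLattice {
    /** Hypothetical classification of an access coming from outside the declaring package. */
    enum Origin { SUBTYPE_SAME_MODULE, SUBTYPE_OTHER_MODULE, OTHER_SAME_MODULE, OTHER_OTHER_MODULE }

    enum Level {
        PACKAGE, PROTECTED, MODULE, PUBLIC;

        /** Origins outside the declaring package that may access a member at this level. */
        Set<Origin> reach() {
            switch (this) {
                case PUBLIC:    // anyone
                    return EnumSet.allOf(Origin.class);
                case PROTECTED: // subtypes, even in other modules
                    return EnumSet.of(Origin.SUBTYPE_SAME_MODULE, Origin.SUBTYPE_OTHER_MODULE);
                case MODULE:    // any code in the same module
                    return EnumSet.of(Origin.SUBTYPE_SAME_MODULE, Origin.OTHER_SAME_MODULE);
                default:        // package: nothing outside the package
                    return EnumSet.noneOf(Origin.class);
            }
        }

        /** True if this level grants every access that the other level grants. */
        boolean atLeast(Level other) {
            return reach().containsAll(other.reach());
        }
    }

    static boolean comparable(Level a, Level b) {
        return a.atLeast(b) || b.atLeast(a);
    }

    public static void main(String[] args) {
        System.out.println(comparable(Level.PUBLIC, Level.PROTECTED)); // true
        System.out.println(comparable(Level.MODULE, Level.PACKAGE));   // true
        System.out.println(comparable(Level.PROTECTED, Level.MODULE)); // false
    }
}
```

In this model protected reaches subtypes in other modules while module reaches non-subtypes in the same module, so neither set contains the other.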

Finally, protected is really a meta-modifier - it modifies the package modifier (spelled '' of course) to add accessibility from subtypes. It's not unreasonable for protected to modify module too, such that module protected means accessible from:


giving the ordering:
                  public
                    |
             module protected
             // other pkg if same module
             // other module if subtype
        ____________|____________
       |                         |
   protected                   module
   // other pkg if subtype     // other pkg if same module
       |                         |
       |____________ ____________|
                    |
                 package

The bane of protected is the obscure relationship between the type declaring the protected member and the qualifying type of the reference to the member, enforced during compilation and verification. But this relationship is independent of whether the accessing subtype is in a different package or a different module, so it is not a prima facie reason against module protected. Whether module protected is worthwhile is an open question - it can't be removed if it's somehow 'wrong', the classfile format would have to be updated, tools would need to parse it, etc, etc. module protected would however be an interesting counterpart to Peter Kriens' multi-module accessibility level, which one might denote module exported. Watch this space.

Posted at 08:29PM May 22, 2008 by Alexander Buckley in Java  |  Comments[3]

Tuesday Apr 22, 2008
Peter Kriens on language-level modularity

Peter Kriens, the OSGi spec lead and official evangelist, takes a positive view of language-level modularity. His focus on "requirements, not solutions" is especially helpful. Here are some responses to his points:


Module-private interfaces
There is little difficulty in allowing an interface to be module-private, since it can already be package-private or public. As for interface members, it was a nice simple approach back in the day to make them automatically "accessible outside my package", since that's exactly what 'public' meant. (Of course the cost was excessive exposure of implementation methods.) Now that public is no longer the only "outside my package" level, it makes sense to allow module-private interface members:

// P/Q/I.java
module P;
package P.Q;
module interface I {
module void m();
}

// P/Q/R/C.java
module P;
package P.Q.R;
class C implements P.Q.I {
module void m() { ... }
}

Classloaders
Peter's thinking very much reflects mine.

Dynamic membership
Now that JSR 277 is the place for Java modularity, look for a unified module reflection API soon. Stanley Ho and I do understand why dynamic membership is important.

Module export
Peter identifies an additional accessibility level, 'export', between module-private and globally public. I might call it multi-module-private, since a type/member marked 'export' is accessible in its own module and from any module which imports that module. The difficulty is that the VM won't know about module imports so can't determine which other modules can access a multi-module-private type/member. The same is true of javac - it generally won't know if the caller's module imports the callee module where the 'export' type/member lives. (There will be a way to compile programs in the context of a runtime module system, but it will not be mandatory because people should be able to use modules simply as "better packages" without deployment overhead like module dependencies and packaging.)

Scoping
Peter proposes to qualify type names in classfiles with their declaring module. This is not strictly necessary because when the VM resolves a module-private type/member, it must compare the caller's module with the callee's module. It cannot rely on the caller's classfile claiming that the callee's module is so-and-so, hence I don't see the benefit in embedding the callee module in the caller. Indeed, this is the kind of excessive two-way dependency we're now trying to avoid. Also, having a caller commit to which module declares a type seems contrary to having the runtime obtain types from arbitrary modules (e.g. in the context of an import-by-package dependency).

Versioning
A standard version schema for Java types would be very interesting but not in the Java SE 7 time frame. We all need to think about this more. (See Alexander Krapf at JavaPolis and "UpgradeJ: Incremental Typechecking for Class Upgrades" by Bierman, Parkinson and Noble at ECOOP 2008.)

I look forward to more collaboration with Peter in future!

Posted at 02:31PM Apr 22, 2008 by Alexander Buckley in Java  | 

Thursday Mar 27, 2008
Module membership declarations

With my JSR 294 spec lead hat on, I recently proposed a change to the superpackage model which JSR 294 defines in the service of JSR 277's deployment modules. Early feedback has been positive, but where to declare module membership in source code is an ongoing issue.

When module membership is decentralized, i.e. 'module M;' appears in compilation units, each compilation unit that declares 'package P;' can declare a different module. But a package must not be "split" among different modules. If two types in the same package could be in different modules, then each type could access the other's package-private members but not its module-private members. This is stupid; types in different packages in a module can access each other's module-private data, so types in the same package ought always to be able to access each other's module-private data. For analogous reasons, deployment module systems frown on split packages too.

The question for JSR 294 is how to enforce consistent module membership across all the compilation units which declare a given package. I could just write a declarative statement in the JLS and let javac worry about enforcing it:

"If a compilation unit C1 declares module membership M1 and package membership P1, and there exists another compilation unit C2 which declares module membership M2 and package membership P2, then P1==P2->M1==M2 or a compile-time error occurs."
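Read literally, the rule is a scan over every unit that declares a given package. Here is a minimal sketch of that scan, assuming each unit is reduced to a {package, module} pair; the class and method names are made up for illustration and do not correspond to anything in javac.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SplitPackageCheck {
    /**
     * Each element of units is a {package, module} pair taken from one
     * compilation unit. Returns the packages for which P1==P2 -> M1==M2 fails.
     */
    static Set<String> splitPackages(String[][] units) {
        Map<String, String> moduleOf = new HashMap<>(); // first module seen per package
        Set<String> split = new TreeSet<>();
        for (String[] u : units) {
            String previous = moduleOf.putIfAbsent(u[0], u[1]);
            if (previous != null && !previous.equals(u[1])) {
                split.add(u[0]); // same package, different module
            }
        }
        return split;
    }

    public static void main(String[] args) {
        String[][] units = {
            {"P", "M"},  // P/Q.java
            {"P", "N"},  // P/R.java
            {"X", "M"},  // X/A.java
            {"X", "M"},  // X/B.java
        };
        System.out.println(splitPackages(units)); // [P]
    }
}
```

The check itself is trivial; the cost lies in having to gather every unit of the package before it can run.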

But making javac inspect every source/class file in a package whenever a compilation unit in that package is recompiled will hardly be popular. Instead, a common idea is to declare module membership in package-info.java:


// P/package-info.java
@Imports(...)
module M;
@Foo
package P;

Module membership is now not declared near a type declaration, but it's clear from the presence of a 'module' modifier in a compilation unit that a predictable package-info file should be consulted by the compiler or developer. And there's no issue with types in an unnamed package (i.e. no package-info), since they can never be module members anyway.

However, things aren't as neat as they look. The moral argument against using package-info is that an artifact in a classfile, such as module membership, should have a direct representation in the corresponding source file. This is the kind of argument you think you can ignore in the short term but that you rue ignoring in the long term. The other argument against package-info is that you should not actually annotate a module declaration there even though it feels natural to do so. Enumerating module-level annotations is important for JSR 277, and the right solution is to declare them in a module-info.java file, following the precedent of package-info.java. So now you have module-info which says 'module M;' and package-infos which say 'module M;' and maybe we should just centralize the module's package list in module-info and leave package-info and normal source/class files alone? Sadly this doesn't work because the compiler and developer have no way of easily finding the "correct" module-info file for a given package.

In summary, when viewing a module bottom-up (compiler or developer reading a single source/class file), module membership should be in package-info for convenience; when viewing a module top-down (277 tool packaging a module given its name and constituent classfiles), module attributes should be in module-info for completeness. The moral argument, that module membership should be in normal compilation units for clarity, loses. Both module-info and package-info should be able to say 'module M;', just as compilation units and package-info can say 'package P;'. Annotations on 'module M;' in a package-info file are strongly discouraged.

Many thanks to Roman Elizarov for discussions on this topic.

Posted at 11:34PM Mar 27, 2008 by Alexander Buckley in Java  |  Comments[4]

Thursday Feb 14, 2008
Why new T() is not possible in Java

People sometimes think that 'new T()' would be possible iff generics were reified. This is not true. Consider:

class Foo<T> {
T f = new T();
}

With erasure, you implement 'new T()' as 'new Object()', since Object is the bound of T. With reification, you instantiate an object whose class is the dynamic binding for T in 'this'. Either way, you must execute a no-args constructor.

But Foo doesn't require that a type bound to T (a.k.a. a witness of T) has a no-args constructor. 'new Foo<Integer>()' is perfectly legal, but Integer doesn't have a no-args constructor, so how is the instance initialization expression supposed to call 'new T()'? It can hardly make up a default value to pass to Integer's constructor.
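The missing constructor is easy to observe with reflection. The sketch below simply asks a class whether it declares a no-args constructor; `NoArgsWitness` and `hasNoArgsConstructor` are illustrative names, and reflection is used here only as a probe, not as a proposal for implementing 'new T()'.

```java
class NoArgsWitness {
    /** True if the class declares a constructor taking no arguments. */
    static boolean hasNoArgsConstructor(Class<?> c) {
        try {
            c.getDeclaredConstructor(); // looks up the no-args constructor
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNoArgsConstructor(Object.class));        // true
        System.out.println(hasNoArgsConstructor(StringBuilder.class)); // true
        System.out.println(hasNoArgsConstructor(Integer.class));       // false
    }
}
```

Nothing in the declaration of Foo lets the compiler rule out a witness like Integer, so the failure could only surface at run time.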

'new T()' is fundamentally not possible in the context of nominal type bounds. (Or, if you prefer, in a context of separate compilation, since a global compilation could compute that 'new T()' is sound for all observed instantiations of Foo.) C# 2.0 introduced a structural type bound called the new() constraint to permit 'new T()'. However, they already had a need for interesting rules about which types can witness a type parameter, and in that context the "public parameterless constraint" is straightforward. C++ "concepts" go further in allowing a structural description of the types able to witness a type parameter.

Java is not going to get structural type bounds any time soon. Nominal type bounds of the form C&I (an intersection type) are complicated enough. Consequently, neither erasure nor reification alone can support 'new T()'.

Posted at 07:17PM Feb 14, 2008 by Alexander Buckley in Java  |  Comments[8]

Wednesday Jan 23, 2008
Projects for javac

As you know, the syntax and semantics of a legal Java program are described in the JLS, so javac cannot just accept any program it likes. Patches to javac that change the space of accepted programs will always be contentious, but patches that improve the usability of javac itself would be well received by many people. There is obvious "low-hanging fruit" in diagnostics, because javac's reporting of type errors is less than ideal. Here are some ideas for mini-projects which would have a huge impact on programmer understanding and productivity:

Type derivation
An 'inconvertible types' error gives the source and target types. What is the series of type rules that derived them? (This would help to explain why compound assignments sometimes behave surprisingly with boxed operands, even though the behavior is there for a reason.)

Type membership
I recall numerous bug reports where javac appeared to get accessibility wrong, whereas in fact the submitter was confused about the members present in some types. These reports inevitably featured multiple packages, public types, package-private and protected members, and pathological inheritance. Only careful reasoning about inheritance, and hence the exact membership of a type, would explain javac's (correct) behavior. javac knows this membership already - why not display it? (The same can be said for intersection types.)

Formatting
This is a simple one: display messages in a hierarchical way, and have heuristics to simplify or abbreviate qualified class names. You could also imagine having short-form and long-form versions of some error messages, where the long-form version suggests a way out of the problem.

Capture conversion and type inference
Those 'capture-of-451#...' messages are tough. Why not display the type of an expression before and after capture, including the upper and lower bounds of synthetic type variables? Constraints for the formal type parameters of a method would be good to know as well. In fact, any additional info about the operation of overload resolution is valuable when no most specific method is available.

I wouldn't be surprised if the utility of an error message is inversely proportional to the amount of code within javac which must be unravelled to create it. In any case, these ideas will hopefully stimulate some discussion and experiments in the OpenJDK compiler group.

Posted at 03:42PM Jan 23, 2008 by Alexander Buckley in Java  |  Comments[3]

Friday Sep 21, 2007
Naming the null type

Everyone knows that java.lang.Object is the common superclass of all Java classes. It is also the common supertype of all interfaces, which do not 'extend' Object but do support the Object protocol. This makes it the Top type, useful for programming generic algorithms.

Top represents all values in a programming language. It ensures that the type hierarchy is a complete partial order by providing an upper bound for every pair of types. Computing the upper bound of types is what makes assignment and method call work (via widening reference conversion), so a well-founded type hierarchy is important.

(Ignore that the complete partial order for primitive types is distinct from the complete partial order for reference types. Sigh.)

The counterpart to Top is Bottom, a type that is the common subtype of all other types. Bottom makes the type hierarchy into a lattice because it ensures every pair of types has a lower bound. Lower bounds play a role in Java wildcards - specifically, capture conversion and type inference - so it could be useful to know that every type has a lower bound.

Java has the null type. Pre-JLS3, the null type was not officially the subtype of any type, and the null reference was not officially a value of any type except the null type. A decree made the null reference castable to any reference type for pragmatic reasons. (This is similar to the decree that makes List<?> assignable to a List<T> formal parameter even though List<?> is not a subtype of List<T>. You know the decree as capture conversion.) JLS3 defines the null type as a subtype of every type, so it looks an awful lot like Bottom.

(Strictly, JLS3 restricts the null type to be a subtype of every reference type. Again, just ignore primitive types.)
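The capture conversion decree mentioned above can be seen in a small example. The helper name `swapFirstTwo` and the class name are invented for illustration; the point is only that a List<?> argument is accepted where a List<T> parameter is declared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CaptureDemo {
    /** Generic helper: T names the element type so elements can be read and written back. */
    static <T> void swapFirstTwo(List<T> list) {
        T first = list.get(0);
        list.set(0, list.get(1));
        list.set(1, first);
    }

    static List<?> swapped(List<?> wild) {
        // List<?> is not a subtype of List<T>, but capture conversion gives the
        // wildcard a fresh type variable, and T is inferred to be that variable.
        swapFirstTwo(wild);
        return wild;
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>(Arrays.asList("a", "b", "c"));
        List<?> wild = strings;
        System.out.println(swapped(wild)); // [b, a, c]
    }
}
```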

The null type is expressible, i.e. can be the type of a term. The compiler will expose it if necessary, e.g. int x = true?null:null;. But it is not denotable, i.e. cannot be written as the type of a term. You can't write NullType v = null;. An RFE asks for a name for the null type. Is this a good idea?

Beyond the use case in the RFE, being able to denote NullType would be useful in certain situations where type inference fails, because NullType may be a better actual type argument than Object. So that's in NullType's favor.

Bottom is usually not a denotable (or even expressible) type in textbook type systems because type rules must be special-cased to ignore it. (See Pierce 15.4, 16.4.) But in Java, the presence of a value for the null type means expression evaluation has always had to consider the null type, responding with a NullPointerException. (Indeed, the null reference means that the null type is not a true Bottom type.) Introducing NullType would allow more variables to store the null reference, but such variables evaluate to the null reference just like any variable of reference type can.

Statements would need tweaking. Consider the if statement: "The Expression must have type boolean or Boolean, or a compile-time error occurs." A type system with Bottom would allow the expression to have the Bottom type by subsumption, so traditionally an extra rule would catch that case and assign Bottom as the type of the whole statement. We just want if ([expression of null type]) ... to be illegal, so we would need an "exactly" before "type boolean or Boolean".

[The first version of this blog entry said this wasn't necessary because final types didn't have any subtypes, not even the null type. Prompted by Remi's comment, I changed my mind. While a final class has no further implementations, special subtypes are possible.]

So, since Java already has the null reference, adding NullType is less of a problem than it would be if the null reference didn't exist.

Arrays cause a slight pain. A NullType[] can store only null values, which appears useless but someone will want it. On the face of it, we need the null type to be reified to enforce array covariance:

NullType[] n = new NullType[5];
Object[] o = n;
o[0] = new Object(); // Statically safe, dynamically unsafe - Object, a supertype of NullType, is now stored in n
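NullType cannot be written today, so the snippet above does not compile. The same hazard can be reproduced with String[] standing in for the hypothetical NullType[]. This sketch shows the runtime store check that a reified element type makes possible; the class and method names are illustrative.

```java
class CovariantArrayStore {
    /** Tries to store value in target[0]; returns true if the VM rejects the store. */
    static boolean storeRejected(Object[] target, Object value) {
        try {
            target[0] = value;
            return false;
        } catch (ArrayStoreException e) {
            return true; // the array's reified element type did not admit the value
        }
    }

    public static void main(String[] args) {
        String[] s = new String[5];
        Object[] o = s; // array covariance: statically fine
        System.out.println(storeRejected(o, new Object())); // true
        System.out.println(storeRejected(o, "ok"));         // false
        System.out.println(storeRejected(o, null));         // false
    }
}
```

The null reference is the one value that every array accepts, which is why a NullType[] would be able to hold it and nothing else.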

To avoid reification, we could define NullType[] as a static equivalent of List<? super NullType>. Then, elements could be added to a NullType[] but not removed (except as Object). A more drastic idea is to make arrays of NullType denotable but uninstantiatable, like arrays of generic types. The value of all this is becoming questionable.

Denoting the null type is less useful in Java than might be expected. Consider the classic uses of the Bottom type:

To summarize, the null reference makes NullType in Java weaker than Bottom, which in turn makes NullType less problematic than Bottom but also less useful. No other major programming language denotes NullType, let alone Bottom, so it is hard to claim that Java is falling behind by not having NullType. It doesn't make things simpler, nor radically expand the space of programs that can be easily written, so don't look for it in JLS4.

Posted at 10:34AM Sep 21, 2007 by Alexander Buckley in Java  |  Comments[9]