Wednesday April 11, 2007

Tiered Compilation - issues

 

I got a comment on my last entry about tiered compilation that I'd like to respond to. It is more convenient (and obvious) to do it here. Here is the comment:

Can I try to understand better your "as it works now tiered tends to do too much compilation" comment?

Do you mean that HotSpot currently attempts to compile code prematurely but that that tendency is counterbalanced in practice by the compilations being queued and thus delayed? If so, can't the compilation threshold just be put back a little to achieve the same overall where there is less queueing (more CPUs)?

Or do you mean that you see the compilation taking CPU resources away from the app itself where that may not have been the best choice? Does that still apply where the app is not able to use all the available CPU resource at the time, eg not being threaded enough to use all cores in a Niagara CPU?

So part of the problem is as Damon surmised in his first question, but there is more to it than that. If you don't adjust the thresholds (Tier2CompileThreshold) you'll certainly do too much compilation. Even though that threshold matches what server normally uses, the counters are being incremented by compiled code rather than by the interpreter, and since compiled code is faster than the interpreter we hit the limit sooner. The counters do get decayed every so often, but in a sense we have lengthened that decay period by running faster code. So without any adjustment you'll see a lot more server compiles queued for the same elapsed time in your app's lifetime. My experiments show that a value around 35k-40k works better than the 10k that will be the default in b12. Keeping the threshold working as expected in the default configuration (tiered compilation off) is what kept the tier2 threshold at this low value.
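To make the counter argument concrete, here is a toy model. It is only a sketch and not how HotSpot actually implements its counters: the class name, the "halve the counter every N ticks" decay rule, and all the numbers are assumptions chosen for illustration. It counts calls per tick of elapsed time, decays periodically, and reports when the threshold is reached.

```java
/** Toy model of an invocation counter with periodic decay (illustrative only, not HotSpot code). */
final class CounterDecayModel {

    /**
     * Returns the tick at which the counter first reaches the threshold,
     * or -1 if it never does within maxTicks.
     * Assumption: decay halves the counter every decayEveryTicks ticks of elapsed time.
     */
    static int ticksUntilTrigger(int callsPerTick, int threshold,
                                 int decayEveryTicks, int maxTicks) {
        long counter = 0;
        for (int tick = 1; tick <= maxTicks; tick++) {
            counter += callsPerTick;          // faster code means more calls per tick
            if (counter >= threshold) {
                return tick;                  // a compile would be queued here
            }
            if (tick % decayEveryTicks == 0) {
                counter /= 2;                 // decay is tied to elapsed time, not to call count
            }
        }
        return -1;
    }
}
```

With these made-up numbers, a method running at interpreter-like speed (100 calls per tick) never reaches a 10,000 threshold because decay keeps pulling the counter back, while the same method running ten times faster reaches it at tick 10. Raising the threshold to 35,000 makes the faster code behave like the slower one again, which is the kind of adjustment described above.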

But there is more going on than that. The other issues have to do with how profile data is collected and how the counters work. In the current world (not my development code) the counters are used both as triggers and as measures of execution counts. These two uses work against each other because of how the triggers are implemented, so the execution counts we get are kind of blurry. Also, because of the speed of the interpreter, we don't normally collect full profile data (what we call MDO data) until after a method has warmed up to a degree. This keeps us from slowing down everything that gets interpreted when we only need data for hot methods. It also tends (not entirely by design, I think) to keep us from seeing transients as the system starts up.

Well, in the current (b12) tiered mode we collect MDO data in every c1 (client compiler) compiled method, so we see different profile data than we used to. As it turns out this has more of an effect than you might expect. One of the most important optimizations the compiler does is inlining a virtual call. Because of profiling data we can see, to a degree, that a call site has a single receiver object type, so we can inline the called method based on that observation. Unfortunately it is not unusual to see a call site change its receiver type behavior over time, with the initialization of the JDK essentially polluting the data about what is typical for the app.
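A small sketch of the pollution effect, under loud assumptions: CallSiteProfile is a hypothetical class invented for this example, not the real MDO layout, and "skip the first few calls" stands in for the delayed profiling the interpreter gives us. It records the receiver class seen at one call site and asks whether the site looks monomorphic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical receiver-type profile for a single call site (illustrative only). */
final class CallSiteProfile {
    private final Map<Class<?>, Integer> receiverCounts = new LinkedHashMap<>();

    /** Record the dynamic type of one receiver seen at this call site. */
    void record(Object receiver) {
        receiverCounts.merge(receiver.getClass(), 1, Integer::sum);
    }

    /** A site is treated as monomorphic when exactly one receiver type was observed. */
    boolean isMonomorphic() {
        return receiverCounts.size() == 1;
    }

    /** Build a profile, ignoring the first warmupCalls receivers (profiling starts late). */
    static CallSiteProfile profile(Object[] receivers, int warmupCalls) {
        CallSiteProfile p = new CallSiteProfile();
        for (int i = warmupCalls; i < receivers.length; i++) {
            p.record(receivers[i]);
        }
        return p;
    }
}
```

If the receivers are one StringBuilder during startup followed by nothing but Strings, profiling from the very first call reports two types, while skipping that single startup call reports a monomorphic site. The steady-state behavior is identical in both cases; only the point at which data collection starts differs.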

c2 (the server compiler) tries to handle this situation by tailoring the code at call sites to get the proper behavior. When predictions go bad the code is thrown out (via what we call an uncommon trap) and recompiled based on new data. I was surprised to see how well tuned the system seems to be to not seeing the transients and getting the code right (in the prediction sense of right, not in the correct code sense). When c1 code is collecting the MDO data I tend to see more uncommon traps, so more compiles occur than you'd normally expect. Worse still, because of how we recover in these situations, you can end up with a full vtable call at a call site that is, by the time the app settles down, really monomorphic. So we end up compiling more to get worse code. :-( This is why I said that people who have been using -client to get good startup are likely to be happier with the current incarnation of tiered, because they'll still get better final performance than c1 could deliver, while apps using server can see some loss of performance because the system doesn't get code that is as good as before.
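Here is a rough sketch of how a bad prediction can leave a slow call behind. Everything in it is an assumption made for illustration: GuardedCallSite is a hypothetical class, and the recovery rule (after one failed guard, always use the generic virtual call) is a deliberate simplification of what the real compiler does when it recompiles.

```java
/** Toy model of a type-guarded inlined call with a trap-and-fallback recovery (illustrative only). */
final class GuardedCallSite {
    private final Class<?> predicted;   // receiver type the profile predicted
    private boolean guardDisabled;      // set once a prediction has failed
    private int fastPathCalls;
    private int uncommonTraps;
    private int virtualCalls;

    GuardedCallSite(Class<?> predicted) {
        this.predicted = predicted;
    }

    String invoke(Object receiver) {
        if (!guardDisabled) {
            if (receiver.getClass() == predicted) {
                fastPathCalls++;        // stands in for the inlined body
                return receiver.toString();
            }
            uncommonTraps++;            // prediction failed: throw the code away
            guardDisabled = true;       // assumed recovery: stop predicting at this site
        }
        virtualCalls++;                 // generic virtual dispatch from here on
        return receiver.toString();
    }

    int fastPathCalls() { return fastPathCalls; }
    int uncommonTraps() { return uncommonTraps; }
    int virtualCalls()  { return virtualCalls; }
}
```

Suppose the polluted profile predicted Integer, so the site is built with an Integer guard, and the app then only ever passes Strings. The first String takes the trap, and every later call goes through the virtual path even though the site is now effectively monomorphic on String. That is the "compiling more to get worse code" outcome in miniature.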

Apr 11 2007, 09:57:00 AM EDT