Wednesday April 11, 2007

Tiered Compilation - issues

 

I got a comment on my last entry about tiered compilation that I'd like to respond to. It is more convenient (and more obvious) to do it here. Here is the comment:

Can I try to understand better your "as it works now tiered tends to do too much compilation" comment?

Do you mean that HotSpot currently attempts to compile code prematurely but that that tendency is counterbalanced in practice by the compilations being queued and thus delayed? If so, can't the compilation threshold just be put back a little to achieve the same overall where there is less queueing (more CPUs)?

Or do you mean that you see the compilation taking CPU resources away from the app itself where that may not have been the best choice? Does that still apply where the app is not able to use all the available CPU resource at the time, eg not being threaded enough to use all cores in a Niagara CPU?

So part of the problem is as Damon surmised in his first question, but there is more to it than that. If you don't adjust the thresholds (Tier2CompileThreshold) you'll certainly do too much compilation. Even though that threshold matches what server normally uses, the counters are being incremented by compiled code and not by the interpreter, and since compiled code is faster than the interpreter we hit the limit sooner. The counters do get decayed every so often, but in a sense we have lengthened the decay period by running faster code: more increments fit in between decays. So without any adjustment you'll see a lot more server compiles queued for the same elapsed time in your app's lifetime. My experiments show that a value around 35k-40k works better than the 10k that will be the default in b12. The tier2 threshold stayed at this low value in order to keep the threshold working as expected in the default configuration (tiered compilation off).
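To make the threshold effect concrete, here is a toy model. It is not HotSpot's counter code. It assumes a counter that is halved once per fixed wall-clock decay period and a compile that is queued when the counter reaches the threshold. The halving rule, the call rates, and the class and method names are all made up for illustration.

```java
class CounterDecaySketch {
    /**
     * Returns the decay period (1-based) in which the counter first reaches
     * the threshold, or -1 if it never does within maxPeriods.
     * Assumption: the counter is halved at the end of every period.
     */
    static int periodsToTrigger(int callsPerPeriod, int threshold, int maxPeriods) {
        int counter = 0;
        for (int period = 1; period <= maxPeriods; period++) {
            counter += callsPerPeriod;      // invocations counted during this period
            if (counter >= threshold) {
                return period;              // a compile would be queued here
            }
            counter /= 2;                   // hypothetical decay step
        }
        return -1;
    }

    public static void main(String[] args) {
        int interpreted = 4_000;   // hypothetical calls per period when interpreted
        int c1Compiled = 12_000;   // same method, assumed 3x faster under c1
        System.out.println(periodsToTrigger(interpreted, 10_000, 100));
        System.out.println(periodsToTrigger(c1Compiled, 10_000, 100));
        System.out.println(periodsToTrigger(c1Compiled, 40_000, 100));
    }
}
```

In this model a lukewarm method that never reaches a 10k threshold while interpreted crosses it in the very first period once c1 code is driving the counter, and a 40k threshold filters it out again.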

But there is more going on than that. The other issues have to do with how profile data is collected and how the counters work. In the current world (not my development code) the counters are used both as triggers and as measures of execution counts. These two uses work against each other because of how the triggers are implemented, so the execution counts we get are kind of blurry. Also, because of the speed of the interpreter, we don't normally collect full profile data (what we call MDO data) until after a method has warmed up to a degree. This keeps us from slowing down everything that gets interpreted when we only need data for hot methods. It also tends to (and not entirely by design, I think) keep us from seeing transients as the system starts up.

Well, in the current (B12) tiered mode we collect MDO data in every c1 (client compiler) compiled method, so we see different profile data than we used to see. As it turns out this has more of an effect than you might expect. One of the most important optimizations that the compiler does is inlining a virtual call. From the profiling data we can see, to a degree, that a call site has a single receiver object type, so we can inline the called method based on that observation. It is not unusual, unfortunately, to see a call site change its receiver-type properties, with the initialization of the JDK essentially polluting the data about what is typical for the app.

So c2 (the server compiler) tries to handle this situation by tailoring the code at the call sites to get the proper behavior. When predictions go bad the code is thrown out (via what we call an uncommon trap) and recompiled based on new data. I was surprised to see how well tuned the system seems to be to not seeing the transients and getting the code right (in the prediction sense of right, not in the correct-code sense). When c1 code is collecting the MDO data I tend to see more uncommon traps, so more compiles occur than you'd normally expect. Worse still, because of how we recover in these situations, you can end up with a full vtable call at a call site that is, by the time the app settles down, really monomorphic. So we end up compiling more to get worse code. :-( This is why I said that my expectation is that people who have been using -client to get good startup are likely to be happier with the current incarnation of tiered, because they'll still get better final performance than c1 could deliver, but that apps using server can see some loss of performance because the system doesn't get code that is as good as before.
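Here is a small sketch of how a startup transient can pollute the receiver-type data at a call site. It only models the idea and is not how MDO data is actually stored. The Shape types, the CallSiteProfile class, and the call counts are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class ReceiverProfileSketch {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    /** Receiver types recorded at one call site (a stand-in for MDO data). */
    static final class CallSiteProfile {
        private final Set<Class<?>> seen = new HashSet<>();
        void record(Object receiver) { seen.add(receiver.getClass()); }
        boolean isMonomorphic() { return seen.size() == 1; }
    }

    /**
     * Drives one call site: the first 10 calls see Circles (the startup
     * transient), the remaining calls see only Squares (the steady state).
     * Profiling starts only after profileAfterCalls calls have happened.
     */
    static CallSiteProfile run(int profileAfterCalls) {
        CallSiteProfile profile = new CallSiteProfile();
        for (int call = 0; call < 1000; call++) {
            Shape s = (call < 10) ? new Circle(1.0) : new Square(2.0);
            s.area();                                   // the virtual call site
            if (call >= profileAfterCalls) {
                profile.record(s);
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        System.out.println(run(0).isMonomorphic());     // profiling from the first call
        System.out.println(run(100).isMonomorphic());   // profiling after warm-up
    }
}
```

Profiling from the very first call records both types, so the site no longer looks monomorphic even though it is in the steady state. Profiling that starts after the warm-up sees only the one type the app really uses.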

Apr 11 2007, 09:57:00 AM EDT Permalink

Comments:

Kewl! Thanks for the explanation.

Still, I'm sure that you have a cunning plan for making the performance better on the server side.

Maybe you could have a heuristic that defers some/most/all C2 compiles while the system looks like it is not in a steady state (at start-up and other 'mode' transitions) to avoid doing too many 'wrong' compilations. Clues of 'unsteady' state might be a high/rising/changing level of class loading/init, uncommon traps, perm/tenured GC, threading levels (count, %age blocked), etc. Just some off-the-cuff thoughts. I have my coat... B^)

Rgds

Damon

Posted by Damon Hart-Davis on April 11, 2007 at 11:41 AM EDT #

Another thought: rather than simply deferring all C2 compilation at 'unsteady'/transient stages, you could throttle/reduce the number of C2 compilation threads to reflect lower confidence that the work done is worthwhile...

Posted by Damon Hart-Davis on April 11, 2007 at 11:47 AM EDT #

And another (tell me when you're bored): only allow C2 compilations when there are no pending C1 compilations and no recent C2-code "uncommon traps" as another measure of execution pattern stability...

Posted by Damon Hart-Davis on April 12, 2007 at 10:51 AM EDT #

Can't wait to test b12, thanks for the great work :-) Unbelievable how good hotspot's code is after all those implemented features. So the only improvement I am really waiting for is stack allocation of objects.

Posted by Clemens Eisserer on April 13, 2007 at 10:20 AM EDT #

I think you'll find that stack allocation of objects is not as useful as you'd think. Since we do TLAB allocation the allocation cost is basically identical. The GC cost isn't that bad so we've not seen anything dramatic from stack allocation. What you really want is scalar replacement so that the object is never in memory at all. This requires the same analysis as doing stack allocation but has the added complexity of having to materialize the object into memory if we need to do a deoptimization of the method. That is not much fun.
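As a rough illustration of scalar replacement, the two methods below compute the same thing. The second one is written by hand to show what the compiler would effectively produce for the first when the Point never escapes. The class and method names are made up, and this is only a source-level picture of a transformation that really happens inside the compiler.

```java
class ScalarReplacementSketch {
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    /** Allocates a Point that never escapes this method. */
    static int lengthSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }

    /**
     * Hand-written equivalent after scalar replacement: the fields live in
     * locals (registers) and no object exists in memory at all.
     */
    static int lengthSquaredScalarReplaced(int x, int y) {
        int px = x;
        int py = y;
        return px * px + py * py;
    }

    public static void main(String[] args) {
        System.out.println(lengthSquared(3, 4));
        System.out.println(lengthSquaredScalarReplaced(3, 4));
    }
}
```

The hard part mentioned above is the reverse direction: if the method deoptimizes partway through, a real Point has to be rebuilt in memory from px and py so that the interpreter can carry on.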

Posted by fatcatair on April 13, 2007 at 10:38 AM EDT #

I don't know the costs of allocation+gc, but doesn't stack allocation basically mean allocation cost only, with no additional work like cleaning up old stuff (since the stack pointer is moved anyway)?

I thought about optimizations like scalar replacement (object merging, ...) too, but I never thought that Sun would implement such features, simply because of how complex they are. Is something really planned? Sorry for being so inquisitive and stealing your time ;)

lg Clemens

Posted by Clemens Eisserer on April 16, 2007 at 11:06 AM EDT #

Yes, I thought it was the free/release side that was potentially a winner for stack allocation, since you would not have to test for liveness at all, so even cheaper than eviction from Eden...

Rgds

Damon

Posted by Damon Hart-Davis on April 17, 2007 at 05:42 PM EDT #

Hey, I really like your blog entries, great to read the grueling details. Question--do you know of written resources, or have you written, about the cost of maintaining all these measurements in the VM? How do you keep those measurements, traps, extra logic from slowing down the code you are hosting? Keep up the good blogs! Thanks, Patrick

Posted by Patrick Wright on April 20, 2007 at 05:39 PM EDT #

So as far as stack allocation is concerned I think that the allocation cost in time is identical. There may well be some benefit from cache locality but it is hard to predict or generalize. So the additional benefit from stack allocation is basically the hope for fewer gc's. In some situations that could be significant but I think our experience is that it is in general not a big win. You seem to expect that TLAB allocation imposes costs at collection time but that isn't true. If the object is live the scanning cost is the same; if it is dead we never see it, and the tlab is recycled at a cost relative to the objects that were live and needed to be promoted. Resetting the tlab pointers is more expensive than adjusting SP on method entry/exit but we're talking about a few (4-6) instructions per thread per gc. That's not even noise level.
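To show why the two allocation costs are comparable, here is a minimal bump-pointer buffer in the spirit of a TLAB. It is a sketch and not HotSpot's implementation: the class name, the use of int offsets in place of addresses, and the -1 result standing in for the slow path are all assumptions.

```java
class TlabSketch {
    private final int start;   // first usable offset in the buffer
    private final int end;     // one past the last usable offset
    private int top;           // next free offset

    TlabSketch(int start, int sizeInBytes) {
        this.start = start;
        this.end = start + sizeInBytes;
        this.top = start;
    }

    /**
     * Fast path: one add, one compare, one store. Returns the offset of the
     * new object, or -1 when the buffer is full (a real VM would take a
     * slow path and fetch a fresh buffer instead).
     */
    int allocate(int sizeInBytes) {
        int newTop = top + sizeInBytes;
        if (newTop > end) {
            return -1;
        }
        int result = top;
        top = newTop;
        return result;
    }

    /** Recycling the buffer is a single pointer reset; dead objects are never visited. */
    void reset() {
        top = start;
    }
}
```

Allocating here is the same kind of pointer bump as moving the stack pointer, and recycling the whole buffer touches none of the dead objects in it.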

Posted by fatcatair on April 25, 2007 at 10:05 AM EDT #

I never waited so much for a new jdk snapshot; for over a week I've been checking the jdk7 project page several times a day ... but it's so late :-/

Posted by Clemens Eisserer on April 30, 2007 at 05:07 PM EDT #

Yeah baseline pushes have been very slow of late. Hopefully soon we'll get back on a regular schedule. In any case B12 will arrive either this week or next week for sure.

Posted by fatcatair on May 01, 2007 at 11:38 AM EDT #

Today b12 arrived! I did some testing and performance seems to be really good. I'll do some further testing and let you know about how (well) it behaved :-)

Posted by Clemens Eisserer on May 07, 2007 at 08:21 AM EDT #

I switched from eclipse to netbeans-6-m9 to take full advantage of Swing now being compiled by the server compiler. The performance is ok so far except for some stuttering (I guess when the server compiler does work) - however a _lot_ of code is compiled. I set 1000/50000 for the thresholds but after working about 30min, the size of the CodeCache is already at ~40mb. After 30min Netbeans consumes 383mb of memory:

7279 ce 25 0 651m 383m 20m S 2 38.3 10:48.36 java

However 15,000 classes are loaded so I guess this is just because netbeans is so fat and has nothing to do with the compiler/jvm itself ;)

lg Clemens

Posted by Clemens Eisserer on May 08, 2007 at 02:46 PM EDT #

I've been at JavaOne this week so I'm sorry to be slow to respond to your note. I'm glad to see that it seems ok for you. I had noted in a blog entry that there is a lot of compilation, though I think in your case it is more a case of so many classes. One of the other things that this brings up, and that in fact was talked about in meetings with customers this week, is our (non-)policy of evicting things from the code cache. This is probably something we will have to look at sooner than I expected.

Posted by fatcatair on May 10, 2007 at 08:07 PM EDT #
