Thursday January 11, 2007

Multi-tiers - 2

Well, no one guessed the answer to yesterday's question as to why the benchmark ran so slowly. I had a couple of internal guesses and one external one. Azeem, who used to work in the HotSpot compiler group, was pretty close. [I expect he might debate whether he was right...]

Truth is I didn't give you all the information, but then it wouldn't have been that hard to guess. The part I left out is what actually happens when we uncommon trap back to the interpreter. It depends on the situation to a degree, because we want to take different actions depending on the cause. In this particular case the action we requested was to reinterpret for a while. In that case we make the current compiled code dead and reset the event triggers so that we will recompile at a later point.

Now in the server VM, when we trap back to the interpreter we collect more profile data. In the case the benchmark presented, we trapped because we hit a previously untaken path. In that case we re-execute the branching bytecode and record that the previously cold path is at least warmer than stone cold.

I had mentioned previously that in tiered the interpreter doesn't collect profile data. So trapping back to the interpreter won't record that the untaken path is now a possible, though improbable, path. Since I wasn't thinking, I had the tiered system, whenever it tripped the counters for a new compile, always compile at the next level above where we compiled last, or at server level if we had already gotten there. So in this case we go server code => interpreter => server code.
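To make that cycle concrete, here is a toy model in Java. It is only a sketch under assumptions: the names (StaleProfileSketch, countTraps) are made up for illustration, the recompile happens instantly after every trap, and the compiler leaves the cold path out whenever its profile counter is still zero. None of this is the actual VM code.

```java
// Toy model only: all names and the instant-recompile policy are assumptions.
final class StaleProfileSketch {

    /**
     * Counts uncommon traps over a run in which every period-th call
     * takes the cold path.
     *
     * @param interpreterProfiles true models the server VM (the interpreter
     *        records the branch after a trap), false models the tiered
     *        setup described above (the interpreter records nothing)
     */
    static int countTraps(boolean interpreterProfiles, int calls, int period) {
        int coldPathCount = 0;            // profile counter for the cold branch
        boolean coldPathCompiled = false; // does the compiled code contain the cold path?
        int traps = 0;

        for (int i = 1; i <= calls; i++) {
            boolean takesColdPath = (i % period == 0);
            if (takesColdPath && !coldPathCompiled) {
                traps++; // compiled code has no cold path, so we trap to the interpreter
                if (interpreterProfiles) {
                    coldPathCount++; // interpreter re-executes the branch and records it
                }
                // Recompile: the cold path is only emitted if the profile has seen it.
                coldPathCompiled = coldPathCount > 0;
            }
        }
        return traps;
    }

    public static void main(String[] args) {
        System.out.println("profiling interpreter:     " + countTraps(true, 1000, 100));
        System.out.println("non-profiling interpreter: " + countTraps(false, 1000, 100));
    }
}
```

With a profiling interpreter the model traps once and then stays compiled; without one, every visit to the cold path traps again, because each recompile sees exactly the same profile as the last.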

So now you can see that maybe I was set up for a long set of cycles here, since I didn't have a path that would mark the untaken path as taken. That would have been bad, but in fact the formerly untaken path was still quite improbable. So although we could cycle here, it was pretty infrequent, and if nothing else had gone wrong the benchmark would have run slower but not as badly as what I was seeing.

Now it turns out that there were multiple uncommon traps happening in this same area of code. The site was a call site in java/util/Hashtable::get. One of the important optimizations that happens is recognizing when a call site is monomorphic. In that case we can avoid the overhead of a virtual call and possibly even inline the target method. The profile data predicted that this call site was monomorphic. Since we can't rely on the site remaining monomorphic, the compiler generates code that will uncommon trap if we find the wrong class. Now things are especially tricky here. Depending on how frequently this happens we can do different things. One thing we try to do is generate code for bimorphic call sites. If we see from the profile data that two classes will cover the site, then a runtime test is inserted to choose between them and we can do non-virtual calls for both classes, possibly inlining both targets. As in the monomorphic case, when we get a class we didn't expect we uncommon trap.
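If it helps to see the shape of that guarded code, here is a rough Java rendering of it. This is a sketch, not what the compiler emits: KeyA, KeyB, and UncommonTrap are hypothetical stand-ins, the "inlined" bodies are just the constants those classes return from hashCode(), and a real uncommon trap hands the frame back to the interpreter rather than throwing an exception.

```java
// Conceptual sketch only: every name here is hypothetical.
final class MonomorphicGuardSketch {

    /** Stand-in for the point where compiled code would uncommon trap. */
    static final class UncommonTrap extends RuntimeException {
        UncommonTrap(String reason) { super(reason); }
    }

    // Two hypothetical receiver classes seen at the call site.
    static final class KeyA { @Override public int hashCode() { return 1; } }
    static final class KeyB { @Override public int hashCode() { return 2; } }

    /** What the source says: an ordinary virtual call. */
    static int virtualCall(Object key) {
        return key.hashCode();
    }

    /** Profile saw only KeyA: one class check, then the inlined target. */
    static int monomorphic(Object key) {
        if (key.getClass() == KeyA.class) {
            return 1; // inlined body of KeyA.hashCode()
        }
        throw new UncommonTrap("unexpected class " + key.getClass().getSimpleName());
    }

    /** Profile saw KeyA and KeyB: a runtime test chooses between two inlined targets. */
    static int bimorphic(Object key) {
        Class<?> c = key.getClass();
        if (c == KeyA.class) {
            return 1; // inlined body of KeyA.hashCode()
        }
        if (c == KeyB.class) {
            return 2; // inlined body of KeyB.hashCode()
        }
        throw new UncommonTrap("unexpected class " + c.getSimpleName());
    }

    public static void main(String[] args) {
        System.out.println(monomorphic(new KeyA()));
        System.out.println(bimorphic(new KeyB()));
        try {
            monomorphic(new KeyB());
        } catch (UncommonTrap t) {
            System.out.println("trap: " + t.getMessage());
        }
    }
}
```

The point of both guarded forms is that the common case costs a class compare or two instead of a virtual dispatch, and the price is a trap whenever the prediction is wrong.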

As it turns out, in running this benchmark the call site goes bimorphic after it has been running a while. Now remember the interpreter wasn't generating profile data. So when we'd trap out of the monomorphic case to go bimorphic, we'd end up recompiling the call site as monomorphic because the profile data never changed. This could really slow you down because this was a relatively hot path. But we have heuristics to cover this. What was supposed to happen was that the compiler would detect that we were trapping frequently at this particular point and change its view of the call. Fortunately that trap counting was happening in the runtime system and not the interpreter, so we did in fact see that we were trapping at this site with an unexpected class.

Now is where we ran afoul of the heuristics. There are throttling mechanisms that prevent us from getting into infinite trap/recompile cycles. If the system thinks that is happening, it compiles the code so that traps no longer ask for recompilation. We may be seeing a lot of traps, which is not good (in fact it may well be better to simply interpret), but at least we aren't also wasting cycles compiling constantly. What happened is I got caught by two competing heuristics. In order to convert the trapping site from monomorphic to bimorphic, the compiler looks at the trap data per method. You might expect it to do it on a bytecode index (bci) basis (we have that data too). I believe that because of races between how that data is collected and the recompile to go bimorphic, we use the per-method limit to give up on the optimization instead of the bci-based data. However, the decision as to what to do at the trap (for this particular type of trap) was on a bci basis, and so we'd decide to stop recompiling on the traps before we'd give up on the optimization and simply put a virtual call at the call site.
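Here is a small Java model of how those two limits can race each other. Treat it as a sketch with invented pieces: the limits passed in are arbitrary numbers, not the VM's real thresholds, the names (CompetingHeuristicsSketch, simulate, Outcome) are hypothetical, and it assumes every trap in the method happens at the one call site.

```java
// Toy model only: names and limit values are assumptions, not HotSpot's.
final class CompetingHeuristicsSketch {

    enum Outcome {
        FIXED_BY_RECOMPILE, // compiler gave up on the optimization and emitted a virtual call
        STUCK_TRAPPING,     // runtime stopped asking for recompiles; stale code keeps trapping
        STILL_CYCLING       // neither limit was reached within maxTraps
    }

    /**
     * Replays up to maxTraps traps at a single call site.
     *
     * @param perBciLimit    traps at this bci before the runtime stops requesting recompiles
     * @param perMethodLimit traps in the method before the compiler abandons the optimization
     */
    static Outcome simulate(int perBciLimit, int perMethodLimit, int maxTraps) {
        int trapsAtBci = 0;
        int trapsInMethod = 0;

        for (int i = 0; i < maxTraps; i++) {
            trapsAtBci++;
            trapsInMethod++; // same site every time, so both counters move together

            // Runtime decision, made per bci: is it still worth asking for a recompile?
            if (trapsAtBci >= perBciLimit) {
                return Outcome.STUCK_TRAPPING;
            }

            // A recompile was requested. Compiler decision, made per method.
            if (trapsInMethod >= perMethodLimit) {
                return Outcome.FIXED_BY_RECOMPILE;
            }
            // Otherwise the profile is unchanged and we recompile the same way.
        }
        return Outcome.STILL_CYCLING;
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, 100, 1000)); // bci limit reached first
        System.out.println(simulate(100, 4, 1000)); // method limit reached first
    }
}
```

Whenever the per-bci limit is the one reached first, the model ends up in STUCK_TRAPPING: the request to recompile is switched off before the compiler ever gets the chance to drop the optimization.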

Once we decided not to recompile, the performance was doomed. We'd pretty frequently be trapping back to the interpreter and finishing the execution of the method. The cost of an uncommon trap is relatively high (they are supposed to be uncommon). So we ended up running even slower than what the interpreter alone would do.

Now here's one final thing about how odd this was. While I was tracking this down I was of course using the debug version of the VM. The debug version of the JVM has tons of assertion checking code. It's a lot slower. Actually we have two versions of the debug VM: fastdebug, where the asserts are on but we optimize the C++ code and generate symbol data for dbx/gdb, and the plain debug version, where the C++ code is not optimized. Both of these VM versions easily outran the normal VM. So why was that? Well, the VM was slow enough that between the time we decided to compile the method and the time we actually got around to compiling it, the call site went bimorphic. So the initial compile got it right immediately and we didn't uncommon trap at this spot at all. The fact that the VM was running slower than normal was completely overwhelmed by compiling this call site in the way that best suited the way the benchmark was running.

Jan 11 2007, 04:59:41 PM EST