Wednesday January 10, 2007
So the tiered jvm is working pretty well, but I want to see startup performance closer to par with the client vm, so I have to make some changes to try and help it out. In the client vm the situation looks like this:
- Interpreter collects no bytecode based profile data but tracks invocations and backbranches on a per method basis. (Backbranches are used to trigger an OSR compile, invocations trigger normal compiles.)
- client compiler generates code that collects no profile data.
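As a rough illustration of the client vm scheme above, here is a minimal sketch of per-method counters where invocations trigger a normal compile and backbranches trigger an OSR compile. The class name, the method names, and both threshold values are hypothetical and chosen only for the example; they are not HotSpot's actual names or settings.

```java
// Hypothetical sketch: names and thresholds are illustrative, not HotSpot's.
final class MethodCounters {
    static final int INVOCATION_LIMIT = 1_500;   // assumed value
    static final int BACKBRANCH_LIMIT = 10_000;  // assumed value

    enum Action { NONE, NORMAL_COMPILE, OSR_COMPILE }

    // Both counters are kept per method; nothing is tracked per bytecode.
    private int invocations;
    private int backbranches;

    // Called on every method entry; enough entries request a normal compile.
    Action onInvoke() {
        invocations++;
        return invocations >= INVOCATION_LIMIT ? Action.NORMAL_COMPILE : Action.NONE;
    }

    // Called on every backward branch; enough loop iterations request an OSR compile.
    Action onBackbranch() {
        backbranches++;
        return backbranches >= BACKBRANCH_LIMIT ? Action.OSR_COMPILE : Action.NONE;
    }
}
```

The point of the sketch is only that both counters live at method granularity and that each one feeds a different kind of compile request.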
In the server vm the situation looks like:
- Interpreter initially collects invocation data and backbranch data on a per method basis. After a method heats up an object is created that allows collection of profile data on a bytecode basis. At this point backbranch counts are unique to the bytecode index (aka bci, i.e. the index into a method's bytecodes).
- Server compiler generates code that collects no profile data. The compiler makes extensive use of the profile data. It is able to excise bytecodes that, based on the profile data, are never executed. If we hit one of these paths that was expected to be untaken, code inserted there generates an uncommon trap. This causes us to execute in the interpreter again and collect more profile data.
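The two interpreter modes in the server vm might be pictured roughly as follows: a method starts with method-level counters only, and once it heats up a profile object is allocated so that backbranches are counted per bci. This is a sketch under stated assumptions; `InterpretedMethod`, `PROFILE_THRESHOLD`, and the use of a map as the profile object are all made up for the example and do not reflect HotSpot's real data structures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an interpreter that switches to per-bci profiling.
final class InterpretedMethod {
    static final int PROFILE_THRESHOLD = 1_000; // assumed value

    private int invocations;
    private int methodBackbranches;                   // used before the method heats up
    private Map<Integer, Integer> backbranchesByBci;  // null until the method heats up

    void onInvoke() {
        invocations++;
        // Once the method is warm, allocate the per-bytecode profile object.
        if (backbranchesByBci == null && invocations >= PROFILE_THRESHOLD) {
            backbranchesByBci = new HashMap<>();
        }
    }

    void onBackbranch(int bci) {
        if (backbranchesByBci == null) {
            methodBackbranches++;                      // per-method counting
        } else {
            backbranchesByBci.merge(bci, 1, Integer::sum); // per-bci counting
        }
    }

    boolean isProfiling() { return backbranchesByBci != null; }

    int methodBackbranches() { return methodBackbranches; }

    int backbranchesAt(int bci) {
        return backbranchesByBci == null ? 0 : backbranchesByBci.getOrDefault(bci, 0);
    }
}
```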
In the tiered vm as it existed a few weeks ago the situation was:
- Interpreter collects no bytecode based profile data, ever. It works exactly as it does in the client vm.
- The client compiler (initial tier) generates code to track invocation data and bci based profile data.
- Server compiler works just like it does in the server vm, although the invocation count trigger point seems to be best around 35k instead of the server vm's 10k.
The initial theory was that client compiler code was going to be fast enough that collecting full profile data wasn't going to be that bad. I've done some benchmarking with Alacrity and the cost generally isn't too bad (10% or so), although there are a couple of benchmarks where the impact is pretty severe.
So in the server vm the interpreter runs in two modes. The idea is to do the same for the client compiler and in effect add more tiers. Instead of a 3 speed transmission we go to 4 speeds to try and smooth out the startup performance differences I was seeing. Actually I've added one more gear on top of that; the tiered system can now create client compiler code that looks like:
- No profiling at all, just like in the client vm. Internally this is called tier1.
- Invocation counts and per method backbranch counting. This is called tier2.
- Invocation counts and bci based profile data. This is tier3.
Now the truth is that as I'm currently running the system I'm only using tiers > 1. The expectation is to use tier1 for special circumstances. For instance if someone finds that a method miscompiles at tier 4 (server compiler) we have a way to allow them to specify that the method only reaches tier1. No sense in penalizing them further by collecting profile data we can't even use. Similarly there are methods that for various reasons (resource limits typically) the server compiler can't compile. In that situation we're better off using tier1 instead of being trapped in the interpreter like the current server vm is.
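To make the tiers and the tier1 special case concrete, here is a small sketch. The `Tier` enum mirrors the list above plus tier 4 for the server compiler. `TierPolicy.initialTier` is a hypothetical helper, and its choice of tier2 as the starting point for ordinary methods is an assumption made for the example; the text only says that tiers > 1 are what normally get used.

```java
// Hypothetical sketch: the enum mirrors the tiers described in the text,
// but the names and the policy method are illustrative only.
enum Tier {
    TIER1(false, false), // client compiler code, no profiling at all
    TIER2(true, false),  // invocation counts and per-method backbranch counts
    TIER3(true, true),   // invocation counts and bci-based profile data
    TIER4(false, false); // server compiler code, collects no profile data

    final boolean countsInvocations;
    final boolean profilesPerBci;

    Tier(boolean countsInvocations, boolean profilesPerBci) {
        this.countsInvocations = countsInvocations;
        this.profilesPerBci = profilesPerBci;
    }
}

final class TierPolicy {
    // Picks the first compiled tier for a method, given the highest tier
    // the method is allowed to reach.
    static Tier initialTier(Tier cap) {
        // If the method can never reach the server compiler, profile data
        // would never be consumed, so compile it without any profiling.
        if (cap != Tier.TIER4) {
            return Tier.TIER1;
        }
        // Assumption for this sketch: ordinary methods start at tier2.
        return Tier.TIER2;
    }
}
```

A method excluded from the server compiler (because of a miscompile, or because the server compiler cannot handle it) would go through `initialTier(Tier.TIER1)` and end up in unprofiled client compiler code rather than staying in the interpreter.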
So I added all these tiers and I tried it out. It was pretty disappointing: worse than the previous tiered system. It was clear that I was not able to control what compiles were happening and when. After some analysis it was clear what was going on. For various reasons the counters that are used for triggering compiles and for profiling information are partially shared. This was always a compromise at best in the other vms, but in tiered I found that I just couldn't reason about how changes to triggering would influence my profile results. I really wanted triggering data to be separate from profiling data.
This was kind of scary. I wasn't the first hotspot developer to see that this overloading made for hard to predict changes in behavior. The current system has been tuned over a long time to get the kind of performance we want. No one wanted to mess with it for fear of spending an inordinate amount of time getting the performance back to where we wanted it with a saner system. Fortunately I'm not that smart so I'm changing it. :-) Actually I don't think I had much choice, but it was nice to know that others found the counters not entirely rational.
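One way to picture the separation is a pair of counters where the triggering side can be reset or retuned without touching what the compiler later reads as profile data. This is only a sketch of the idea; `SplitCounters` and its methods are hypothetical and say nothing about how the real counters are laid out.

```java
// Hypothetical sketch: one counter drives compile triggering, a second
// independent counter is kept purely as profile data.
final class SplitCounters {
    private int triggerCount;   // consulted only to decide when to compile
    private long profileCount;  // read only as profile information

    // Records one invocation. Returns true when a compile should be requested.
    boolean recordInvocation(int triggerLimit) {
        profileCount++;
        triggerCount++;
        if (triggerCount >= triggerLimit) {
            triggerCount = 0;   // resetting the trigger leaves the profile intact
            return true;
        }
        return false;
    }

    long profileCount() { return profileCount; }
}
```

With a single shared counter, the reset inside `recordInvocation` would also wipe out the invocation count the compiler sees, which is the kind of coupling that makes tuning hard to reason about.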
So I split the triggering mechanism out completely. Pretty much immediately I got back to where I was with the initial tiered system and some benchmarks looked a little better. However I had one benchmark that used to run in 5.5 to 6.0 seconds that was now taking more than 400 seconds! What was up with that?
[ I know the answer to that question and I'll answer it next time. I've left clues as to what the problem is. See if you can figure it out. No prizes though... ]