Monday January 08, 2007
Tiered compilation - where is it?
It's about time I wrote about what I've been working on (although I've been getting requests for airplane building updates too). I've been trying to get the tiered jvm built as the standard vm, replacing the server vm. My intention was to make the tiered system visible so we could get feedback on it sooner rather than later. To do that, I needed to show that people wouldn't have a heart attack over performance, so I've been running benchmarks.
The good news is that, on our internal benchmarking system (Alacrity), the server results with the tiered vm were pretty much identical to those with the server vm. The client results, which track startup performance, weren't quite so good. That wasn't really a problem for me, because at this point I only wanted to replace the server vm with the tiered vm. In the future I want the client vm to go away too, but given the current state of the tiered vm, the weaker startup numbers weren't too surprising.
The Alacrity startup results were pretty bad, though, so I spent some time investigating them. The obvious suspicion was that the client compiler now generates code to track execution profiles, and that this would explain it. So I built a client vm that could collect profiles via new switches (plural, because I wanted to see which kinds of profile-tracking code were the most expensive) and ran the tests again. It definitely showed that profiling hurt client performance, but nowhere near to the extent I was seeing with the tiered vm. The impact was pretty modest, 6-10% or so for most things, although one benchmark was impacted pretty drastically.
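To make that overhead concrete, here is a minimal Java sketch of what "code that tracks execution profiles" means. It is not HotSpot's actual instrumentation: the `MethodProfile` class and its fields are hypothetical stand-ins for the kind of invocation and branch counters a profiling compiler might emit. The point is simply that every call and every branch in instrumented code pays for a few extra memory updates that the plain version doesn't.

```java
// Hypothetical per-method profile record, loosely modeled on the kind of data
// an instrumenting compiler might collect (invocation and branch counts).
final class MethodProfile {
    long invocations;
    long branchTaken;
    long branchNotTaken;
}

final class ProfilingCostSketch {
    // Plain version: what uninstrumented compiled code would compute.
    static int clampPlain(int x) {
        if (x < 0) {
            return 0;
        }
        return x;
    }

    // Instrumented version: same logic, plus a profile update on entry and on
    // each side of the branch. These extra stores are the overhead in question.
    static int clampProfiled(int x, MethodProfile p) {
        p.invocations++;
        if (x < 0) {
            p.branchTaken++;
            return 0;
        }
        p.branchNotTaken++;
        return x;
    }

    public static void main(String[] args) {
        MethodProfile profile = new MethodProfile();
        long plainSum = 0;
        long profiledSum = 0;
        for (int i = -5; i < 5; i++) {
            plainSum += clampPlain(i);
            profiledSum += clampProfiled(i, profile);
        }
        // Both versions compute the same answer; only the profiled one
        // leaves data behind for a later, more optimizing compile.
        System.out.println("plainSum=" + plainSum + " profiledSum=" + profiledSum);
        System.out.println("invocations=" + profile.invocations
                + " taken=" + profile.branchTaken
                + " notTaken=" + profile.branchNotTaken);
    }
}
```

Running it prints `plainSum=10 profiledSum=10` followed by `invocations=10 taken=5 notTaken=5`: identical results, with the profiled version doing extra bookkeeping on every call.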
Another suspicion was that we didn't have code that allowed a thread executing a hot loop in client code to OSR (on-stack replace) its way into server code. So I added that capability. It had no real impact, so although we'll need it at some point, adding it didn't get me any closer to explaining the startup gap.
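As a rough illustration of what OSR buys you, here is a toy Java model: a loop that counts its backedges while "in client code" and, once the counter crosses a threshold, switches to "server code" in the middle of the loop rather than waiting for the method to be called again. The `Tier` enum, the counter, and `BACKEDGE_THRESHOLD` are all hypothetical; real OSR compiles a special entry point and migrates the live frame, which this sketch does not attempt.

```java
// Hypothetical tiers; names and threshold are illustrative only.
enum Tier { CLIENT, SERVER }

final class OsrSketch {
    // Illustrative threshold, not an actual HotSpot default.
    static final int BACKEDGE_THRESHOLD = 1000;

    Tier tier = Tier.CLIENT;
    int backedges = 0;
    int osrAtIteration = -1;

    // Simulates a long-running loop in client-compiled code. Each iteration
    // bumps a backedge counter; once it reaches the threshold, execution
    // "transfers" into server code mid-loop instead of waiting for the
    // method to be invoked again.
    long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
            if (tier == Tier.CLIENT && ++backedges >= BACKEDGE_THRESHOLD) {
                tier = Tier.SERVER;   // the on-stack replacement point
                osrAtIteration = i;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        OsrSketch s = new OsrSketch();
        long result = s.sumTo(5000);
        System.out.println("result=" + result + " tier=" + s.tier
                + " osrAt=" + s.osrAtIteration);
    }
}
```

Without the transition, a method that is entered once and then loops for a long time would stay in client code for its whole run, which is why the capability matters eventually even though it didn't affect these startup numbers.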
The odd thing was that when I ran tests locally on my workstation, I wasn't seeing the same kind of results that Alacrity showed. It finally dawned on me that the difference was probably class data sharing (CDS). The client vm supports class data sharing: the classes used during vm initialization are preprocessed into a shared archive when the vm is installed, and later client runs can simply load that class data with a good bit less work. This helps startup time. The server vm, because it uses a different garbage collector, doesn't support CDS. And since I rarely install the vm I'm working on the way a normal user would, my client vm didn't have a shared archive to load, so on my workstation neither vm was getting the benefit of CDS.
So I modified the tiered vm so that, when run as a client vm (which is how it was tested in Alacrity), it would use the same garbage collector as the normal client vm and could therefore dump and load the shared archive. This was pretty much the missing piece of the performance puzzle. Between the lack of CDS in the tiered vm and the cost of collecting profile data in the client compiler's generated code, I could account for the performance gap I was seeing compared to the vanilla client vm.
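If you want to check whether a given run actually got CDS, one way on HotSpot is to look at the `java.vm.info` system property. It is what `java -version` prints after the build number, and it typically includes `sharing` when a shared archive was loaded (the archive itself can be regenerated with `-Xshare:dump`). The sketch below relies on that convention, which is an observation about HotSpot's output rather than a documented API, so treat it as an assumption.

```java
final class CdsCheck {
    // Returns true if a HotSpot-style java.vm.info string reports that class
    // data sharing is active, e.g. "mixed mode, sharing".
    // Assumption: HotSpot reports CDS as a comma-separated "sharing" token.
    static boolean reportsSharing(String vmInfo) {
        if (vmInfo == null) {
            return false;
        }
        for (String token : vmInfo.split(",")) {
            if (token.trim().equals("sharing")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String info = System.getProperty("java.vm.info");
        System.out.println("java.vm.info = " + info);
        System.out.println("CDS active: " + reportsSharing(info));
    }
}
```

A check like this would have flagged the workstation setup described above, where the client vm had no archive to load.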
So the next important things to attack are obvious:
1: get the garbage collector the server vm uses to support CDS. That's not something I have any expertise in, so someone else will be doing that work. (My understanding is that it is doable, but it just wasn't seen as that important for the server vm, where startup performance was not that key.)
2: see what I can do about the cost of profiling in the code the client compiler generates. That will be a topic for another day...
Start-up time *can* be important on a server. For example, if I have to restart a JSP server then I want it up and running as fast as reasonably possible to minimise user-visible down-time, and then I want that code to be C2-optimised to bits by HotSpot!
Thus my previous comments to you about remembering from one run to the next which routines turned out to be hot at startup so that they can be C1-plus-instrumentation-compiled upon their very first use on subsequent runs to help start-up performance...
Rgds
Damon
Posted by Damon Hart-Davis on January 08, 2007 at 11:41 AM EST #
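Damon's suggestion above, remembering which methods were hot at startup and compiling them with C1 plus instrumentation on their very first use in the next run, could be sketched roughly as follows. Everything here is hypothetical: `HotMethodCache`, the plain-text file of method names, the threshold, and the `StartTier` values are illustrations of the idea, not HotSpot APIs or an existing feature.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical cross-run "hot method" cache; none of these names are HotSpot APIs.
final class HotMethodCache {
    enum StartTier { INTERPRET, C1_WITH_PROFILING }

    private final int hotThreshold;   // invocations needed to count as hot
    private final Map<String, Integer> counts = new HashMap<>();
    private final Set<String> hotFromLastRun = new HashSet<>();

    HotMethodCache(int hotThreshold) {
        this.hotThreshold = hotThreshold;
    }

    // At startup: read the names a previous run marked as hot, if the file exists.
    void load(Path file) throws IOException {
        if (Files.exists(file)) {
            hotFromLastRun.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    // Previously hot methods skip the interpreter and go straight to
    // instrumented C1 code on their very first call.
    StartTier tierOnFirstUse(String method) {
        return hotFromLastRun.contains(method)
                ? StartTier.C1_WITH_PROFILING
                : StartTier.INTERPRET;
    }

    // During the run: count invocations per method.
    void recordInvocation(String method) {
        counts.merge(method, 1, Integer::sum);
    }

    // At shutdown: persist every method that crossed the threshold this run.
    void save(Path file) throws IOException {
        List<String> hot = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= hotThreshold) {
                hot.add(e.getKey());
            }
        }
        Files.write(file, hot, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("hot-methods", ".txt");
        try {
            // First run: Foo.bar gets hot, Foo.baz does not.
            HotMethodCache firstRun = new HotMethodCache(3);
            for (int i = 0; i < 5; i++) {
                firstRun.recordInvocation("Foo.bar");
            }
            firstRun.recordInvocation("Foo.baz");
            firstRun.save(file);

            // Second run: the starting tier is decided before any counting.
            HotMethodCache secondRun = new HotMethodCache(3);
            secondRun.load(file);
            System.out.println("Foo.bar -> " + secondRun.tierOnFirstUse("Foo.bar"));
            System.out.println("Foo.baz -> " + secondRun.tierOnFirstUse("Foo.baz"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The design choice worth noting is that the cache only changes where a method *starts*; the profiled C1 code would still feed the normal path up to C2, so a stale list costs some wasted early compiles rather than wrong behavior.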
Posted by fatcatair on January 08, 2007 at 04:06 PM EST #
Posted by Elliott Hughes on January 09, 2007 at 01:48 AM EST #
Posted by fatcatair on January 09, 2007 at 09:16 AM EST #
Why aren't I running the 32-bit JVM?
1. because the user gets to choose which JVM to install, not me, and because 64-bit sounds shiny and new, there are plenty of users who're installing 64-bit stuff whether they need the extra address space or not, and without measuring whether the extra registers (I'm assuming x86_64, because that's the world we live in) help more than the over-sized pointers hurt performance.
1a. why aren't I shipping a JRE as part of the app? Because that means much larger downloads, in the main. There are trade-offs both ways with including versus not including a JRE, but for now "too much bandwidth" and "enormous downloads put users off" will do.
2. JNI. Strictly, it's that building multi-arch on Linux is a real PITA, and the JVM needs a shared library of the matching "width".
2a. why aren't I building a .so for i386 and one for amd64 and shipping them both? ...
3. ".deb"/".rpm". It's awkward, strictly incorrect (since you can't specify exactly what your i386 *and* amd64 package is, and have to lie that it's arch-independent, which annoys Linux/ppc and Linux/sun4 users), and against the packaging rules to stick multiple architectures' binaries in the same package.
4. You end up being forced to outlaw the running of a 32-bit JVM on a 64-bit OS, or break the packaging rules. You can't correctly specify your package's architecture, because when you say "amd64" you mean "...and your JVM damn well better be the 64-bit one too". So your amd64 package will install but it may or may not work, depending on whether the user has a 64-bit or 32-bit JVM, and that's why developers really don't want people going around recommending the use of 32-bit JVMs on 64-bit OSes. Though I've done so myself, for exactly the reasons you mention above. But I only recommend it to people like us, who know what we're doing and would recognize and understand what's gone wrong when our JNI-using Java applications fail.
This idea that 64-bit is "high-end" or "server" is so 1980s. Every computer my parents own is now dual-core and 64-bit. Every computer they could realistically buy this year is at least dual-core and 64-bit. It looks like 64-bit Windows won't be the default this cycle, but Mac OS 10.5 might be (Apple have been making a lot of noise about it being fully 64-bit), and Linux cycles are much shorter still, and there's no cost to trying it out, which is probably the reason why I'm seeing the most 64-bit users there. (On Mac OS the JNI thing isn't a problem because Apple's development tools are all geared towards cross-compilation anyway, and you can generate a fat .so that contains all architectures.)
How does the server compiler with CDS compare to the client compiler in terms of start-up time?
Posted by Elliott Hughes on January 09, 2007 at 12:15 PM EST #
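For anyone who does decide to ship both libraries despite the packaging objections in items 3 and 4 of the comment above, the runtime half of the problem is small. On Sun's JVMs, `os.arch` reflects the running JVM rather than the kernel, so a 32-bit JVM on a 64-bit OS reports a 32-bit architecture. A minimal sketch, assuming a hypothetical `lib/<arch>/` layout and a hypothetical library named `myjni`; it only picks the path and does nothing about the packaging problems.

```java
import java.io.File;

final class JniLibraryPicker {
    // Map the JVM's os.arch value to a hypothetical per-architecture directory.
    // os.arch describes the running JVM, not the OS, so a 32-bit JVM on a
    // 64-bit kernel lands in the i386 directory here.
    static String archDir(String osArch) {
        switch (osArch) {
            case "amd64":
            case "x86_64":
                return "amd64";
            case "i386":
            case "i486":
            case "i586":
            case "i686":
            case "x86":
                return "i386";
            default:
                throw new IllegalArgumentException("no bundled JNI library for os.arch=" + osArch);
        }
    }

    // Builds e.g. lib/amd64/libmyjni.so for a 64-bit JVM on Linux.
    static String libraryPath(String baseDir, String osArch, String mappedName) {
        return baseDir + File.separator + archDir(osArch) + File.separator + mappedName;
    }

    public static void main(String[] args) {
        try {
            String path = libraryPath("lib", System.getProperty("os.arch"),
                    System.mapLibraryName("myjni"));
            // A real launcher would now call System.load() with this absolute path.
            System.out.println("would load: " + new File(path).getAbsolutePath());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```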
Posted by fatcatair on January 09, 2007 at 12:56 PM EST #
Comments are closed for this entry.