20080108 Tuesday January 08, 2008

Any status on tiered compilation?

So it has been quite a long time since I had anything to say about tiered compilation. Mostly this is because there wasn't a lot to say.

I've spent most of my time working on a side issue. That issue is the lack of a 64 bit version of the client compiler. We've actually had a mostly working version of the client compiler for sparc for quite a while but there was nothing at all on the x86 side. Without a client compiler for 64 bit there was no way to have tiered work in that environment.

One of the reasons there was no 64 bit x86 version had a lot to do with the way the original 64 bit port on x86 was done. The original implementation was done outside of Sun and the implementer went with a totally separate cpu directory (amd64) rather than having the i486 cpu directory contain both flavors much like sparc. As a result there was a lot of code duplication. Rather than expand the amount of duplication by adding a whole bunch of new c1 files I took the approach of merging the i486 and amd64 directories into a single x86 directory. Unlike sparc, where 64 bit has a very similar ABI (very close anyway) and pretty much the same number of registers, 64 bit x86 has more registers and a different ABI for calling conventions (32 bit is purely stack based, 64 bit uses a register based convention). Worse still solaris/linux have a different ABI than windows. This makes merging the files into something as simple as the sparc side fairly difficult. The initial merge and creation of the x86 cpu directory happened some months ago. This resulted in the deletion of a number of files that were basically duplicates.

Once that step was complete I set about merging the two assemblers into a single assembler and adding features to the macro assembler to allow hiding the differing sizes of pointers. For example instead of saying addl or addq when adding to a pointer sized item now there is an addptr which does either an addl or an addq as needed. This then allowed me to make the changes to the client compiler cpu specific files such that almost all of the changes were pointer size agnostic.
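Here is a minimal sketch of the addptr idea, written in Python rather than HotSpot's C++. The class name, the textual "emitted" instructions and the operands are all illustrative assumptions and not the real macro assembler interface. The point is only that the assembler knows the pointer width, so shared code never has to spell out addl or addq.

```python
class MacroAssemblerSketch:
    """Toy assembler that records instructions as text (hypothetical, not HotSpot code)."""

    def __init__(self, pointer_bits):
        if pointer_bits not in (32, 64):
            raise ValueError("pointer_bits must be 32 or 64")
        self.pointer_bits = pointer_bits
        self.code = []

    def addl(self, dst, src):
        # 32 bit add
        self.code.append(f"addl {dst}, {src}")

    def addq(self, dst, src):
        # 64 bit add
        self.code.append(f"addq {dst}, {src}")

    def addptr(self, dst, src):
        # Pointer-sized add: the caller does not need to know the word size.
        if self.pointer_bits == 64:
            self.addq(dst, src)
        else:
            self.addl(dst, src)


def emit_bump(masm):
    # Shared, pointer-size-agnostic code: the same source serves both builds.
    masm.addptr("sp", 16)
    return masm.code
```

Calling emit_bump with a 32 bit assembler records an addl and with a 64 bit assembler records an addq, without the shared code changing.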

I'm pleased to say that the changes for the 64 bit client jvm are now all complete. The changes just went out for internal review which, given the size of them, will take quite a while to process. This isn't that much of a headache at this point because, with the change from using Teamware for source control to Mercurial, we've not been able to putback (commit) any changes to the source base for weeks. This will hopefully end soon but as you can imagine there will likely be a tsunami of changes once the door is open. I'm not in a huge hurry to add to the confusion this is all likely to cause so it will still be a bit before these changes hit the OpenJDK sources.

In the meantime I should finally get around to going back to "real" work on tiered compilation. This will mean finally getting back to trying out the changes to compilation policy I've been envisioning for months. Hopefully this time things will work out better.

Jan 08 2008, 09:19:45 AM EST Permalink

20070821 Tuesday August 21, 2007

Tiered compilation update

It's been a while since I've gotten around to blogging and much longer since I've talked about the state of tiered compilation. Just a quick note about what's been happening. Almost nothing.

The only significant thing that has happened is that I've putback a fix to the jvm that will prevent the uncommon trap loops that for whatever reason seem to plague the jvm when tiered compilation is enabled. This fix should appear in B19 which should be built this week and will appear sometime later. This removes a large performance pothole that tiered has suffered with. Generally if you've been having this problem then you will typically see your app run at interpreter speed or worse. In the past you could sometimes work around this by tuning the compile thresholds but not always. Hopefully this issue is pretty much dead.

Unfortunately tiered still has performance and tuning issues. I recently looked at some logs from someone that is using tiered and not seeing the results we'd like. At some compilation thresholds they are seeing the uncommon trap loop problem but at other levels that problem is gone but they see worse performance than just using client.

Their app seems to be my worst nightmare. The app seems to be composed of lots of methods that are mostly lukewarm and not many that are significantly hot. By messing with thresholds they may be able to get better behavior but I'm not optimistic.

This has convinced me that the techniques I'm using for controlling the tiered jvm are fatally flawed. They will never be able to do the correct thing for a variety of apps without very careful and tricky tuning by the user. That just isn't going to make it. So I've been thinking about a new strategy and I've been working on prototyping it. If this prototype shows promise then I'll be blogging about it in the near future.

Aug 21 2007, 11:32:32 AM EDT Permalink

20070611 Monday June 11, 2007

Some lessons learned

I recently got assigned a bug. It got assigned to me because the stack trace that came with it showed some of the deoptimization code in the trace. (Note that the current synopsis is different than when I got it and reflects the debugging I did.) When I got it I looked at the trace and said to myself "this isn't a deopt bug, it is likely an awt/2d malloc stomp", but I looked at it anyway.

By the time I started looking at it Tom R. had already done a bunch of experiments in trying to track down which library was the culprit and whether a debug malloc would track down the problem. The results were confusing but the consensus was that if a debug malloc was used that the problem disappeared.

So the first problem for me was that the supplied testcase was running on linux and on my local network there weren't any suitable machines so I had to run this remotely. I thought this was going to be a pain since pointing the remote machine to my local display was going to be painful. First lesson learned: Years ago I asked about a way to deal with this situation but none of the people I asked had a good answer. Now thanks to Tom's suggestion I know that I can set up a vnc server on a dummy display and dispense with the network i/o. (Or on solaris I could use xvfb.) I wish I had learned that ages ago.

Once I started looking at the crashes in gdb I was even more convinced that the fault was not in the jvm. The data nearby the block that was being released didn't look like jvm data and looked suspiciously like graphics data (an array of coordinates). Unfortunately in reading the malloc sources I didn't do it carefully enough and misinterpreted what I saw. I believed that what I was seeing was a case of someone using a dangling pointer after the memory was freed rather than a more vanilla write off the end of storage type failure. (The fact that the debug malloc's didn't find it contributed to that misunderstanding.)

I thought about using valgrind but decided not to because I knew there were issues with using it with the jvm and I wasn't sure I wanted to go down that route yet. I'd save that approach if my current strategies didn't work.

I located the pieces of the 2d code that seemed to be related to the corruption and modified them to put some crumbs in the storage when they malloced/freed storage so I could at least figure out what data was suspect. This led to the next lesson. I've been working on the jvm for a long time (>7 years) but I've never actually built anything but the jvm. I've never built an entire jdk. I was hoping to keep this record intact. Well we have a tool called PRT internally that is used by jvm developers to build and test the jvm on the supported platforms and, if the run is successful, to putback the changes into the appropriate workspace. Well there is a new tool JPRT which does a similar thing for both the jdk and the jvm. We're supposed to be migrating to JPRT but I'd not used it to this point. JPRT has a lot of nice features and it allows you to specify which builds you want. So I was able to make my mods to two files in the 2d source and tell JPRT to build me a linux jdk. So I learned how to use JPRT and still managed to not build a jdk by myself.

After a few iterations of building instrumented jdks and staring at crashes I was able to track down an out of bound store in some 2d code. After looking at the code I was able to come up with what looked to be a good enough fix to at least eliminate the stomp and see what happened. The crash was gone. So my initial prediction turned out to be accurate and I turned the bug and all my notes over to someone that actually understands the code for a proper fix.

Now for the final (hard) lesson.  After I had tracked this down to a simple oob store Tom and I wondered why the debug mallocs didn't find this. So Tom decided to try valgrind on the testcase. In very short order (like maybe an hour) it had located the same piece of code I had spent a good bit of time finding and one other suspicious routine. So I definitely learned respect for valgrind and will surely use it in cases like this in the future!

 

Jun 11 2007, 11:31:54 AM EDT Permalink

20070516 Wednesday May 16, 2007

JavaOne 2007

I'm back from JavaOne where I had a pretty good time. The weather started out real nice Mon. and Tues. but by Fri. it was really cold. I need to bring a heavier jacket next time as I managed to get myself a cold. I'm sure that some of those people on the plane with me have it now. :-(

This was the 3rd time I've been to JavaOne. I wasn't listed as a speaker this time so I was sentenced to the standby (aka Sun employees) line for the sessions. I need to come up with something to talk about next year. I was even demoted out of being an alumni. :-)

As usual I attended some good talks and some not so good ones. By far my favorite talk was A Lock-Free HashTable by Cliff Click. It wasn't really java specific but it was clever and Cliff is a good speaker.

My next most favorite talk was a surprise to me. It was Chris Oliver's JavaFX Script talk. I'm mostly allergic to hype and JavaFX was getting a lot of talk. I had read his F3 blog somewhat but not really that closely. I mostly went because it was the only thing I found interesting in that time slot. It was really cool. I was very impressed by how simple it is for a programmer to produce a very nice looking gui instead of the typical looking Swing/AWT stuff. Of course the real must have is a tool that allows building this kind of interaction without really having to think about the programming language. Still he had some very sweet demos.

Another good talk was the Garbage-Collection-Friendly Programming by three members of the Hotspot JVM team (not that I'm biased or anything). I didn't really learn much, mostly because I hang around with these guys (electronically) quite a bit, but the talk had a lot of good stuff and was entertaining. They packed Gateway 102/103 and had questions until they threw us out of the room.

The Hotspot BOF was the same night as the GC talk and I think that the people that didn't get their questions answered in the afternoon session came by for a second chance at night as most of the questions were GC related with very few runtime or compiler questions. The BOF was also in Gateway 102/103 which was a terrible room for a BOF. The room holds like 1500 people and we might have had 50. As a result there was a lot of echo and it was really very hard to hear the questions. I don't know how hard it was to hear the answers, hopefully not as hard as the questions.

The final cool thing was that I had several people come up to me and read my badge and go "oh you're fatcatair!". So I guess I really have some amount of audience paying attention.

May 16 2007, 04:18:31 PM EDT Permalink

20070411 Wednesday April 11, 2007

Tiered Compilation - issues

 

I got a comment in my last entry about tiered compilation that I'd like to respond to. It is more convenient (and obvious) to do it here. Here is the comment:

Can I try to understand better your "as it works now tiered tends to do too much compilation" comment?

Do you mean that HotSpot currently attempts to compile code prematurely but that that tendency is counterbalanced in practice by the compilations being queued and thus delayed? If so, can't the compilation threshold just be put back a little to achieve the same overall where there is less queueing (more CPUs)?

Or do you mean that you see the compilation taking CPU resources away from the app itself where that may not have been the best choice? Does that still apply where the app is not able to use all the available CPU resource at the time, eg not being threaded enough to use all cores in a Niagara CPU?

So part of the problem is as Damon surmised in the first question, but there is more to it than that. If you don't adjust the thresholds (Tier2CompileThreshold) you'll certainly do too much compilation because, even though that threshold matches what server normally used, the counters are being incremented by compiled code and not the interpreter, and since the compiled code is faster than the interpreter we can hit the limit quicker. The counters do get decayed every so often but in a sense we lengthened that decay time period by running faster code. So without any adjustment you'll see a lot more server compiles queued for the same elapsed time in your app's lifetime. My experiments show that around 35k-40k is a better value than the 10k which will be the default in b12. Keeping the threshold working as expected in the default configuration (tiered compilation off) kept the tier2 threshold at this low value.
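To make the counter argument concrete, here is a toy model in Python. It is a sketch under loud assumptions: the counter is halved at a fixed interval of wall-clock "ticks", and the event rates (150 per tick for the interpreter, 500 per tick for compiled code) are invented for illustration. None of this is HotSpot's actual counter code.

```python
def ticks_until_compile(events_per_tick, threshold, decay_every, max_ticks=10_000):
    """Return the tick at which the counter first reaches threshold, or None."""
    counter = 0
    for tick in range(1, max_ticks + 1):
        counter += events_per_tick          # faster code bumps the counter more per tick
        if counter >= threshold:
            return tick
        if tick % decay_every == 0:
            counter //= 2                   # periodic decay (halving is an assumption)
    return None


# Made-up rates: interpreter 150 events/tick, compiled code 500 events/tick.
interpreter_10k = ticks_until_compile(150, 10_000, decay_every=50)   # 92
compiled_10k = ticks_until_compile(500, 10_000, decay_every=50)      # 20
compiled_35k = ticks_until_compile(500, 35_000, decay_every=50)      # 95
```

With these invented numbers the same 10k threshold trips after 20 ticks instead of 92 once the counting code runs faster, and raising the threshold to 35k brings the trigger point back to roughly where the interpreter would have hit it.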

But there is more going on than that. Some other issues have to do with how profile data is collected and how the counters work. In the current world (not my development code) the counters are used as triggers and as measures of execution counts. These work against each other because of how the triggers are implemented. So the execution counts we get are kind of blurry. Also because of the speed of the interpreter we don't normally collect full profile data (what we call MDO data) until after a method has warmed up to a degree. This keeps us from slowing everything that gets interpreted when we only need data for hot methods. This also tends to (and not entirely by design I think) cause us not to see transients as the system starts up.

Well in the current (B12) tiered mode we collect MDO data in every c1 (client compiler) compiled method. So this causes us to see different profile data than we used to see. As it turns out this has more of an effect than you might expect. One of the most important optimizations that the compiler does is inlining a virtual call. Because of profiling data we can see, to a degree, that a call site has a single receiver object type so we can inline the called method based on that observation. It is not unusual, unfortunately, to see a call site change its properties about the receiver type with the initialization of the jdk, essentially polluting the data for what is typical for the app.

So c2 (the server compiler) tries to handle this situation by tailoring the code at the call sites to try and get the proper behavior. So when predictions go bad the code is thrown out (via what we call an uncommon trap) and the code is recompiled based on new data. I was surprised to see how well tuned the system seems to be to not seeing the transients and getting the code right (in the prediction sense of right, not in the correct code sense). When c1 code is collecting the MDO data I tend to see more uncommon traps so that more compiles occur than you'd normally expect. Worse still because of how we recover in these situations you can end up with a full vtable call at a call site that is, by the time the app settles down, really monomorphic. So we end up compiling more to get worse code. :-( This is why I said that my expectation is that people that have been using -client to get good startup are likely to be happier with the current incarnation of tiered because they'll still get better final performance than c1 could deliver but that apps using server can see some loss of performance because the system doesn't get code that is as good as before.

Apr 11 2007, 09:57:00 AM EDT Permalink

20070330 Friday March 30, 2007

Tiered Compilation almost live

A few days ago I putback the change that now causes the server jvm (32bit) to be built as tiered. It doesn't actually enable the jvm to run tiered by default but both compilers are present and you can optionally ask for tiered. I'm still having trouble tuning my multi-tiers jvm and I decided that at the very least I ought to see what feedback there was from people adventuresome enough to try the tiered jvm. I expect that there are pathologies and that you may have to tune the thresholds a bit.

It looks like for the current schedule that this will end up in jdk7 build 12 which is still about a week away. If it isn't in b12 it will certainly be in lucky b13.

You can tell if you have the tiered jvm  by running -server -version and you'll see something like:

Java HotSpot(TM) Tiered VM (..., mixed mode)

 In order to actually get tiered operation you'll have to ask for it via the switch -XX:+TieredCompilation

You may also want to experiment with the compiler thresholds via -XX:CompileThreshold=<nnn> and -XX:Tier2CompileThreshold=<nnn>. Currently the thresholds are 1000/10000. I've found that 1000/35000 works ok for me. I'm curious to see where others find the best tradeoff.

Mar 30 2007, 10:37:46 AM EDT Permalink

20070111 Thursday January 11, 2007

Multi-tiers - 2

Well no one guessed the answer to yesterday's question as to why the benchmark ran so slowly. I had a couple of internal guesses and one external one. Azeem who used to work in the hotspot compiler group was pretty close. [I expect he might debate whether he was right...]

Truth is I didn't give you all the information but then it wouldn't have been that hard to guess. So the part I left out is what actually happens when we uncommon trap back to the interpreter? Well it depends on the situation to a degree because depending on the cause we want to take different actions. In this particular case the action we requested was to reinterpret for a while. In that case we make the current compiled code dead and we reset the event triggers so we will recompile at a later point.

Now in the server vm when we trap back to the interpreter we will collect more profile data. In the case that the benchmark presented we trapped because we hit a previously untaken path. In that case we will re-execute the branching bytecode and record that the previously cold path is at least more than stone cold.

So I had mentioned previously that in tiered the interpreter doesn't collect profile data. So trapping back to the interpreter won't record that the untaken path is now a possible, though improbable, path. Now since I wasn't thinking I had the tiered system, when it tripped the counters for a new compile, always compile at the next level above where we compiled last, or at server level if we had gotten there. So in this case we go server code => interpreter => server code.

So now you can see that maybe I was setup for a long set of cycles here since I didn't have a path that would mark the untaken path as taken. That would have been bad but in fact the formerly untaken path was actually still quite improbable. So although we could cycle here it was pretty infrequent so if nothing else had gone wrong the benchmark would have run slower but not as bad as what I was seeing.

Now it turns out that there were multiple uncommon traps happening in this same area of code. The site was a call site in java/util/Hashtable::get. So one of the important optimizations that happens is recognizing when a call site is monomorphic. In that case we can avoid the overhead of a virtual call and even possibly inline the target method. The profile data predicted that this call site was monomorphic. Since we can't rely on the site remaining monomorphic the compiler generates code that will uncommon trap if we find the wrong class. Now things are especially tricky here. Depending on the frequency of this we can do different things. One thing we try to do is generate code for bimorphic call sites. If we see from the profile data that two classes will cover the site then a runtime test is inserted to choose and we can do non-virtual calls for both classes, possibly inlining both targets. Similarly to the monomorphic case when we get a class we didn't expect we uncommon trap.
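The shape of a guarded call site can be sketched in a few lines of Python. This is only a model of the idea: the class names, the UncommonTrap exception and compile_call_site are hypothetical, and the real compiler emits machine code with a class check rather than anything like this.

```python
class UncommonTrap(Exception):
    """Signals that compiled code met a case it was not compiled for (sketch only)."""


class Circle:
    def area(self):
        return "circle"


class Square:
    def area(self):
        return "square"


def compile_call_site(profiled_classes):
    """Return a 'compiled' call specialised for the classes seen in the profile."""
    expected = tuple(profiled_classes)
    if len(expected) > 2:
        # Too many receiver types: fall back to an ordinary virtual call.
        return lambda receiver: receiver.area()

    def guarded_call(receiver):
        for cls in expected:
            if type(receiver) is cls:
                # Direct (non-virtual) call to a known target, which could be inlined.
                return cls.area(receiver)
        # Unexpected class: bail out of compiled code.
        raise UncommonTrap(type(receiver).__name__)

    return guarded_call
```

A site compiled from a profile containing only Circle handles Circle receivers directly and raises UncommonTrap on a Square, while a site compiled for both classes handles either one without trapping.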

So as it turns out in running this benchmark the call site goes bimorphic after it has been running a while. Now remember the interpreter wasn't generating profile data. So when we'd trap out of the monomorphic case to go bimorphic we'd end up recompiling the case as monomorphic because the profile data never changed. Now this could really slow you down because this was a relatively hot path. But we have heuristics to cover this. What was supposed to happen was the compiler would detect that we were trapping at this particular point frequently and change its view of the call. Fortunately that trap counting was happening in the runtime system and not the interpreter so we did in fact see that we were trapping at this site with an unexpected class.

Now is where we ran afoul of the heuristics. There are throttling mechanisms that will prevent us from infinite trap/recompile cycles. If the system sees what it thinks is such a cycle it compiles the code to request that we no longer ask for recompilations. We may be seeing a lot of traps which is not good (in fact it may well be better to simply interpret) but at least we aren't also wasting cycles compiling constantly. What happened is I got caught by two competing heuristics. In order to have the trapping convert from monomorphic to bimorphic the compiler looks at the trap data per method. You might expect it to do it on a byte code index (bci) basis (we have that data too). I believe that because of races in how that data is collected and the recompile to go bimorphic we use the per method limit to give up on the optimization instead of the bci based data. However the decision as to what to do at the trap (for this particular type of trap) was on a bci basis and so we'd decide to not recompile on the traps before we'd give up and simply put a virtual call at the call site.
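The stale profile problem can be shown with a small simulation. It is a sketch with assumptions: receivers are just type names, a "compile" simply snapshots the set of profiled types, and every trap triggers a recompile. The function name and the stream of receivers are made up for illustration.

```python
def count_traps(receivers, interpreter_profiles):
    """Return (traps, recompiles) for a stream of receiver type names (toy model)."""
    profile = {receivers[0]}        # types recorded before the first compile
    compiled_for = set(profile)     # types the current compiled code accepts
    traps = 0
    recompiles = 0
    for receiver in receivers:
        if receiver in compiled_for:
            continue                # fast path, no trap
        traps += 1
        if interpreter_profiles:
            profile.add(receiver)   # the trap teaches the profile about the new type
        recompiles += 1
        compiled_for = set(profile) # recompile from whatever the profile now says
    return traps, recompiles


# Monomorphic warm-up on "A", then the site alternates between "A" and "B".
stream = ["A"] * 3 + ["A", "B"] * 10
```

When the interpreter updates the profile on a trap, count_traps(stream, True) gives one trap and one recompile. When it does not, count_traps(stream, False) gives ten traps and ten recompiles, each producing the same monomorphic code again.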

Once we decided not to recompile, the performance was doomed. We'd pretty frequently be trapping back to the interpreter and finishing the execution of the method. The cost of an uncommon trap is relatively high (they are supposed to be uncommon). So we ended up running even slower than what the interpreter alone would do.

Now here's one final thing about how odd this was. While I was tracking this down I was of course using the debug version of the vm. The debug version of the jvm has tons of assertion checking code. It's a lot slower. Actually we have two versions of the debug vm: fastdebug, where the asserts are on, we optimize the c++ code and we generate symbol data for dbx/gdb, and the plain debug version where the c++ code is not optimized. Both of these vm versions easily outran the normal vm. So why was that? Well the vm was slow enough that between the time we decided to compile the method and we actually got around to compiling it the call site went bimorphic. So the initial compile got it right immediately and we didn't uncommon trap at this spot at all. The fact that the vm was running slower than normal was completely overwhelmed by compiling this call site in the way that best suited the way the benchmark was running.

Jan 11 2007, 04:59:41 PM EST Permalink

20070110 Wednesday January 10, 2007

Multi-tiers

So the tiered jvm is working pretty well but I want to see startup performance more on par with client. So I have to make some changes to try and help it out. So in the client vm the situation looks like this:

In the server vm the situation looks like:

In the tiered vm as it existed a few weeks ago the situation was:

The initial theory was that client compiler code was going to be fast enough that collecting full profile data wasn't going to be that bad. I've done some benchmarking with Alacrity and it generally isn't too bad (10% or so) although there are a couple of benchmarks where the impact is pretty severe.

So in the server vm the interpreter runs in two modes. The idea is to do the same for the client compiler and in effect add more tiers. Instead of a 3 speed transmission we go to 4 speeds to try and smooth out the startup performance differences I was seeing. Actually I've added an additional gear, the tiered system can actually create code that looks like:

Now the truth is that as I'm currently running the system I'm only using tiers > 1. The expectation is to use tier1 for special circumstances. For instance if someone finds a method miscompiles at tier 4 (server compiler) we have a way to allow them to specify that the method only reaches tier1. No sense in penalizing them further by collecting profile data we can't even use. Similarly there are methods that for various reasons (resource limits typically) the server compiler can't compile. In that situation we're better off using tier1 instead of being trapped in the interpreter like the current server vm is.

So I added all these tiers and I tried it out. It was pretty disappointing, worse than the previous tiered system. It was clear that I was not able to control what compiles were happening and when. After some analysis it was clear what was going on. For various reasons the counters that are used for triggering compiles and for profiling information are partially shared. This was always a compromise at best in the other vms but in tiered I found that I just couldn't reason about how changes to triggering would influence my profile results. I really wanted triggering data to be separate from profiling data.

This was kind of scary. I wasn't the first hotspot developer to see that this overloading made for hard to predict changes in behavior. The current system has been tuned over a long time to get the kind of performance we want. No one wanted to mess with it for fear of spending an inordinate amount of time getting the performance where we wanted with a saner system. Fortunately I'm not that smart so I'm changing it. :-) Actually I don't think I had much choice but it was nice to know that others found the counters not entirely rational.
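A toy comparison shows why a counter that doubles as a trigger gives blurry execution counts. This is a sketch in Python under an assumption: the trigger is modeled as "reset the counter when the threshold is reached", which is only one way the two roles could interfere and is not taken from the HotSpot sources. The class names are hypothetical.

```python
class SharedCounter:
    """One counter doubles as compile trigger and execution count (toy model)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0
        self.compiles = 0

    def invoke(self):
        self.count += 1
        if self.count >= self.threshold:
            self.compiles += 1
            self.count = 0          # resetting the trigger also wipes the execution count


class SplitCounters:
    """Trigger and execution count kept apart, so each can be reasoned about alone."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.trigger = 0
        self.executions = 0
        self.compiles = 0

    def invoke(self):
        self.executions += 1        # profiling data: never touched by triggering
        self.trigger += 1
        if self.trigger >= self.threshold:
            self.compiles += 1
            self.trigger = 0        # only the trigger is reset


shared, split = SharedCounter(1000), SplitCounters(1000)
for _ in range(2500):
    shared.invoke()
    split.invoke()
```

After 2500 invocations both models have requested two compiles, but the shared counter reads 500 while the split version still reports 2500 executions.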

So I split the triggering mechanism out completely. Pretty much immediately I got back to where I was with the initial tiered system and some benchmarks looked a little better. However I had one benchmark that used to run in 5.5 to 6.0 seconds that was now taking more than 400 seconds! What was up with that?

[ I know the answer to that question and I'll answer it next time. I've left clues as to what the problem is. See if you can figure it out. No prizes though... ]

Jan 10 2007, 03:22:20 PM EST Permalink

20070108 Monday January 08, 2007

Tiered compilation - where is it?

It's about time I wrote about what I've been working on (although I've been getting requests for airplane building updates too). I've been trying to get the tiered jvm built as the standard vm to replace the server vm. My intention was to get the tiered system visible so we could get feedback on it sooner rather than later. In order to do that I needed to show that people wouldn't have a heart attack over performance. So as a result I'd been running benchmarks.

The good news is that using our internal benchmarking system (Alacrity) the server results were pretty much identical with the tiered vm as with the server vm. The client results which track startup performance weren't quite so good. Now that wasn't really bad for me because I only wanted to replace the server vm with the tiered vm at this point. In the future I want the client vm to go away too but given the state of the tiered vm it wasn't too surprising.

The Alacrity results for startup though were pretty bad so I spent some time investigating them. The suspicion of course is that now that the client compiler is generating code to track execution profiles that would explain it. So I built a client vm that would collect profiles via new switch(es) (switches because I wanted to see what different profile tracking code was the most expensive) and ran the tests again. Well it definitely showed that this impacted client performance but to nowhere near the extent I was seeing with the tiered vm. The impact was pretty modest, 6-10% or so for most things, although there was one benchmark that was impacted pretty drastically.

So one other suspicion was that we didn't have code that allowed a thread that was executing a hot loop in client code to OSR (on stack replace) its way into server code. So I added that capability. That had no real impact. So although we'll need that capability at some point I didn't get anywhere adding that code.

The odd thing was that when I was running tests locally on my workstation I wasn't really seeing the same kind of results that the Alacrity results showed. It finally dawned on me that the difference was probably class data sharing (CDS). The client vm supports class data sharing so that classes used during vm initialization are pregenerated when the vm is installed and later client runs can simply load the class data with a good bit less work. This helps startup time. The server vm, because it uses a different garbage collector, doesn't support CDS. Since when I work on the vm I rarely install it like a normal user would, my client vm didn't have a shared archive to load.

So I modified the tiered vm so that when it was run as a client vm (which it was when it was tested in Alacrity) it would use the same garbage collector as the normal client vm and therefore allow dumping/loading of the shared archive. This was pretty much the missing piece in the performance puzzle. Between the lack of CDS in the tiered vm and the cost of collecting profile data in the client compiler's generated code I could account for the performance gap I was seeing compared to the vanilla client vm.

So the next important things to attack are obvious:

    1: get the garbage collector the server vm uses to support CDS. That is not something I have any expertise at so someone else will be doing that work. (My understanding is that it is doable but that it just wasn't seen as that important for server where startup performance was not that key.)

    2: see what I can do about the cost of profiling in the code the client compiler generates. That will be a topic for another day...


Jan 08 2007, 10:06:09 AM EST Permalink

20060901 Friday September 01, 2006

Least favorite bug

It seems only fair that if I wrote about my favorite bug report that I also take on my least favorite. That bug is this one. Now I'm not really complaining about the original submitter or even the bug itself. There is an issue about the bug itself, how hotspot reports bugs, and a few of the people responding to the bug.

This bug has a sort of interesting history. I got assigned this bug because I'm pretty much "mr. deopt" around here. Now unfortunately this bug initially had no testcase which always makes it hard to debug. If you decode the error id in this bug ( 434F444523414348450E43505000CF ) it decodes as:

codeCache.cpp, 207

which tells the line of source where the fatal error was detected. In this case that line is this code in CodeCache::find_blob():

guarantee(!result->is_zombie() || result->is_locked_by_vm() || is_error_reported(), "unsafe access to zombie method");

Now several things here. One I really hate those stupid error id's. Just useless in my opinion. When I first started working on hotspot I argued quite a long time against them. I don't see any benefit in them. The source file line number is more useful (to a degree) and much easier to remember or search for. Even the source/line number has its issues and in the case of this bug it has many issues (more than 80 in fact).
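For the curious, the id above can be decoded mechanically. The sketch below is in Python and rests on a layout inferred from this single example, not from the HotSpot sources: every byte except the last two is a file name character with 0x20 subtracted, and the last two bytes are the line number, high byte first. The function name is made up.

```python
def decode_error_id(error_id):
    """Decode a hex error id into (file name, line), assuming the inferred layout."""
    raw = bytes.fromhex(error_id)
    name_bytes, line_bytes = raw[:-2], raw[-2:]
    # Assumption: each file name character was stored with 0x20 subtracted.
    filename = "".join(chr(b + 0x20) for b in name_bytes)
    # Assumption: the trailing two bytes are a big-endian line number.
    line = int.from_bytes(line_bytes, "big")
    return filename, line


print(decode_error_id("434F444523414348450E43505000CF"))  # ('codeCache.cpp', 207)
```

Which is a lot of work to recover what a plain "codeCache.cpp, 207" would have said in the first place.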

So the first thing I wanted to do when I got this bug was try and get a test case or even a stack trace. You would not believe how hard it was to do that.
In those days because of either privacy issues or lame database access or whatever it was pretty much impossible to get the email address of the initial submitter. In the meantime I'd get email every time someone "voted" for this bug.

Eventually I was able to get in touch with the initial submitter and got a stack trace. The stack trace made it obvious what the bug was for this submitter. In 1.5 the problem was fixed by the switch from eager-deopt to lazy-deopt. In 1.4.2 it was a pretty simple source change that went into 1.4.2_06.

So why is this my least favorite bug? Well if you read the bug you can see a fair number of people reporting this errorid and claiming the bug is not fixed. So what's up with that? Well I showed you the line of code with the guarantee call. If the guarantee in CodeCache::find_blob fails you will get this error id. Do you know how many places in hotspot call find_blob? In 1.4.2 there are 80+ locations. That means a mistake in 80 separate pieces of code will all give the same error id. So here is a thing I dislike about the error reporting in hotspot. If the line number reported were the id of the caller of find_blob then you might potentially get 80 some distinct bugs where the current reporting system makes it look like there is just one bug. Of course really there is even wider fanout than this but at least getting the caller source id instead of find_blob's id would be a vast improvement.
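To make the fan-in point concrete, here is a tiny Java analogy (the real code is C++ inside the vm, so this is only an illustration). A shared check that reports its own location looks identical no matter who called it; one that reports its caller separates the failures. The names checkedLookup, callerA and callerB are hypothetical, and StackWalker needs Java 9 or later.

```java
// Sketch: attribute a failed shared check to its caller instead of to itself.
// ASSUMPTION: checkedLookup stands in for a shared routine like find_blob();
// none of these names come from the hotspot sources.
final class CallerAttribution {

    static String checkedLookup(boolean ok) {
        if (ok) {
            return "ok";
        }
        // Frame 0 is checkedLookup itself; skipping it yields the caller.
        StackWalker.StackFrame caller = StackWalker.getInstance()
                .walk(frames -> frames.skip(1).findFirst())
                .orElseThrow();
        return "check failed, called from " + caller.getMethodName();
    }

    // Two different call sites that would otherwise share one error id.
    static String callerA() { return checkedLookup(false); }
    static String callerB() { return checkedLookup(false); }

    public static void main(String[] args) {
        System.out.println(callerA());
        System.out.println(callerB());
    }
}
```

With the caller in the message the two failures are obviously different bugs, which is exactly what the single find_blob id hides.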

Now do I fault the jvm users that get confused by this and claim that I didn't fix the bug? Well not initially, at least not before I explained this issue somewhat in the bug report. I certainly admit it annoyed me. Fortunately once the bug got closed I stopped getting email for each vote which helped my attitude. :-)

What really annoyed me about this bug though was this report added to it later:

The similar error occur in the folowing environment
Linux version 2.4.22-1.2115.nptl
(bhcompile@daffy.perf.redhat.com) (gcc version 3.2.3
20030422 (Red Hat Linux 3.2.3-6)) #1 Wed Oct 29
15:42:51 EST 2003

This is a error log
#
# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.2_03-b02 mixed mode)
#
# Error ID: 53484152454432554E54494D450E435050014F
#
# Problematic Thread: prio=1 tid=0x4f2ceec8

Now notice the error id. Is it the same? Not even remotely close. So what led "malku" to say it was similar? Because it was a crash? I got no idea. Here's what is the worst thing about this bug. The error id decodes to:

sharedRuntime.cpp, 335

in this code:

guarantee(cb != NULL, "exception happened outside interpreter, nmethods and vtable stubs (1)");

(right after a find_blob call BTW).

Now this line of code in the vm is the mother of all duplicate bugs. If I thought the original bug with its 80+ places to fan-in was bad, this one is worse since it has an uncountable number of places.

This guarantee will fail whenever there is an unrecognized exception (not a Java exception, a machine level exception like segmentation fault [SEGV], divide by zero etc.) while a thread is executing in "Java" mode and somehow ends up in code we don't recognize and faults. Generally we leap into space and execute random code. There is simply no telling what is the initial cause. For this error id there can easily be hundreds or thousands (ok let's hope we don't have thousands) of initial culprits.

So now not only do I have people with very possibly different bugs for the original id complaining, because malku plugged a different error id into this bug and people Google that error id I have all those failures piling on also. It was just a mess.

Now at least thanks to writing this blog entry when people Google that ...14F id they'll probably get a hit here and at least see an explanation of what is going on. So some small amount of good will come out of it.

Now the good news is that error reporting is much better in 5.0 and 6.0. For this ...14F type of error we no longer report that poor piece of blameless code in sharedRuntime. Now we report the pc where the actual error happened and some clues about the stack and registers at the time of the fault. So at least the issues with ...14F are history. Unfortunately at this point the guarantee in find_blob is still doing its best to confuse the innocent about the root cause of the error. Hopefully in 7.0 things will be improved here too.


Sep 01 2006, 03:09:18 PM EDT Permalink

20060822 Tuesday August 22, 2006

Microbenchmark Adventure

So I've been working on a new optimization for Hotspot that improves the speed we can call from Java (byte codes) to a Java Native method. There is unfortunately a lot more overhead on that path than we'd like and we take a fair amount of heat for it both internally and externally. At the moment the optimization is only available to the server compiler while I see how worthwhile the change is.

As part of investigating the performance of this I came across this web page and the accompanying benchmark. So I downloaded it to see what my optimization did for this test. I'm always quite dubious of microbenchmarks because they often don't measure what the author expects because the compiler or OS performs in unexpected ways. Quite often the compiler is able to remove work that could never be removed in a real app and so the benchmark predicts more performance than you actually get. In this case the benchmark turns out to be flawed, but in the opposite direction, predicting less Java performance than you might expect in a real app.

Well the first thing I had to do was to fix the SunOS.jmk file since as received it didn't really work on Solaris. So then I ran it and I could see that I got a boost but that the results were kind of noisy. So the benchmark has an option to run it multiple times ( -n <n> ) so I ran it five times to try and average out the noise.

So I observed a very strange thing. Here is a snippet of the results (without my JNI optimization):

Throughput:
        JavaRowConsumer 515583
        FineGrainedJNIRowConsumer       51986
        CoarseGrainedJNIRowConsumer     79021
        BytePackedJNIRowConsumer        106115
        SocketRowConsumer       66822
end throughput.
Throughput:
        JavaRowConsumer 203963
        FineGrainedJNIRowConsumer       32793
        CoarseGrainedJNIRowConsumer     72642
        BytePackedJNIRowConsumer        117230
        SocketRowConsumer       72205
end throughput.
...


So that's kind of strange. Why did JavaRowConsumer and FineGrainedJNIRowConsumer slow down so dramatically on the second iteration? I wondered if maybe something was broken in the benchmark so that the SocketRowConsumer was still doing work and competing against the next iteration. So I reversed the order of the controlling loop. Then I got:

Throughput:
        SocketRowConsumer       86094
        BytePackedJNIRowConsumer        151618
        CoarseGrainedJNIRowConsumer     77717
        FineGrainedJNIRowConsumer       31748
        JavaRowConsumer 195528
end throughput.
Throughput:
        SocketRowConsumer       73018
        BytePackedJNIRowConsumer        116724
        CoarseGrainedJNIRowConsumer     73709
        FineGrainedJNIRowConsumer       32610
        JavaRowConsumer 198806

So now JavaRowConsumer and FineGrainedJNIRowConsumer are consistent but SocketRowConsumer and BytePackedJNIRowConsumer seem to have the disease. So then I really read the code to the simulator. Then the light went on. In order to make the Simulator simple there is an Interface for a RowConsumer and there are five implementors of that interface. The driver for producer/consumer is passed different objects depending on the subtest that is happening.

So what happens is this. While the benchmark is heating up all the calls for the interfaces in RowConsumer are for a single type (whichever is first in the output above). So when we finally compile the code the call site uses that prediction and compiles for that single type (it can even inline it). Then the usage pattern changes and we go from monomorphic to bi-morphic and have to recompile. At that point I suspect we're prepared for a bimorphic call site. Later still that gets proven incorrect and we go mega-morphic and have to do a straight v-table call. At that point the results stabilize at a much lower value.
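If you want to watch this happen yourself, a sketch along the following lines should do it. It is not the benchmark's code: RowConsumer is the interface name from the benchmark, but the three implementors and the pump method here are made up, and the timings you see will depend on your jvm and machine.

```java
// Sketch: one interface call site fed first one type, then two, then three.
// ASSUMPTION: First/Second/Third and pump() are illustrative stand-ins, not
// the benchmark's own classes.
final class CallSiteShapes {

    interface RowConsumer { long consume(int row); }

    static final class First implements RowConsumer {
        public long consume(int row) { return row; }
    }
    static final class Second implements RowConsumer {
        public long consume(int row) { return 2L * row; }
    }
    static final class Third implements RowConsumer {
        public long consume(int row) { return row + 1L; }
    }

    // The single call site under study is c.consume(i).
    static long pump(RowConsumer c, int rows) {
        long sum = 0;
        for (int i = 0; i < rows; i++) {
            sum += c.consume(i);
        }
        return sum;
    }

    static long timedPump(String label, RowConsumer c, int rows) {
        long start = System.nanoTime();
        long sum = pump(c, rows);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(label + ": " + micros + " us");
        return sum;
    }

    public static void main(String[] args) {
        int rows = 5_000_000;
        RowConsumer[] order = { new First(), new Second(), new Third() };
        // Round 0 starts monomorphic; by the end of it the site has seen 3 types.
        for (int round = 0; round < 3; round++) {
            for (RowConsumer c : order) {
                timedPump("round " + round + " " + c.getClass().getSimpleName(), c, rows);
            }
        }
    }
}
```

The thing to look for is whichever type runs first in round 0 compared with the same type in the later rounds, after the call site has seen the other implementors.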

What's interesting is how dramatically the results change. The microbenchmark does so little work in the calls that how we do the call dramatically changes the results. This is probably not realistic. In fact the comparison to C is not particularly fair since the C code doesn't have to virtual dispatch since it isn't object oriented and there is in essence only a single implementor. So the results on the web page would seem to be skewed by that. To be fairer the C code should be C++ and have multiple implementors. But there is still more.

So in order to get something stabler I modified the benchmark. What I did was really lame. Essentially I changed the main driver (Pumper::pump) to have a switch statement where there were 6 case alternatives, one for each type and a default. Then I had each type provide an identifier that I could switch on. This made the code for the method much larger but it ensured that each call site would be monomorphic throughout the run. It's ugly but sure enough the results were pretty similar for each iteration and always faster throughput for each subtest compared to the original benchmark. This shows that the benchmark was not very accurate at measuring what it was intending to measure.
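The shape of that change is roughly the following. This is a cut-down sketch with two types instead of five, and kind(), JavaLike and JniLike are names I'm inventing here; the real Pumper::pump does more than sum values.

```java
// Sketch: duplicate the loop per type so each consume() call site only ever
// sees one receiver class.
// ASSUMPTION: kind() and the two implementors are hypothetical stand-ins for
// the identifier and the consumers in the modified benchmark.
final class MonomorphicPump {

    interface RowConsumer {
        int kind();
        long consume(int row);
    }

    static final class JavaLike implements RowConsumer {
        public int kind() { return 0; }
        public long consume(int row) { return row; }
    }

    static final class JniLike implements RowConsumer {
        public int kind() { return 1; }
        public long consume(int row) { return 2L * row; }
    }

    static long pump(RowConsumer c, int rows) {
        long sum = 0;
        switch (c.kind()) {
            case 0:
                // This call site only ever sees JavaLike.
                for (int i = 0; i < rows; i++) sum += c.consume(i);
                break;
            case 1:
                // This call site only ever sees JniLike.
                for (int i = 0; i < rows; i++) sum += c.consume(i);
                break;
            default:
                // Anything unexpected lands here, so it cannot pollute the
                // profiles of the two sites above.
                for (int i = 0; i < rows; i++) sum += c.consume(i);
                break;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(pump(new JavaLike(), 1_000));
        System.out.println(pump(new JniLike(), 1_000));
    }
}
```

Note that c.kind() is itself a call that sees every type, but it runs once per pump rather than once per row, so it doesn't matter.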

Well I wanted to see if I could get the noise level down further by running the subtests longer. So I ran each subtest for 12000 milliseconds. Another surprise. The final report showed throughput noticeably slower than the same five runs with a shorter subtest interval. WTF? Back to the source code where I found a bug. The report code when it computes the final throughput number throws out the 1st result and the last result and then computes the average rows per second. There isn't anything wrong with throwing out the samples but in computing the averages integer arithmetic was used and depending on the interval you pick truncation could inflate your results. That's just what happened with the short interval. I modified the code to use floats and then sanity returned. So I was almost happy with the results.
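Here is the kind of thing that goes wrong. I'm not reproducing the report code itself; this is an assumed form of the calculation, where the interval in milliseconds gets turned into whole seconds with integer division before the rows are divided by it.

```java
// Sketch: how integer truncation can inflate a rows-per-second figure.
// ASSUMPTION: the exact expression in the benchmark's report code is not
// shown in this entry; these two methods are an illustrative guess.
final class ThroughputAverage {

    // 1500 ms truncates to 1 s, so the result is too high.
    // (An interval under 1000 ms would divide by zero here.)
    static long rowsPerSecondTruncating(long rows, long elapsedMillis) {
        return rows / (elapsedMillis / 1000);
    }

    // Same calculation in floating point.
    static double rowsPerSecond(long rows, long elapsedMillis) {
        return rows / (elapsedMillis / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(rowsPerSecondTruncating(300_000, 1_500));
        System.out.println(rowsPerSecond(300_000, 1_500));
        System.out.println(rowsPerSecondTruncating(2_400_000, 12_000));
        System.out.println(rowsPerSecond(2_400_000, 12_000));
    }
}
```

With an interval that is a whole number of seconds, like 12000 ms, the two versions agree, which fits with the long run looking slower than the short ones.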

There was just one thing that bothered me. On the original web page there was the claim that the "Coarse" JNI subtest ran slower than the "FineGrained" JNI subtest. Even my results showed the same thing (with my JNI optimization):

        JavaRowConsumer 503992
        FineGrainedJNIRowConsumer       92880
        CoarseGrainedJNIRowConsumer     66331
        BytePackedJNIRowConsumer        57715
        SocketRowConsumer       73069

That is kinda weird. I know how expensive that Java->Native transition is and it is hard to understand how a benchmark that presumably does fewer transitions would be slower than one that does more. So I read the code for the two subtests. The FineGrain benchmark is certainly simpler in that for each transfer it does the transition and the C code stores the transmitted value. The "Coarse" benchmark on the other hand transfers every item from one Java array to another one (essentially duplicating JavaRowConsumer) and then when it is complete calls to native. The native code then does JNI calls to extract the data across the barrier. In the case of the two dimensional byte array it does two JNI calls per row. Now those JNI calls are very similar in cost to Java->Native transitions. So in effect the "Coarse" benchmark does even more JNI calls than the "FineGrained" version. No wonder it suffers.

So just for the heck of it I wrote a new subtest, RealCoarseGrainedJNIRowConsumer. On each transfer of a row of the 2-d byte array I packed it into a Java 1-D array that matches what the C side would use. Then like the original "Coarse" benchmark at the end of the loop it makes a single call to JNI. Now however on the native side I do one JNI call to transfer all the byte data. Now the results look like:
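The Java side of that idea looks something like this sketch. The class name, accept() and flush() are made up, and flush() only counts a crossing and hands back the packed bytes where the real subtest would make its one native call (on the C side a single bulk copy, for instance JNI's GetByteArrayRegion, would then pull everything across, though that choice is my assumption).

```java
import java.util.Arrays;

// Sketch: pack each row into one flat byte[] and cross the boundary once.
// ASSUMPTION: RealCoarsePacker, accept() and flush() are illustrative names;
// flush() is a pure-Java stand-in for the single native call.
final class RealCoarsePacker {

    private final byte[] flat;   // 1-D layout matching what the C side expects
    private int used;
    int crossings;               // how many times we "went native"

    RealCoarsePacker(int capacityBytes) {
        flat = new byte[capacityBytes];
    }

    // Called once per row: a plain array copy, no JNI involved.
    void accept(byte[] row) {
        System.arraycopy(row, 0, flat, used, row.length);
        used += row.length;
    }

    // Called once at the end of the loop.
    byte[] flush() {
        crossings++;
        byte[] out = Arrays.copyOf(flat, used);
        used = 0;
        return out;
    }

    public static void main(String[] args) {
        RealCoarsePacker p = new RealCoarsePacker(16);
        p.accept(new byte[] {1, 2});
        p.accept(new byte[] {3, 4});
        System.out.println(Arrays.toString(p.flush()) + " crossings=" + p.crossings);
    }
}
```

The point is just that the per-row work stays on the Java side and the number of boundary crossings no longer grows with the number of rows.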

        JavaRowConsumer 510581
        FineGrainedJNIRowConsumer       94501
        CoarseGrainedJNIRowConsumer     67844
        RealCoarseGrainedJNIRowConsumer 167878
        BytePackedJNIRowConsumer        57726
        SocketRowConsumer       73522


That makes a lot more sense. Now there is just one more thing to check. How do we compare to C? Note: in all of the benchmark results above the results are not in rows per second. They are number of rows moved in the timing period which is what each sub-run reports. In this section the results are in rows per second so they are comparable with the C results. So here is rows per second for the Java version:

Throughput in rows per second (bigger is better)
JavaRowConsumer 426398
FineGrainedJNIRowConsumer       78782
CoarseGrainedJNIRowConsumer     56141
RealCoarseGrainedJNIRowConsumer 139440
BytePackedJNIRowConsumer        48019
SocketRowConsumer       59574

and here is the CSimulator result (-O4)

gretch(pts/52)[1466] CSimulator
Throughput in rows per second: 195000

How about that, JavaRowConsumer clobbered the C version! My RealCoarse version came in at 70% of the speed of C and the FineGrain version came in at 40% of the speed. So while using JNI for this test is slower than C it isn't as bad as the benchmark originally predicted. Why does the Java only version outperform C? Well by the time we compile Pumper::pump the profiling the interpreter has done enables us to predict that on one of the case alternatives we always see RowProducer and JavaRowConsumer and so the compiler inlines the code and all the call overhead is removed. We're left with just the data movement. This is an advantage of Java. Because the compiler in the JVM is able to optimize the program based on the way it is actually running it can do better optimization than a static compiler can do. Java will take a performance hit during startup but if we run long enough (and long isn't really that long) we often can outperform C or C++ code in equivalent apps.

Now what about the JNI performance? Obviously this benchmark shows that there is an overhead in calling from Java->native and that it is more expensive than a C->C call. We'd like that to be more competitive (the optimization I'm working on saves ~15%) but is this benchmark realistic of real apps? Probably not. Look at how much the results changed when the Java call went from monomorphic to megamorphic. A huge change. This shows that the work being done on the native side was practically nil. In real apps I'd hope that your typical JNI call does a little more work than a single store and return. When that is true the cost of the transition is not nearly the limiting factor that this benchmark makes it appear to be.

So I believe this shows another example of why microbenchmarks should always be viewed with distrust. It is very easy to be fooled unless you are very familiar with the underlying compiler and runtime system. I've fooled myself plenty and I'm very familiar with the tricks the compiler and JVM are capable of.

My favorite microbenchmark story comes from my days as a compiler developer at Encore Computer. In those days (late '80s early '90s) there were benchmarks from AIM Technology called Aim-2 and Aim-3. The Encore C compiler would remove huge amounts of work from the benchmark. I always loved it when the benchmark would report that our machine was capable of performing more than 100 floating point operations per cpu clock cycle! Too bad the customers never saw that kind of performance. Now in the example today instead of predicting more performance than you can expect it actually predicts the opposite. When it comes to benchmarks, especially microbenchmarks, YMMV.

Now even though I think this benchmark is misleading I saved a copy of the modified benchmark here. This is so you can inspect the difference between my version and the original. Don't buy a computer language based on it. :-)


Aug 22 2006, 10:19:38 AM EDT Permalink

20060816 Wednesday August 16, 2006

My favorite bug report

Many years ago I got assigned this bug. It is my favorite bug report of all the ones I've ever been assigned. This report was really impressive. A programmer at the company involved (a dot-com victim) actually downloaded the Hotspot source and built an instrumented jvm and got quite a lot of information about the failure. Certainly enough info to figure out a workaround.

When I first read this report I studied the sources and thought I had figured out the timing window that the bug exposed. I modified the jvm to make the window larger and tried and tried to reproduce the bug to no avail.

Well as it turns out the company reporting the bug was only about 10 miles away. So I called the developer (Justin) up and we arranged for me to come and see the failure at their site. When I got there we were unable to get the problem to reproduce. Probably spent most of the day before I left. Justin promised to get it to reproduce and call me when it happened and I'd drive back over. Well a week later he still couldn't reproduce it.

I probably kept a workspace devoted to this bug alive for 2 or 3 years. The bug was finally assigned to someone else and eventually closed for being not reproducible. That was pretty disappointing given the initial report.

We did find and fix a couple of other bugs that were similar but none that had the symptoms as tracked down by Justin. The good news is that the type of failure that happened in 1.3.1 isn't possible with the changes we made to the safepointing mechanism since Java 5.0. Still it was a great bug report.

Maybe next up is my least favorite bug report?
Aug 16 2006, 04:01:59 PM EDT Permalink

20060815 Tuesday August 15, 2006

Calling Conventions - part 2

Well by now you've probably given up hope I'd ever finish the story of calling conventions and why -Xbatch works differently in Java SE 6.0. What with a couple of weeks of vacation and catching up a month just flew by. Time to finish up.

As I said earlier, in the previous release of Hotspot we used to have adapter frames whenever we called between (jit) compiled code and interpreted code or the reverse. In the 6.0 release we no longer have adapter frames. Now if you've been paying attention (which is hard given how infrequently I blog) you'd remember that we clearly must have to transform parameters on these transition paths so how does it happen? Well I said we no longer have adapter frames, not that we no longer have adapters. We still have the adapter code; it just doesn't leave a frame around for the stack walkers to see. This is good because the more frame types we have the more testing for various types we have in the code and so less is better.

Now in 6.0 because we are also working toward a tiered compilation system we moved the generation of adapter code out of the compilers and now they are generated as needed by some simple signature based shared code. What's different is how they have to go about massaging the parameters. Here's a picture:
 

So we see that in this case the arguments were massaged by extending the frame of the caller and not by introducing a new frame that the adapter controls. As a result when we are doing stack walking the stack pointer that is reported for the compiled frame is SP(1)' and not SP(1). Now this extension causes a problem because the compiler reports all of its usage of stack based locations as offsets from SP(1) and without some extra magic we'd look in the wrong stack locations for these compiler variables but we solved this problem long ago and that isn't the issue here.

What is the issue is the race we talked about earlier. Now imagine that we finished executing the adapter code and the arguments are all ready for the interpreter and we go to start interpreting and now we find that we have compiled code for the method. What do we do? In the old adapter frame days we could simply call an i2c adapter and go on. Then what happens when we get to the compiled code and find that we've been unlucky and the code is now invalid and we have to interpret? Well in the old days we could simply unwind the i2c frame and be back to the c2i frame and we could continue.

With no frame left behind we can no longer simply unwind. Well we could unwind but it certainly isn't simple. So seemingly our choice is to call the c2i adapter again. This would in fact work but it has the potential of growing the stack without limit in extreme, unrealistic situations. Well the jvm has to continue functioning in those situations and crashing by running out of stack space is not a good idea. Now another choice is to somehow do this transition in place with the stack extension we already have. That is doable and in fact the licensee that first did frameless adapters did just that. This worked for the architecture in question which was register rich but for something like x86 it was not practical. So instead we came up with a new invariant. "If you reach the interpreter, you will interpret". So the infinite c2i/i2c chain can never occur. One c2i and you're done.

Now in most situations (without -Xbatch) this is really like saying you just missed seeing the compiled code being installed in the old system since you lost the race. Except for the case of -Xbatch. With that mode we'd reach the interpreter and if a compile was triggered we'd wait until it completed and then run compiled. So -Xbatch in effect always let you win that race. In the new system that would violate our invariant and so while we wait for the compile to complete we don't get to execute the compiled code on that particular invocation event. We'd execute it on each subsequent call to the method so performance isn't truly impacted, but it can show a difference in behavior in comparison to older jvms.
Aug 15 2006, 12:55:32 PM EDT Permalink

20060623 Friday June 23, 2006

Calling Conventions

Well I actually got a comment saying that they were interested in the arcane reason -Xbatch works slightly differently in mustang. This is kind of long so it will take multiple entries before it is complete.

So the reason is tied to the calling conventions we use when a Java method calls another method. Now it is not too surprising if you think about it that the interpreter uses a stack based calling convention. Since the operations in the Java Virtual Machine are defined in terms of stack operations it makes sense. One of the other key points is how "locals" are created and used. The JVM spec. says that the N parameters we pass to called method (the callee) become the first N locals of the callee (Locals[0] .. Locals[N-1]). Now a method can and will usually have more locals than it has parameters. So for efficiency in accessing locals the interpreter will want to have the locals contiguous in memory. Imagine how slow the interpreter would be if every local access looked like:

   if (local_num >= num_of_parms)
       ... locals[local_num - num_of_parms] ...
   else
       ... parms[local_num] ...

It would be pathetic. So clearly we want this in contiguous space. So this implies two things: interpreter wants to see parameters in memory and all locals must be contiguous. The first condition has an impact on the compilers.
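A toy version of the contiguous layout, to contrast with the branchy access above. This is only a sketch: it uses int slots for everything, the class name is made up, and it copies the arguments into the frame where a real interpreter would arrange things so no copy is needed.

```java
// Sketch: parameters become the first N slots of one contiguous locals array,
// so every local access is a single indexed load or store with no branch.
// ASSUMPTION: InterpreterFrameSketch is a hypothetical name and int-only
// slots are a simplification.
final class InterpreterFrameSketch {

    private final int[] locals;

    InterpreterFrameSketch(int[] args, int maxLocals) {
        locals = new int[maxLocals];
        // Locals[0] .. Locals[N-1] are the N parameters.
        System.arraycopy(args, 0, locals, 0, args.length);
    }

    int load(int n) { return locals[n]; }

    void store(int n, int value) { locals[n] = value; }

    public static void main(String[] args) {
        InterpreterFrameSketch f = new InterpreterFrameSketch(new int[] {7, 8}, 4);
        f.store(3, f.load(0) + f.load(1));
        System.out.println(f.load(3));
    }
}
```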

Now it is clear that the interpreter wants parameters passed in memory. It is also reasonably clear that the compiler wants to pass parameters in registers if at all possible. Compilers try to avoid memory accesses because memory is just slow. So how does a compiled Java method call an interpreted Java method or vice versa? Well your first thought might be that the compiler knows it is calling an interpreted method so it should just put the parameters where the interpreter expects them. Wrong!

This is a dynamic environment. Even if at the point the compiler was creating the caller's code it knew the callee was interpreted, by the time the call executed we may now have compiled code for that callee. We sure don't want to always have that path run interpreted so we'd have to recompile the caller to now use the compiled calling convention. Oh wait it's a dynamic environment, by the time we execute the call the system has decided for whatever reason that the callee's compiled code is invalid and now we must interpret (at least until a recompile is complete). Now what do we do? Clearly we're getting nowhere with this approach.

Here's where adapters come in. So we produce small pieces of code that convert from compiled convention to interpreted ( C2I adapter ) or the reverse ( I2C adapter ). One thing to realize is we don't really need a separate piece of code for each one of these. We need a unique one for each signature we see. So two methods with the identical signature can share the same adapter. As it turns out prior to Mustang the server compiler would actually produce little code blobs this way and share them. The client compiler would actually embed the I2C adapter code in the code for the method. So for the rest of this entry I'm going to be talking about how things looked when using -server.
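The sharing is nothing more exotic than a cache keyed by signature. A sketch, with made-up names and a plain string standing in for whatever the vm really uses as the key:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: one adapter per distinct signature, shared by every method that
// has that signature.
// ASSUMPTION: AdapterCache, Adapter and the use of a descriptor string as
// the key are illustrative; the real vm generates machine code blobs.
final class AdapterCache {

    static final class Adapter {
        final String signature;
        Adapter(String signature) { this.signature = signature; }
    }

    private final Map<String, Adapter> c2i = new HashMap<>();

    // Create the adapter the first time a signature is seen, reuse it after.
    Adapter c2iFor(String signature) {
        return c2i.computeIfAbsent(signature, Adapter::new);
    }

    int size() { return c2i.size(); }

    public static void main(String[] args) {
        AdapterCache cache = new AdapterCache();
        Adapter a = cache.c2iFor("(II)I");   // e.g. int add(int, int)
        Adapter b = cache.c2iFor("(II)I");   // e.g. int max(int, int)
        System.out.println(a == b);          // same adapter is shared
    }
}
```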

So prior to Mustang when we would make one of these kind of call transitions the adapter code would actually leave behind a frame. If this is a new concept to you then you might want to stop reading now and wait for another episode of airplane building but I'll try and make it clear enough for the intrepid.

So every time a method is called the runtime environment gets modified so that we can allocate stack space for local variables, save registers the caller expects us not to destroy (like the stack pointer the caller was using) and what program counter we need to return to for execution in the caller. This space is the "activation frame" or frame for those of us that don't type well. Now for many different reasons (garbage collection being a very frequent one) the virtual machine needs to be able to walk the stack and identify all the frames. So a piece of the calling convention is the notion of how frames are laid out so that if I have the stack pointer ( SP ) and the program counter ( PC ) of the youngest frame (the method currently executing) I should be able to find the SP and PC for every older frame ( [ SP(1), PC(1) ], [SP(2), PC(2) ], ... [ SP(n), PC(n)] ). Where for each older frame n grows by one.
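Here is a toy stack walker to show what "find every older frame from the youngest SP and PC" means. The layout is entirely made up for the sketch (each frame keeps its sender's SP at mem[sp] and the return PC at mem[sp + 1], with -1 marking the oldest frame); real frame layouts are platform specific and more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: walk from the youngest frame to the oldest using only (SP, PC).
// ASSUMPTION: the two-word frame header and the -1 sentinel are invented
// for this illustration and do not match any real hotspot frame layout.
final class FrameWalkSketch {

    static List<long[]> walk(long[] mem, long sp, long pc) {
        List<long[]> frames = new ArrayList<>();
        while (sp != -1) {
            frames.add(new long[] {sp, pc});
            long senderSp = mem[(int) sp];   // saved SP of the caller
            pc = mem[(int) sp + 1];          // where we return to in the caller
            sp = senderSp;
        }
        return frames;
    }

    public static void main(String[] args) {
        long[] mem = new long[32];
        mem[20] = -1;  mem[21] = 0;     // oldest frame, no sender
        mem[10] = 20;  mem[11] = 300;   // middle frame, returns to pc 300
        mem[4]  = 10;  mem[5]  = 200;   // youngest frame, returns to pc 200
        for (long[] f : walk(mem, 4, 100)) {
            System.out.println("SP=" + f[0] + " PC=" + f[1]);
        }
    }
}
```

Every extra kind of frame (like an adapter frame) is one more layout the loop body has to recognize, which is the cost the rest of this entry is about.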

So a question to ask here is this, "When we call from a compiled method to an interpreted method and clearly execute an adapter does the adapter leave a frame behind?" Well the answer is that prior to Mustang adapters would leave behind a frame. So here's a picture of what the stack would look like prior to Mustang.

So here we see that the youngest frame is an interpreted frame and that we have left behind the C2I adapter frame in between the compiled frame and the interpreted frame.

So what is bad about this and why would we bother changing this? (Believe me, before I was done with this change I asked myself this question a lot.)

Well the biggest thing is that the stack walker has to be able to identify this frame and process it. 

 


From the point of view of the frame handling and stack walking code in the vm they are just a nuisance. Code has to be aware of them and that costs us time in stack walking. For most of the system they really provide no benefit.

So let's look at a different picture. If you remember the reason I started this entry was to answer the question of why -Xbatch works differently. So imagine that a thread executing in a compiled Java method goes to call a method and there is no compiled code for that method. So we'll execute a C2I adapter and then go to enter the interpreter. Now imagine that just as we reach the interpreter we install compiled code for the callee. So we want to call the compiled code but now have the parameters in interpreter format on the stack. So what to do? Well we call an I2C adapter of course.

So here's what the stack looks like after we finally make it to the method we wanted to call. Pretty ugly. Now we have a C2I frame and an I2C frame that don't really have a big benefit and while this is rare the system (frame code, stack walker) has to deal with it. Things are really bad when we put deoptimization in the picture but that's a topic for much later.

Now those of you really paying attention might ask, "So what happens if just as you go to start executing the callee's compiled code the system decides that code is invalid and wants us to interpret?".
 
Now that's a truly evil question. Well we could just call the C2I adapter and go interpreted after all. But wait, what if just as we get to the interpreter there is newly recompiled code available? We're not getting anywhere (fast) and worse we could do this forever or at least until we run out of stack space while creating useless C2I/I2C transition pairs.

Well fortunately that isn't what we do (did). In this rare case we can unwind the I2C frame and then since there is a C2I frame ready for the interpreter we can proceed (modulo some register juggling). That way we can lose the race forever and accomplish no useful work but at least we won't blow out the stack. Worst case we'll see a single useless C2I/I2C pair.

Well that sets the stage for the changes in Mustang and why -Xbatch works differently but that's an entry for another day...





Jun 23 2006, 03:49:15 PM EDT Permalink

20060622 Thursday June 22, 2006

More tiered compilation

So I got this message embedded in the bug(rfe) for tiered compilation where a developer wanted to give us some suggestions about how the jvm should work. I've included it here so I can respond and also to clear up a possible misconception that might be present.

> I am concerned about start-up time before HotSpot gathers enough information to determine that a method
> will be (ie is already) hot, especially if that routine is going to be called 10,001 times, ie just
> once after it happens to have been compiled!  For start-up, HotSpot is often *just-too-late*
> compilation, especially for heavy work done in <clinit> and <init> methods once only.  I really
> do have routines that are run ~18,000 times at start-up and never again as it happens!

> May I suggest that you add to the tiered compilation the possibility to save in the current
> working dir the full signatures of methods hot enough for C1 or C2 optimisation on the previous
> run(s).  On subsequent runs those methods are compiled *before* their first call with cheap (C1)
> optmisation so that (1) if needed at startup they will run better than interpretted and (2) if
>  not actually needed for this run then not too much effort has been expended.  This is also
> safe as it does not (for example) store native code that could be tampered with or inappropriate
> for the next JVM to run the system.  It should also catch things like <clinit> code that might
> normally never be compiled.

So a repository of information collected from one run and used on another run is on the list for things we want to do in Dolphin. It's actually on the runtime group's list but we will certainly take advantage of it. There has also been talk of using annotations to give the jit a hint. This actually isn't too popular since it is too much like "register" declarations in C. It's only a hint and is too often abused.

Now the other thing in this message I wanted to clear up is the idea about how we transition from interpreting a method to a compiled version of a method. There is an implicit message here that if the compiler threshold for invocation is set at 10000 then when the method is called for the 10001st time it will run the compiled version. This is most likely not true and is much less true in mustang.

So when we decide to compile a method, the thread that hit the counter overflow causes a compile event to get queued up. The compiler runs as one or more separate threads (-server normally has two threads, -client can only use a single thread). Until the compilation is complete and the compiled code is installed, every method invocation that occurs after the compile queue is updated will still run interpreted. So depending on whether the compile is a server or a client compile you may execute a lot or just a few more invocations in the interpreter.
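To make the queuing behavior concrete, here is a toy model in plain Java. It is not HotSpot code: the class name, the THRESHOLD constant, and the latch that stands in for "time spent compiling" are all made up for illustration. It only shows that calls arriving after the compile is queued, but before the code is installed, still take the interpreted path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only; every name here is hypothetical and none of it is HotSpot's API.
public class CompileQueueModel {
    static final int THRESHOLD = 10_000; // assumed invocation threshold

    // A single background "compiler thread", as with -client.
    private final ExecutorService compilerThread = Executors.newSingleThreadExecutor();
    private final AtomicBoolean queued = new AtomicBoolean(false);
    private volatile boolean installed = false;
    // Stands in for the time the compile takes: the compile cannot finish until released.
    private final CountDownLatch compileMayFinish = new CountDownLatch(1);
    private final CountDownLatch compileDone = new CountDownLatch(1);

    int interpretedCalls = 0;
    int compiledCalls = 0;

    void invoke() {
        if (installed) {            // compiled code is installed: use it
            compiledCalls++;
            return;
        }
        interpretedCalls++;         // otherwise keep interpreting
        // Counter overflow: queue exactly one compile request, then carry on.
        if (interpretedCalls >= THRESHOLD && queued.compareAndSet(false, true)) {
            compilerThread.submit(() -> {
                try {
                    compileMayFinish.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                installed = true;   // "install" the compiled code
                compileDone.countDown();
            });
        }
    }

    // Lets the pretend compile finish and waits until the code is installed.
    void finishCompile() {
        compileMayFinish.countDown();
        try {
            compileDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        compilerThread.shutdown();
    }

    public static void main(String[] args) {
        CompileQueueModel m = new CompileQueueModel();
        // 500 calls past the threshold arrive while the compile is still pending.
        for (int i = 0; i < THRESHOLD + 500; i++) m.invoke();
        m.finishCompile();
        m.invoke();                 // first call that sees the installed code
        m.shutdown();
        System.out.println("interpreted=" + m.interpretedCalls + " compiled=" + m.compiledCalls);
    }
}
```

In this model all 10500 calls made before the compile finishes are counted as interpreted, and only the call made after installation is counted as compiled. How many extra interpreted calls a real JVM sees depends on how long the actual compile takes.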

Now there is a switch, -Xbatch, that sort of used to almost give you what was described above. This switch says that the thread that initiates the compile request will block and wait for the compile to finish. Now even with this switch set, if another thread went to execute the method we're compiling then it would run it interpreted even though the requesting thread would be blocked awaiting the compilation to complete. This is the "almost" side of the statement.

Now the reason I said "used to" is because the behavior changed in mustang. Prior to mustang the waiting thread would execute the compiled code as soon as the compilation was complete. In mustang that is not the case. When the compilation is complete the waiting thread is released but it resumes executing in the interpreter. The -Xbatch switch only ever meant that the thread would block waiting for the compile; there was never a promise of block and then execute the compiled code. The reason for this behavior change is actually somewhat related to work I did for tiered compilation. I won't go into it here as it is somewhat arcane, but if you're interested leave a comment and I'll do an entry that will probably tell you more than you care to know.

So the -Xbatch switch was not a particularly useful switch even in the pre-mustang days. It is somewhat useful for jvm developers since it tends to make a run more predictable and reproducible. One of the great debugging adventures of the jvm is the fact that things are not as repeatable as you might like.

Now this entry brings up one other topic that could be misunderstood. So hotspot does something we call On Stack Replacement (OSR). This can cause a thread that is executing in an interpreted method to execute in compiled code. Now you might be mistakenly led to believe that when we compile a method and install the compiled code that we OSR all the threads that are currently executing the method. This does NOT happen.

When we do an OSR it requires a very special compile to take place. So we decide to do an OSR when we observe a thread execute the back branch of a loop a specified number of times. When this happens we queue a request for a specialized compile of the method. The compile will treat the head of the loop body as the entry point of the method. In a sense the state of the method at that point becomes the arguments to the method, at least as far as the compiler is concerned. So we are likely to not even generate the code that leads from the normal method entry to the loop, so clearly this OSR compile is not useful for the general method call. Similarly the normal compile is not useful for the OSR case (we can't predict the entry points beforehand). Even if we could predict them you wouldn't want to because of the possible impact on optimization.
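For a feel of the kind of code where OSR matters, here is a minimal sketch: a method that is called exactly once but spends all its time in one loop, so the invocation counter never gets it compiled and only the back-branch counter can. The class and method names are made up for the example.

```java
// Hypothetical example: sumTo is invoked once, so only its loop back branch gets hot.
public class OsrDemo {
    static long sumTo(int n) {
        long sum = 0;
        // The head of this loop is the kind of place an OSR compile would use as its entry point.
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumTo(100_000_000));
    }
}
```

Running this with -XX:+PrintCompilation should, on a HotSpot JVM, show whether sumTo got an OSR compile; such entries are marked with a % in that output. The exact output format and whether the compile happens at all vary by release and flags.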
Jun 22 2006, 12:33:41 PM EDT Permalink