Friday September 01, 2006

Least favorite bug

It seems only fair that if I wrote about my favorite bug report I should also take on my least favorite. That bug is this one. Now I'm not really complaining about the original submitter or even the bug itself. There are issues with the bug itself, with how hotspot reports bugs, and with a few of the people responding to the bug.

This bug has a sort of interesting history. I got assigned this bug because I'm pretty much "mr. deopt" around here. Now unfortunately this bug initially had no testcase, which always makes a bug hard to debug. If you decode the error id in this bug ( 434F444523414348450E43505000CF ) it decodes as:

codeCache.cpp, 207

which tells the line of source where the fatal error was detected. In this case that line is this code in CodeCache::find_blob():

guarantee(!result->is_zombie() || result->is_locked_by_vm() || is_error_reported(), "unsafe access to zombie method");
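
For the curious, the decoding can be sketched in a few lines of Python. The scheme used here is an assumption inferred only from the two error ids quoted in this entry, not taken from the hotspot source: every byte except the last two looks like a character of the file name with 0x20 subtracted, and the last two bytes look like the line number, high byte first. decode_error_id is a hypothetical helper name.

```python
from typing import Tuple


def decode_error_id(error_id: str) -> Tuple[str, int]:
    """Decode a hex error id into (file name, line number).

    Assumed layout (inferred from the ids in this post, not from hotspot):
    - all bytes but the last two: file name characters, each minus 0x20
    - last two bytes: line number, big-endian
    """
    raw = bytes.fromhex(error_id)
    name_bytes, line_bytes = raw[:-2], raw[-2:]
    # Undo the assumed 0x20 offset to recover the original characters.
    filename = "".join(chr(b + 0x20) for b in name_bytes)
    line = int.from_bytes(line_bytes, "big")
    return filename, line


print(decode_error_id("434F444523414348450E43505000CF"))
```

Under that assumption the id above comes out as codeCache.cpp and 207, which matches the decoding given here.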

Now several things here. One, I really hate those stupid error ids. Just useless in my opinion. When I first started working on hotspot I argued against them for quite a long time. I don't see any benefit in them. The source file and line number are more useful (to a degree) and much easier to remember or search for. Even the source/line number has its issues, and in the case of this bug it has many issues (more than 80 in fact).

So the first thing I wanted to do when I got this bug was try to get a test case or even a stack trace. You would not believe how hard it was to do that. In those days, because of either privacy issues or lame database access or whatever, it was pretty much impossible to get the email address of the initial submitter. In the meantime I'd get email every time someone "voted" for this bug.

Eventually I was able to get in touch with the initial submitter and got a stack trace. The stack trace made it obvious what the bug was for this submitter. In 1.5 the problem was fixed by the switch from eager-deopt to lazy-deopt. In 1.4.2 it was a pretty simple source change that went into 1.4.2_06.

So why is this my least favorite bug? Well, if you read the bug you can see a fair number of people reporting this error id and claiming the bug is not fixed. So what's up with that? Well, I showed you the line of code with the guarantee call. If the guarantee in CodeCache::find_blob fails you will get this error id. Do you know how many places in hotspot call find_blob? In 1.4.2 there are 80+ locations. That means a mistake in any of 80 separate pieces of code will give the same error id. So here's a thing I dislike about the error reporting in hotspot. If the line number reported were the id of the caller of find_blob then you might potentially get 80-some distinct bugs, where the current reporting system makes it look like there is just one bug. Of course there is really even wider fan-out than this, but at least getting the caller's source id instead of find_blob's id would be a vast improvement.
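
To make the fan-in concrete, here is a toy sketch in Python. It is not hotspot code: find_blob here is only a stand-in with a check inside it, and the callers stack_walker, exception_handler and deoptimizer are made-up names, each standing for a different bug. The only point is to compare reporting the location of the check with reporting the location of its caller. It relies on sys._getframe, which is a CPython detail.

```python
import sys


def reporting_site(depth):
    """Return the name of the function `depth` frames above this helper."""
    return sys._getframe(depth).f_code.co_name


def find_blob(is_zombie, report_caller):
    """Toy stand-in for a shared lookup with a sanity check inside it."""
    if is_zombie:
        # depth 1 is find_blob itself, depth 2 is whoever called find_blob.
        raise RuntimeError(reporting_site(2 if report_caller else 1))
    return "blob"


# Three hypothetical callers, each standing for a different underlying bug.
def stack_walker(report_caller):
    return find_blob(True, report_caller)


def exception_handler(report_caller):
    return find_blob(True, report_caller)


def deoptimizer(report_caller):
    return find_blob(True, report_caller)


def distinct_error_ids(report_caller):
    """Collect the set of ids seen when every caller trips the check."""
    ids = set()
    for caller in (stack_walker, exception_handler, deoptimizer):
        try:
            caller(report_caller)
        except RuntimeError as err:
            ids.add(str(err))
    return ids


print(distinct_error_ids(report_caller=False))
print(distinct_error_ids(report_caller=True))
```

With report_caller=False all three failures collapse into the single id find_blob; with report_caller=True each caller shows up under its own name, so three different bugs stay three different reports.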

Now do I fault the jvm users that get confused by this and claim that I didn't fix the bug? Well, not initially, at least not before I explained this issue somewhat in the bug report. I certainly admit it annoyed me. Fortunately once the bug got closed I stopped getting email for each vote, which helped my attitude. :-)

What really annoyed me about this bug, though, was this report added to it later:

The similar error occur in the folowing environment
Linux version 2.4.22-1.2115.nptl
(bhcompile@daffy.perf.redhat.com) (gcc version 3.2.3
20030422 (Red Hat Linux 3.2.3-6)) #1 Wed Oct 29
15:42:51 EST 2003

This is a error log
#
# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.2_03-b02 mixed mode)
#
# Error ID: 53484152454432554E54494D450E435050014F
#
# Problematic Thread: prio=1 tid=0x4f2ceec8

Now notice the error id. Is it the same? Not even remotely close. So what led "malku" to say it was similar? Because it was a crash? I got no idea. Here's what is the worst thing about this bug. The error id decodes to:

sharedRuntime.cpp, 335

in this code:

guarantee(cb != NULL, "exception happened outside interpreter, nmethods and vtable stubs (1)");

(right after a find_blob call BTW).

Now this line of code in the vm is the mother of all duplicate bugs. If I thought the original bug with its 80+ places to fan in was bad, this one is worse since it has an uncountable number of places.

This guarantee will fail whenever there is an unrecognized exception (not a Java exception, but a machine-level exception like a segmentation fault [SEGV], divide by zero, etc.) while a thread is executing in "Java" mode and somehow ends up in code we don't recognize and faults. Generally we leap into space and execute random code. There is simply no telling what the initial cause is. For this error id there can easily be hundreds or thousands (OK, let's hope we don't have thousands) of initial culprits.

So now not only do I have people with very possibly different bugs for the original id complaining; because malku plugged a different error id into this bug and people Google that error id, I have all those failures piling on as well. It was just a mess.

Now at least thanks to writing this blog entry when people Google that ...14F id they'll probably get a hit here and at least see an explanation of what is going on. So some small amount of good will come out of it.

Now the good news is that error reporting is much better in 5.0 and 6.0. For this ...14F type of error we no longer report that poor piece of blameless code in sharedRuntime. Now we report the pc where the actual error happened and some clues about the stack and registers at the time of the fault. So at least the issues with ...14F are history. Unfortunately at this point the guarantee in find_blob is still doing its best to confuse the innocent about the root cause of the error. Hopefully in 7.0 things will be improved here too.


Sep 01 2006, 03:09:18 PM EDT

Comments:

Well, it's a really tough job

Posted by Parambir on September 03, 2006 at 01:26 AM EDT #

Thanks for posting this information. We just got the 53484152454432554E54494D450E435050014F error, but nothing helpful to go along with it. So we will hope it doesn't happen too often.

Posted by MAtt on September 08, 2006 at 01:51 PM EDT #
