Monday June 11, 2007

I recently got assigned a bug. It got assigned to me because the stack trace that came with it showed some of the deoptimization code. (Note that the current synopsis is different from the one it had when I got it and reflects the debugging I did.) When I got it I looked at the trace and said to myself "this isn't a deopt bug, it is likely an awt/2d malloc stomp", but I looked at it anyway.
By the time I started looking at it, Tom R. had already done a bunch of experiments to track down which library was the culprit and whether a debug malloc would find the problem. The results were confusing, but the consensus was that the problem disappeared when a debug malloc was used.
So the first problem for me was that the supplied testcase ran on linux, and there weren't any suitable machines on my local network, so I had to run it remotely. I thought this was going to be a pain since it meant pointing the remote machine at my local display. First lesson learned: Years ago I asked about a way to deal with this situation but none of the people I asked had a good answer. Now thanks to Tom's suggestion I know that I can set up a vnc server on a dummy display and dispense with the network i/o. (Or on solaris I could use Xvfb.) I wish I had learned that ages ago.
Once I started looking at the crashes in gdb I was even more convinced that the fault was not in the jvm. The data near the block that was being released didn't look like jvm data and looked suspiciously like graphics data (an array of coordinates). Unfortunately I didn't read the malloc sources carefully enough and misinterpreted what I saw. I believed I was seeing a case of someone using a dangling pointer after the memory was freed rather than a more vanilla write-off-the-end-of-storage failure. (The fact that the debug mallocs didn't find it contributed to that misunderstanding.)
I thought about using valgrind but decided not to because I knew there were issues with using it with the jvm and I wasn't sure I wanted to go down that route yet. I'd save that approach in case my current strategies didn't work.
I located the pieces of the 2d code that seemed to be related to the corruption and modified them to put some crumbs in the storage when they malloced/freed it, so I could at least figure out what data was suspect. This led to the next lesson. I've been working on the jvm for a long time (>7 years) but I've never actually built anything but the jvm. I've never built an entire jdk. I was hoping to keep this record intact. Well, we have an internal tool called PRT that jvm developers use to build and test the jvm on the supported platforms and, if the run is successful, to putback the changes into the appropriate workspace. There is a new tool, JPRT, which does a similar thing for both the jdk and the jvm. We're supposed to be migrating to JPRT but I hadn't used it up to this point. JPRT has a lot of nice features and it allows you to specify which builds you want. So I was able to make my mods to two files in the 2d source and tell JPRT to build me a linux jdk. So I learned how to use JPRT and still managed to not build a jdk by myself.
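For anyone who hasn't used the crumb trick, here is a rough sketch of the general idea in C. It is not the instrumentation that went into the 2d code; the names (crumb_malloc, crumb_check, crumb_free, CRUMB) and the layout are made up for illustration. The wrapper puts a recognizable pattern just before and just after each block, so a store that runs off the end shows up as a damaged crumb the next time the block is checked or freed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical crumb pattern; any value unlikely to occur in real data works. */
#define CRUMB 0xC0FFEE42u

enum { CRUMB_OK = 0, CRUMB_FRONT_DAMAGED = 1, CRUMB_TAIL_DAMAGED = 2 };

/* Header stored in front of the payload. The union with max_align_t keeps
 * the payload aligned the same way plain malloc would align it. */
typedef union {
    struct {
        size_t   size;   /* payload size the caller asked for */
        uint32_t front;  /* crumb sitting just before the payload */
    } info;
    max_align_t align;
} crumb_header;

/* Allocate size bytes with a crumb on each side of the payload. */
void *crumb_malloc(size_t size)
{
    const size_t overhead = sizeof(crumb_header) + sizeof(uint32_t);
    if (size > SIZE_MAX - overhead)
        return NULL;

    unsigned char *raw = malloc(size + overhead);
    if (raw == NULL)
        return NULL;

    crumb_header *h = (crumb_header *)raw;
    h->info.size = size;
    h->info.front = CRUMB;

    /* The tail crumb may be unaligned, so copy it in byte-wise. */
    const uint32_t tail = CRUMB;
    memcpy(raw + sizeof(crumb_header) + size, &tail, sizeof tail);
    return raw + sizeof(crumb_header);
}

/* Report which crumb, if any, has been overwritten. */
int crumb_check(const void *payload)
{
    const unsigned char *p = payload;
    const crumb_header *h = (const crumb_header *)(p - sizeof(crumb_header));

    /* If the front crumb is gone the stored size can't be trusted either,
     * so don't go looking for the tail. */
    if (h->info.front != CRUMB)
        return CRUMB_FRONT_DAMAGED;

    uint32_t tail;
    memcpy(&tail, p + h->info.size, sizeof tail);
    return tail == CRUMB ? CRUMB_OK : CRUMB_TAIL_DAMAGED;
}

/* Check the crumbs, release the block, and return the check result. */
int crumb_free(void *payload)
{
    int status = crumb_check(payload);
    free((unsigned char *)payload - sizeof(crumb_header));
    return status;
}

/* Demo: an off-by-one loop over a small coordinate array lands on the
 * tail crumb, which crumb_free then reports. */
int demo_off_by_one_store(void)
{
    enum { NPOINTS = 4 };
    int *coords = crumb_malloc(NPOINTS * sizeof *coords);
    if (coords == NULL)
        return -1;
    for (int i = 0; i <= NPOINTS; i++)   /* bug: should be i < NPOINTS */
        coords[i] = i * 10;
    return crumb_free(coords);
}
```

In a real hunt you would more likely log or abort inside crumb_free than return a status, and you would call crumb_check at interesting points to narrow down when the damage happens. A scheme like this only notices a stomp that actually lands on a crumb, so a store further past the end can slip by unnoticed.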
After a few iterations of building instrumented jdks and staring at crashes I was able to track down an out-of-bounds store in some 2d code. After looking at the code I was able to come up with what looked to be a good enough fix to at least eliminate the stomp and see what happened. The crash was gone. So my initial prediction turned out to be accurate and I turned the bug and all my notes over to someone who actually understands the code for a proper fix.
Now for the final (hard) lesson. After I had tracked this down to a simple oob store, Tom and I wondered why the debug mallocs didn't find it. So Tom decided to try valgrind on the testcase. In very short order (maybe an hour) it had located the same piece of code I had spent a good bit of time finding, plus one other suspicious routine. So I definitely learned respect for valgrind and will surely use it in cases like this in the future!
I think we'd benefit greatly from learning some of your
debugging techniques, especially with hard stuff like
memory stomping.
Dmitri
Java2D Team
Posted by Dmitri Trembovetski on June 11, 2007 at 01:36 PM EDT #
Rgds
Damon
PS. Any tiering news?
Posted by Damon Hart-Davis on June 11, 2007 at 01:39 PM EDT #
Posted by Alex Miller on June 11, 2007 at 01:42 PM EDT #