Code Complete
20040930 Thursday September 30, 2004

Learn mdb in 30 minutes

As part of my involvement with the Blastwave community software project, I maintain SPARC and x86 versions of the Subversion packages for Solaris. When I became the maintainer for this software, I inherited with it a strange bug, which now is causing me problems. I currently use CVS for my day to day revision control needs, but I'd like to switch to SVN if only for the cool factor, but also to evaluate how it could eventually be used to replace the ClearCase repository my organization uses.

The issue is with the Apache 2 module mod_dav_svn, which enables SVN repository access via the WebDAV (distributed authoring and versioning) protocol. This is a great feature, as then it is possible to take advantage of modules like mod_auth_ldap to manage access control, or mod_perl to provide even more sophisticated functionality. The bug I was tracing occurs any time a file is committed to a SVN repository via DAV. The commit will fail, and the Apache error logs will contain a line or two like this:

[Thu Sep 30 16:00:10 2004] [notice] child pid 18115 exit signal Illegal instruction (4)

The Apache worker thread dies, and is replaced by another thread, waiting for a new request. (It seems to me that this should probably have a higher severity than notice) On the client side, the svn client outputs this useful message:

svn: Commit failed (details follow):
svn: MKACTIVITY of /svn/!svn/act/7cba529c-56e5-0310-9002-f184ad8e84d2:
Could not read status line: connection was closed by server. (http://server:5957)

So how was I going to track down what was happening with mod_dav_svn if it was being launched by Apache in response to an incoming commit request from my subversion client? I was going to learn how to use mdb, that's how. The first thing I did was learn how to launch a program in the debugger. It's easy: mdb /path/to/program. Apache requires a set of arguments, specifically the -X argument, which forces it to stay in the foreground and such. Finding out how to set arguments was a bit more difficult.

I'm somewhat familiar with the GNU debugger gdb, but this was my first time into mdb or any sort of Solaris debugging of this type (I've done very very low level hardware, firmware and software debugging on the Sun Fire mid-frame server line, but never applications). Using the mdb dcmd ::dcmds lists out the available commands. ::help <dcmd> for any dcmd will give the usage for that dcmd. I picked out the ::run dcmd, which is how arguments are passed into the target executable, and how execution is restarted from the top.

Now that I know this much, I can see what is happening to this module. Once the server is started with the ::run dcmd, I tried to commit a small change to the sample svn repository on my server. As soon as I connect, mdb dumps out with a SIGILL (Illegal Instruction). Using the ::stack dcmd prints out the stack trace:

bash$ mdb -uM /opt/csw/apache2/sbin/httpd 
Loading modules: [ libthread.so.1 libc.so.1 ]
> ::run -X -f /opt/csw/apache2/etc/svn.conf
mdb: stop on SIGILL
mdb: target stopped at:
0x169f08:       unimp     0
mdb: You've got symbols!
Loading modules: [ ld.so.1 ]
> ::stack
0x169f08(fca7bbec, 178010, 1, fff, 176340, ff36d518)
mod_dav_svn.so`dav_svn_get_txn+0x44(177f90, 177eba, 0, 243f0, fedccdd4, fe62949c)
mod_dav_svn.so`dav_svn_prep_activity+0xc(177ee0, 177eba, fe62aaac, fe60a5e4, 177ee0, 0)
mod_dav_svn.so`dav_svn_get_resource+0x5a0(176378, 2f, 0, 0, fca7bda4, 177ee0)
0xfe705564(176378, 800, 0, fca7bda4, 27500, 0)
0xfe70f08c(176378, 1, fe62ac04, 0, 800, fe72c9fc)
ap_invoke_handler+0x178(176378, fe710298, e313c, e2f94, 0, 0)
ap_process_request+0x30(176378, 4, 176378, 0, fe4614a0, 0)
0x2aa60(152ca0, 0, 152cf8, 8ac00, 8ac00, 1000)
ap_process_connection+0xc8(152ca0, 2aa04, e32a8, 0, 152c98, 8d9f0)
0x34790(f81f0, 152bb0, 18, 152ca0, 0, 18)
libthread.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>

From the stack trace, we can see that the issue is occuring in some unnamed function called by the mod_dav_svn function dav_svn_get_txn. Setting a breakpoint on this function can be done using the ::bp dcmd using the symbol name. Now we can try the commit again with a new breakpoint and see where it gets us:

> ::bp dav_svn_get_txn
> ::run -X -f /opt/csw/apache2/etc/svn.conf
mdb: stop at dav_svn_get_txn
mdb: target stopped at:
mod_dav_svn.so`dav_svn_get_txn: save      %sp, -0x90, %sp
mdb: You've got symbols!

Excellent. Now we have control over mod_dav_svn before it hits the illegal instruction. Mdb shows in the output that we have just saved off the stack pointer after dav_svn_get_txn is called by dav_svn_prep_activity. Looking in the actual source code for dav_svn_get_txn shows that the first steps in this function are to initialize and open a Berkeley DB (BDB) database, which is used by this installation of subversion to manage transactions in the repository. Mdb will now allow us to single step through the code using the short dcmd :s which steps one instruction at a time. If single stepping gets into a C library function or other function that is not critical to the current debugging task, specify :s out to step out of the function:

> :s
mdb: target stopped at:
libsvn_subr-1.so.0.0.0`svn_path_join+0x28:      call      +0x26d9c    <PLT=libc.so.1`strlen>
> :s
mdb: target stopped at:
libsvn_subr-1.so.0.0.0`svn_path_join+0x2c:      nop
> :s out
mdb: stop at mod_dav_svn.so`dav_svn_get_txn+0x34
mdb: target stopped at:
mod_dav_svn.so`dav_svn_get_txn+0x34:    mov       1, %o2
>

This shows mdb stepping out of the strlen() function called from dav_svn_get_txn. Quite a bit of single stepping is needed to get to the actual error, as we call into the Apache support libraries eventually. I knew from some previous research that others had found the issue to be inside of the aprutil library, so I knew not to step out of any apr_ functions. One of the biggest pains I found with mdb was single stepping. What I really wanted to do was to break on a function and single step through everything within libaprutil (but only libaprutil), printing each disassembled line. I couldn't find out how to do this in 30 minutes using the Solaris Modular Debugger Guide, and I also couldn't find how to repeat the command I just sent to mdb. What I really needed, and continue to need, as I still haven't found the answer, is a command like gdb '.', which repeats the last command verbatim. With this, I would be able to type :s, then use a single character to step through lines, rather than two (50% efficiency improvement :). Eventually, after carpal tunnel started to set in, I hit the SIGILL:

> :s
mdb: target stopped at:
0xff3515c0:     ld        [%o0 + 0xf0], %l3
> :s
mdb: target stopped at:
0xff3515c4:     jmpl      %l3, %o7
> :s
mdb: target stopped at:
0xff3515c8:     clr       %o1
> :s
mdb: target stopped at:
0x169fc8:       unimp     0
> ::dis -w
0x169fa0:                       unimp     0
0x169fa4:                       unimp     0
0x169fa8:                       unimp     0
0x169fac:                       unimp     0
0x169fb0:                       cb13      -0x440000   <0xffd29fb0>
0x169fb4:                       unimp     0
0x169fb8:                       unimp     0
0x169fbc:                       unimp     0
0x169fc0:                       unimp     0x69
0x169fc4:                       unimp     0
0x169fc8:                       unimp     0      <--- culprit
0x169fcc:                       unimp     0
0x169fd0:                       unimp     0
0x169fd4:                       unimp     2
0x169fd8:                       swapa     [%g4 + -0x18] %asi, %i7
0x169fdc:                       swapa     [%g5 + %l0] 03, %i7
0x169fe0:                       unimp     0x20
0x169fe4:                       unimp     0xa
0x169fe8:                       unimp     0
0x169fec:                       unimp     0
0x169ff0:                       unimp     0

Using the ::dis dcmd disassembles instructions near the current code pointer. The -w argument shows a window on either side of the current line. One more step and the process would throw a SIGILL and the apache request will die. A little more break point experimentation resulted in this call sequence:

mod_dav_svn.so`dav_svn_get_txn
    libaprutil.so.0.9.5`apr_dbm_open
        libaprutil.so.0.9.5`apr_posix_perms2mode

It looks like Forte generated a bit of bad code in libaprutil, which could be due to the fact that it is compiled with a high optimization level. This supported the claims of some users on the Subversion user mailing list who indicated that the issues were within libaprutil. The user had recompiled APR, Apache 2 and Subversion with GCC 3.4, and the SIGILL problem went away. The next step for me is to work with the apr and aprutil maintainer to produce some unoptimized versions of the APR libraries.

Permalink Comments [2]

Upgrade Update

I spent some time last night second-guessing my decision to go with a Socket 939 solution instead of a Socket 754. The AMD64 processor line has an on-die memory controller, like the UltraSPARC-III and later processors. Socket 754 was the first to be introduced, with a single channel DDR controller, while earlier this year saw the first 939 processors and boards with dual-channel DDRCs. According to most reviews, the difference is not that large - a best case 3-5% performance improvement. However, there is a third AMD64 socket type - 940, which is used today by the Athlon FX series of processors and the server based Opteron chips. The plan is to migrate the FX series from 940 to 939, leaving the Opteron in the server space. The Athlon 64/64FX processors will all unified in the Socket 939 spec, targeted at workstations and enthusiasts. Socket 754 won't go away, as AMD will target this at the main-stream desktop market. The Sempron and Socket A will continue to serve the budget desktop market.

As Socket 939 is a relatively new introduction, processors and motherboards are still expensive. Most of the forum posts I've read on the topic suggest that the price/performance for Socket 939 just isn't there. Of course, most of these posts were from the start of this summer, and since then the price of an AMD64 3500+ has fallen from ~US$500 to ~US$330. The nearest Socket 754 chip comparable to a 3500+ is a 3400+, which comes in at ~US$290. The price difference of ~US$40 seems like a small amount to spend to get a better upgrade path, once AMD starts introducing dual-core processors (assuming that won't be another DDRC change requiring yet another socket change). I finally convinced myself that 939 was the way to go. Case 50% closed.

The other side of the equation is the motherboard. Socket 939 motherboards are also relatively inexpensive, compared to what I have spent in the past to upgrade my system. The MSI board listed in my last AMD Upgrade post is ~US$140, while some of the Socket 754 boards with similar functionality come in ~US$120. The difficult choice here was between the NVidia nForce3 250Gb chipset and the VIA K8T800. Every non-forum review of these two chipsets I have read concludes that both are equal. Anecdotal evidence from forums seems to suggest that nForce is more stable than VIA, but I've been using ABit boards based on VIA chipsets for years without stability issues. One of my chief concerns is support for the chipset in non-Windows environments like Linux and Solaris x86_64. VIA has generic drivers in the Linux kernel, so basic functionality can be had without having to resort to separate drivers. Functionality of nF3 based boards in Windows surpasses that of VIA, but it requires NVidia's proprietary drivers to operate in Linux. At this point, I'm not sure if either chipset is supported in Solaris.

My decision is now down to the final sticking point. Should I go with the VIA, and spare some of the performance and stability for driverless support under Linux? Or should I go with the nForce3 board, get superior Windows performance for games, and live with the fact that I will, yet again, be out on the cold, harsh, bleeding edge of technology.

If it were'nt for the fact that Newegg.com has been down for the last 12-14 hours, I probably would have already made the decision. The extra time I have to mull over the issue is certainly causing me to overthink. I can hardly believe that a site like Newegg.com, which is absolutely critical to their business (it is their business), is down for this length of time. Even a planned outage for the revenue producing face of this company should not even be noticed by the users, much less interrupt purchases and prevent new ones from accessing the store.

Maybe it would be a good idea for the Sun sales force to descend on Newegg and show them how e-business is really supposed to work.

Permalink

Give me caffeine or give me...

No, wait... I just want some coffee. The good graces of Sun keep my brain awash in caffeine from the time I arrive, until I collapse in a lifeless heap around 3:30PM from too much caffeine (not really). However, there are those who would interrupt my flow of this essential element, through their antisocial coffee etiquette faux pas. Here are the rules as I understand them -- there are few, but critical to survival (not really):

  1. When thou com'st upon an unfilled urn of coffee, fill't up.
  2. Thou shalt not drain the urn and not refill because thou shalt "miss thine meeting".
  3. If thine meeting is a grave matter, and lateness brings punishment, thou shalt drink of a lesser beverage (water?), or appoint an acolyte to fill the urn.
  4. Thou shalt not silently drain the urn, and convince thine self that it is truly not empty. It is.

Can't we all just get along?

Permalink Comments [1]