Learn mdb in 30 minutes
As part of my involvement with the Blastwave community software project, I maintain SPARC and x86 versions of the Subversion packages for Solaris. When I became the maintainer for this software, I inherited with it a strange bug, which now is causing me problems. I currently use CVS for my day to day revision control needs, but I'd like to switch to SVN if only for the cool factor, but also to evaluate how it could eventually be used to replace the ClearCase repository my organization uses.
The issue is with the Apache 2 module mod_dav_svn, which enables SVN repository access via the WebDAV (distributed authoring and versioning) protocol. This is a great feature, as then it is possible to take advantage of modules like mod_auth_ldap to manage access control, or mod_perl to provide even more sophisticated functionality. The bug I was tracing occurs any time a file is committed to a SVN repository via DAV. The commit will fail, and the Apache error logs will contain a line or two like this:
[Thu Sep 30 16:00:10 2004] [notice] child pid 18115 exit signal Illegal instruction (4)
The Apache worker thread dies, and is replaced by another thread, waiting for a new request. (It seems to me that this should probably have a higher severity than notice) On the client side, the svn client outputs this useful message:
svn: Commit failed (details follow): svn: MKACTIVITY of /svn/!svn/act/7cba529c-56e5-0310-9002-f184ad8e84d2: Could not read status line: connection was closed by server. (http://server:5957)
So how was I going to track down what was happening with mod_dav_svn if it was being launched by Apache in response to an incoming commit request from my subversion client? I was going to learn how to use mdb, that's how. The first thing I did was learn how to launch a program in the debugger. It's easy: mdb /path/to/program. Apache requires a set of arguments, specifically the -X argument, which forces it to stay in the foreground and such. Finding out how to set arguments was a bit more difficult.
I'm somewhat familiar with the GNU debugger gdb, but this was my first time into mdb or any sort of Solaris debugging of this type (I've done very very low level hardware, firmware and software debugging on the Sun Fire mid-frame server line, but never applications). Using the mdb dcmd ::dcmds lists out the available commands. ::help <dcmd> for any dcmd will give the usage for that dcmd. I picked out the ::run dcmd, which is how arguments are passed into the target executable, and how execution is restarted from the top.
Now that I know this much, I can see what is happening to this module. Once the server is started with the ::run dcmd, I tried to commit a small change to the sample svn repository on my server. As soon as I connect, mdb dumps out with a SIGILL (Illegal Instruction). Using the ::stack dcmd prints out the stack trace:
bash$ mdb -uM /opt/csw/apache2/sbin/httpd Loading modules: [ libthread.so.1 libc.so.1 ] > ::run -X -f /opt/csw/apache2/etc/svn.conf mdb: stop on SIGILL mdb: target stopped at: 0x169f08: unimp 0 mdb: You've got symbols! Loading modules: [ ld.so.1 ] > ::stack 0x169f08(fca7bbec, 178010, 1, fff, 176340, ff36d518) mod_dav_svn.so`dav_svn_get_txn+0x44(177f90, 177eba, 0, 243f0, fedccdd4, fe62949c) mod_dav_svn.so`dav_svn_prep_activity+0xc(177ee0, 177eba, fe62aaac, fe60a5e4, 177ee0, 0) mod_dav_svn.so`dav_svn_get_resource+0x5a0(176378, 2f, 0, 0, fca7bda4, 177ee0) 0xfe705564(176378, 800, 0, fca7bda4, 27500, 0) 0xfe70f08c(176378, 1, fe62ac04, 0, 800, fe72c9fc) ap_invoke_handler+0x178(176378, fe710298, e313c, e2f94, 0, 0) ap_process_request+0x30(176378, 4, 176378, 0, fe4614a0, 0) 0x2aa60(152ca0, 0, 152cf8, 8ac00, 8ac00, 1000) ap_process_connection+0xc8(152ca0, 2aa04, e32a8, 0, 152c98, 8d9f0) 0x34790(f81f0, 152bb0, 18, 152ca0, 0, 18) libthread.so.1`_lwp_start(0, 0, 0, 0, 0, 0) >
From the stack trace, we can see that the issue is occuring in some unnamed function called by the mod_dav_svn function dav_svn_get_txn. Setting a breakpoint on this function can be done using the ::bp dcmd using the symbol name. Now we can try the commit again with a new breakpoint and see where it gets us:
> ::bp dav_svn_get_txn > ::run -X -f /opt/csw/apache2/etc/svn.conf mdb: stop at dav_svn_get_txn mdb: target stopped at: mod_dav_svn.so`dav_svn_get_txn: save %sp, -0x90, %sp mdb: You've got symbols!
Excellent. Now we have control over mod_dav_svn before it hits the illegal instruction. Mdb shows in the output that we have just saved off the stack pointer after dav_svn_get_txn is called by dav_svn_prep_activity. Looking in the actual source code for dav_svn_get_txn shows that the first steps in this function are to initialize and open a Berkeley DB (BDB) database, which is used by this installation of subversion to manage transactions in the repository. Mdb will now allow us to single step through the code using the short dcmd :s which steps one instruction at a time. If single stepping gets into a C library function or other function that is not critical to the current debugging task, specify :s out to step out of the function:
> :s mdb: target stopped at: libsvn_subr-1.so.0.0.0`svn_path_join+0x28: call +0x26d9c <PLT=libc.so.1`strlen> > :s mdb: target stopped at: libsvn_subr-1.so.0.0.0`svn_path_join+0x2c: nop > :s out mdb: stop at mod_dav_svn.so`dav_svn_get_txn+0x34 mdb: target stopped at: mod_dav_svn.so`dav_svn_get_txn+0x34: mov 1, %o2 >
This shows mdb stepping out of the strlen() function called from dav_svn_get_txn. Quite a bit of single stepping is needed to get to the actual error, as we call into the Apache support libraries eventually. I knew from some previous research that others had found the issue to be inside of the aprutil library, so I knew not to step out of any apr_ functions. One of the biggest pains I found with mdb was single stepping. What I really wanted to do was to break on a function and single step through everything within libaprutil (but only libaprutil), printing each disassembled line. I couldn't find out how to do this in 30 minutes using the Solaris Modular Debugger Guide, and I also couldn't find how to repeat the command I just sent to mdb. What I really needed, and continue to need, as I still haven't found the answer, is a command like gdb '.', which repeats the last command verbatim. With this, I would be able to type :s, then use a single character to step through lines, rather than two (50% efficiency improvement :). Eventually, after carpal tunnel started to set in, I hit the SIGILL:
> :s mdb: target stopped at: 0xff3515c0: ld [%o0 + 0xf0], %l3 > :s mdb: target stopped at: 0xff3515c4: jmpl %l3, %o7 > :s mdb: target stopped at: 0xff3515c8: clr %o1 > :s mdb: target stopped at: 0x169fc8: unimp 0 > ::dis -w 0x169fa0: unimp 0 0x169fa4: unimp 0 0x169fa8: unimp 0 0x169fac: unimp 0 0x169fb0: cb13 -0x440000 <0xffd29fb0> 0x169fb4: unimp 0 0x169fb8: unimp 0 0x169fbc: unimp 0 0x169fc0: unimp 0x69 0x169fc4: unimp 0 0x169fc8: unimp 0 <--- culprit 0x169fcc: unimp 0 0x169fd0: unimp 0 0x169fd4: unimp 2 0x169fd8: swapa [%g4 + -0x18] %asi, %i7 0x169fdc: swapa [%g5 + %l0] 03, %i7 0x169fe0: unimp 0x20 0x169fe4: unimp 0xa 0x169fe8: unimp 0 0x169fec: unimp 0 0x169ff0: unimp 0
Using the ::dis dcmd disassembles instructions near the current code pointer. The -w argument shows a window on either side of the current line. One more step and the process would throw a SIGILL and the apache request will die. A little more break point experimentation resulted in this call sequence:
mod_dav_svn.so`dav_svn_get_txn
libaprutil.so.0.9.5`apr_dbm_open
libaprutil.so.0.9.5`apr_posix_perms2mode
It looks like Forte generated a bit of bad code in libaprutil, which could be due to the fact that it is compiled with a high optimization level. This supported the claims of some users on the Subversion user mailing list who indicated that the issues were within libaprutil. The user had recompiled APR, Apache 2 and Subversion with GCC 3.4, and the SIGILL problem went away. The next step for me is to work with the apr and aprutil maintainer to produce some unoptimized versions of the APR libraries.


