Paul Roberts' Musings

All | Esperanto | Films | General | Music | Security | Solaris | Technology
« Previous day (Jun 13, 2005) | Main | Next day (Jun 14, 2005) »
20050614 Tuesday June 14, 2005

A Simple Userland Bugfix

Now that OpenSolaris is up and running, we can talk more freely about the source code that we work on everyday. I thought I'd celebrate this launch by narrating the investigation of a simple userland bug fix that I worked on recently. Although there's nothing magical or ground-breaking here, it might still provide some insights into the bug-fixing process for those who aren't familiar with it, and to at least one small part of the Solaris code for those who are.

The bug was discovered in the Solaris implementation of one of the ancient and ever-ready Unix tools: xargs(1). The job of this tool is to collect arguments via the standard input and to bundle them up into argument lists to be passed to another tool. Its one of those commands that you probably think you'll never need, but every now and then you come across some situation where it does exactly what you want (and nothing else!). So many Unix tools fit that description!

For the most part, this tool works great, however, one user discovered that when given a specific kind of input, it wasn't behaving as expected. In the test case provided, the input to the command is quoted with a backslash character, meaning the spaces should be ignored (causing each line to be treated as a single argument). This input is passed to a command which prints each of its arguments on a line by itself. In this scenario, because each line is a single argument and each argument is printed on a single line, the input should be the same as the output. However, when given a sufficiently large input text, the output would slightly differ: new lines were added in certain places. Bug 6203159 was duely logged, and I was tasked with fixing it.

After getting a reproducible test case up and running, we can start digging a little further into exactly what is happening when the error occurs. We can recreate the problem with a simplified test case, using the -s flag to the xargs command to artificially limit the maximum length of the command line argument that xargs builds up (perhaps a clue about where the problem might lie...):

$ xargs printf "%s\n" <<EOF
> one\ line
> one\ line
> EOF
one line
one line
$ xargs -s 26 printf "%s\n" <<EOF
> one\ line
> one\ line
> EOF
one line
one
line

The output of these two invocations should be the same, however we can see that in the second case limiting the length of the created command line corrupts the output. Lets use truss(1) to find out exactly what commands xargs is executing (-t lets us just trace the exec(2) system call, -f says to also trace child processes, and -a prints out the arguments passed to exec-like system calls):

$ truss -fat exec -o truss.log xargs <testdata -s 26 printf "%s\n" >/dev/null
$ grep -v ENOENT truss.log                                         
14433:  execve("/usr/bin/xargs", 0xFFBFF3E4, 0xFFBFF3FC)  argc = 5
14433:   argv: xargs -s 26 printf %s\n
14437:  execve("/bin/printf", 0x00024B7C, 0xFFBFF3FC)  argc = 3
14437:   argv: printf %s\n one line
14442:  execve("/bin/printf", 0x00024B7C, 0xFFBFF3FC)  argc = 4
14442:   argv: printf %s\n one line

Now, this is very interesting, because we can see that in the second invocation of printf(1) we are passing 4 arguments instead of three (the 'argc = 4' text shows this) so the words 'one' and 'line' must be being sent in the second case as separate arguments: the '\' isn't doing its job. So some where in the part of the code where xargs builds up its list of arguments, we're losing this special character.

With a good idea of where the problem is occuring, we can turn to the code. xargs builds up its argument list character by character within the getarg() function. The main loop here starts at the follwing lines:

	for (arg = next; ; c = getwchr()) {
 		bytes = wctomb(mbc, c);
        ...

so lets grab our trusty debugger (in this case dbx), get the thing running within it, and stop execution at the second line above. We can then watch the argument build up for a while:

(dbx) runargs <testdata -s 28 printf "%s"                                    
(dbx) stop at 548                        
(2) stop at "xargs.c":548
(dbx) run       
Running: xargs -s 26 printf %s < testdata 
(process id 2244)
stopped in getarg at line 548 in file "xargs.c"
  548                   bytes = wctomb(mbc, c);
(dbx) display arg
arg = 0x268a2 ""
(dbx) cont      
stopped in getarg at line 548 in file "xargs.c"
  548                   bytes = wctomb(mbc, c);
arg = 0x268a2 "o"
(dbx) cont
stopped in getarg at line 548 in file "xargs.c"
  548                   bytes = wctomb(mbc, c);
arg = 0x268a2 "on"
(dbx) cont
stopped in getarg at line 548 in file "xargs.c"
  548                   bytes = wctomb(mbc, c);
arg = 0x268a2 "one"

So we can see the string being built up bit by bit. Lets skip on to the second line just before our magic '\' is read in:

(dbx) cont
stopped in getarg at line 548 in file "xargs.c"
  548                   bytes = wctomb(mbc, c);
arg = 0x268ab "one"
(dbx) next
stopped in getarg at line 551 in file "xargs.c"
  551                   if ((next + ((bytes == -1) ? 1 : bytes)) >= &argbuf[BUFLIM]) {
arg = 0x268ab "one"
(dbx) print c
c = '\\'

So we can see from the output of 'print c' that our current character is indeed our slash. What happens to it? Well a switch statement is run on the character and eventually it is discovered to be a quote char:

                  case L'\\':
                         c = getwchr();

at this point, our 'c' is actually thrown away and the character following it is read in instead. This is how the quoting is implemented because this happens after a character's special meaning is determined, meaning the space character (in this case) gets copied directly to the argument, and the '\' is removed:

(dbx) cont
stopped in getarg at line 548 in file "xargs.c"
  548                   bytes = wctomb(mbc, c);
arg = 0x268ab "one "

So far so good. But where does it all go wrong? Lets follow the execution a bit further. At some point, we reach the limit of our buffer (which, if you remember, we have artificially shortened with the '-s' switch) and when that happens, we enter the code within the following if statement:

if ((next + ((bytes == -1) ? 1 : bytes)) >= &argbuf[BUFLIM]) {
...

At that point, we have a nice comment that describes what happens next:

/*
 * Otherwise we put back the current argument
 * and use what we have collected so far...
 */

So wait a minute, alarm bells are ringing now: when we can no longer read any more input we requeue what we have read in so far so that it can be re-read in the next iteration:

...
(void) strcpy(so_far, arg);
...
queue(so_far, len, HEAD);

Hang on a minute, haven't we removed the '\' from the argument that we have built up? Yes, that means when we perform this requeing, we lose the quote character and just requeue a lonely space, meaning when that text is re-read the space will be active, causing the symptoms we've seen here.

So that's it, after a lot of digging we have worked out what is causing the problem. As for how it was fixed, well that's another story. Why not dig through the OpenSolaris code and answer that question for yourself?

Technorati Tag: OpenSolaris Technorati Tag: Solaris ( Jun 14 2005, 06:02:52 PM BST ) Permalink Comments [0]

Calendar

RSS Feeds

Search

Links

Navigation

Referers