Friday July 23, 2004 Testing and Debugging. I am always interested in how engineers come to their final products. This may be an interesting case study of a problem I encountered recently.
I am reimplementing a driver responsible for AF_UNIX sockets. It should solve serious scalability problems with the current "tl" driver.
To test it I wrote a special DTrace script that traces all entry points to the driver and provides some useful information. When something goes wrong I can look at the trace and figure out what is happening.
To test the driver I've written some simple tests in C which perform basic operations with sockets: bind(), listen(), connect(), accept. Everything works fine. Then I looked at Perl Test::More module and decided to write some simple tests in Perl. Here is the full text of the test:
use strict;
use Fcntl;
use Socket;
use File::Temp;
use Test::More tests => 5;
my $type = SOCK_STREAM;
my $name = "SOCK_STREAM";
my $socket;
my $listener;
my $acceptor;
my $peer;
my $pname;
my $res;
my $sname = tmpnam();
socket(Server, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
socket (Client, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
socket (Acceptor, AF_UNIX, SOCK_STREAM, AF_UNSPEC) or die ("socket: $!");
unlink ($sname);
my $uaddr = sockaddr_un($sname);
$res = bind(Server, $uaddr) or die ("bind: $!");
ok($res, "bind");
$res = listen(Server, 1) or die ("listen: $!");
ok($res, "listen");
$res = connect (Client, $uaddr) or die ("connect: $!");
ok($res, "connect");
accept (Acceptor, Server) or die ("accept: $!");
unlink ($sname);
Suddenly the test failed on accept() with ESPIPE (illegal seek) error!
The C equivalent of the test worked fine. So I got two machines - one without
my driver and one with it and run the test with the monitoring D scripts.
Spending quite some time looking at the results, I didn't find any differences.
So I run truss and saw that accept actually succeeded, but for some reason Perl
version failed.Playing with truss(1) options I produced a huge file tracing calls to libc and libsocket. Everything looked fine with accept itself, but then, for some reason Perl version started to do completely different things. At that point I needed some other victim, so I headed to Bart and we started to look at these two truss outputs together. Bart spotted that the "good" run did malloc() of 17 bytes after accept while the bad one used malloc for one byte. This quickly lead us to conclusion that the problem is in the peer address returned in the accept(3SOCKET) call:
int accept(int s, struct sockaddr *addr, socklen_t *addrlen);
Indeed, a short C program immediately showed that returned addrlen was zero!Going back to the driver source everything seems normal. Some additional DTrace hacking showed exact message received by socket layer that should contain the peer address. Dumping the mblk in DTace shows that instead of options it contains a pointer! Another look at the source code reveals the problem:
t_scalar_t olen = creq->OPT_length;
t_scalar_t ooff = creq->DST_offset;
Here is the bug! Instead of using OPT_offset it uses DST_offset.
It took quite a while to figure this one out.Here is a snippet from the D that helped to track the problem:
fbt:tl:tl_conn_req:entry
{
this->q = args[0];
self->trace = 1;
this->mp = args[1];
this->connreq = (struct T_conn_req *)(this->mp)->b_rptr;
printf ("prim=%d, len=%d, off=%d, olen=%d, ooff=%d",
this->connreq->PRIM_type,
this->connreq->DEST_length,
this->connreq->DEST_offset,
this->connreq->OPT_length,
this->connreq->OPT_offset);
}
fbt::putnext:entry
/self->trace &&
args[1]->b_datap->db_type == 1 &&
*(int*)(args[1]->b_rptr) == 11/
{
self->trace = 0;
this->mp = args[1];
this->tci = (struct T_conn_ind*)this->mp->b_rptr;
this->opt = (struct T_opthdr *)(this->mp->b_rptr + this->tci->OPT_offset);
printf("conn_ind: mp=%p, sz=%d, ol=%d, oo=%d ",
args[1],
msgsize(this->mp),
this->tci->OPT_length,
this->tci->OPT_offset);
printf("olemn=%d, olev=%d, on=%d, os=%d",
this->opt->len,
this->opt->level,
this->opt->name,
this->opt->status);
}
I think it is a good example of how the process works inside. It really helps to use different tools at hand
(and extra pair of eyes!).
While playing with Perl I discovered that it uses fdopen() for all perl File objects, so it is impossible to open more then 256 open sockets (actually 253 if we account for STDIN, STDOUT and STDERR). Here C is still more powerful then Perl.
( Jul 23 2004, 08:24:32 PM PDT ) Permalink
Introduction I am an engineer in the Solaris kernel group.I was working on various projects, including: