Thursday Mar 12, 2009

A problem I encountered a few weeks ago was the following:

$ svcs -l sendmail
fmri         svc:/network/smtp:sendmail
svcs: svcs.c:335: Unexpected libscf error: invalid argument. Exiting. 

"svcs -a" produced no error and the correct output whereas "svcs -xv" also produced the above error.  It didn't matter which SMF service we chose, the same problem was seen when using the '-l' or '-xv' flags.  The question we needed to answer was what function within libscf was returning this error?  The source code showed several places where SCF_ERROR_INVALID_ARGUMENT could be set.

We can see that /usr/bin/svcs links with libscf.so.1:

$ ldd /usr/bin/svcs
    libcontract.so.1 =>     /usr/lib/libcontract.so.1
    libscf.so.1 =>     /usr/lib/libscf.so.1
    libuutil.so.1 =>     /usr/lib/libuutil.so.1
    libumem.so.1 =>     /usr/lib/libumem.so.1
    libc.so.1 =>     /usr/lib/libc.so.1
    libnvpair.so.1 =>     /usr/lib/libnvpair.so.1
    libgen.so.1 =>     /usr/lib/libgen.so.1
    libnsl.so.1 =>     /usr/lib/libnsl.so.1
    libmp.so.2 =>     /usr/lib/libmp.so.2
    libmd.so.1 =>     /usr/lib/libmd.so.1
    libm.so.2 =>     /usr/lib/libm.so.2 

As a first attempt at troubleshooting the problem we'd like to use the PID provider to look at what functions within libscf are called during the execution and see which are returning SCF_ERROR_INVALID_ARGUMENT.  However the first attempt produced an error:

$ dtrace -n 'pid$target:libscf.so.1::return /pid == $target/ {printf("= %d\n", arg1);}' -c "svcs -l sendmail" -o /var/tmp/dtrace.svcs.out
dtrace: invalid probe specifier pid$target:libscf.so.1::entry /pid == $target/ {printf("= %d\n", arg1);}: probe description pid1153:libscf.so.1::entry does not match any probes 

The issue here is that DTrace isn't loading the libscf.so.1 even though /usr/lib is in the LD_LIBRARY_PATH.  The trick/workaround is to use the LD_PRELOAD variable to pre-load the library before DTrace attaches.

$ export LD_PRELOAD_32=/usr/lib/libscf.so.1

Note we use LD_PRELOAD_32 for 32bit libraries and LD_PRELOAD_64 for 64bit libraries.  In theory setting LD_PRELOAD should allow the compiler/linker to select the correct ELF library but in this case it didn't which is why we use LD_PRELOAD_32 to force 32bit.

Running the DTruss on a working host we now get:

$  dtrace -n 'pid$target:libscf.so.1::return /pid == $target/ {printf("= %d\n", arg1);}' -c "svcs -l sendmail" -o /var/tmp/dtrace.svcs.out
dtrace: description 'pid$target:libscf.so.1::return ' matched 429 probes
fmri         svc:/network/smtp:sendmail
name         sendmail SMTP mail transfer agent
enabled      true
state        online
next_state   none
state_time   15 February 2009 01:49:26 GMT
logfile      /var/svc/log/network-smtp:sendmail.log
restarter    svc:/system/svc/restarter:default
contract_id  84 
dependency   require_all/refresh file://localhost/etc/mail/sendmail.cf (online)
dependency   require_all/refresh file://localhost/etc/nsswitch.conf (online)
dependency   optional_all/none svc:/system/filesystem/autofs (online)
dependency   require_all/none svc:/system/filesystem/local (online)
dependency   require_all/none svc:/network/service (online)
dependency   require_all/refresh svc:/milestone/name-services (online)
dependency   optional_all/refresh svc:/system/identity:domain (online)
dependency   optional_all/none svc:/system/system-log (online)
dtrace: pid 1406 has exited


This produced a 651k output file.  A better way might be to look for any functions that return -1, or SCF_ERROR_INVALID_ARGUMENT.

$ dtrace -n 'pid$target:libscf.so.1::return /(pid == $target) && (arg1 == -1)/ {printf("= %d\n", arg1);}' -c "svcs -l sendmail" -o /var/tmp/dtrace.svcs.out

Unfortunately the original problem was never fully investigated as the host was rebuilt before root cause could be established.

I was working on an issue that wasn't reproducible at will but had a failure frequency low enough to dtruss.  The problem with this approach is that it can leave you with a large dataset.  Unless you're actively watching for errors in /var/adm/messages or application logs and can quickly cross-reference or mark the dtruss output the only option is to post-process the ELAPSD column and calculate where in the dtruss dataset you need to look.  Here lies the first issue.  By default dtruss doesn't provide the date & time when the dtruss was started.  Easily fixed by running "date; dtruss <options>" but not ideal.  The first enhancement I've added to Brendan's dtruss is to print a walltimestamp in the dtrace:::BEGIN to make it similar to the truss(1M) output.  The second enhancement I added is to embed the walltimestamp in the truss output as a new column, the first column, which makes finding the relevant bit of dtruss trivial. I've added the '-w' flag to dtruss, also active via the '-a' argument.

# USAGE: dtruss [-acdeflhoLsw] [-t syscall] { -p PID | -n name | command }
#
#               -p PID          # examine this PID
#               -n name         # examine this process name
#               -t syscall      # examine this syscall only
#               -a              # print all details
#               -c              # print system call counts
#               -d              # print relative timestamps (us)
#               -e              # print elapsed times (us)
#               -f              # follow children as they are forked
#               -l              # force printing of pid/lwpid per line
#               -o              # print on cpu times (us)
#               -s              # print stack backtraces
#               -L              # don't print pid/lwpid per line
#               -w              # print walltimestamp (system date/time) 
#               -b bufsize      # dynamic variable buf size (default is "4m") 

Here's a quick example of my dtruss running showing both enhancements:

$ pfexec  ./dtruss -a -p `pgrep nwamd`
dtruss started tracing at 2009 Mar 12 22:57:44 
TIMESTAMP              PID/LWP    RELATIVE  ELAPSD    CPU SYSCALL(args)          = return
2009 Mar 12 22:57:55    4310/11:        65 10010277     11 lwp_park(0x0, 0xFDFCEF3C, 0x0)         = -1 Err#62
2009 Mar 12 22:58:05    4310/11:       118 10009670     10 lwp_park(0x0, 0xFDFCEF3C, 0x0)         = -1 Err#62
^C

CALL                                        COUNT
lwp_park                                        3

You can download my updated version of dtruss here.  Enjoy.

This blog copyright 2009 by Steve Scargall