This page validates as XHTML 1.0, and will look much better in a browser that supports web standards, but it is accessible to any browser or Internet device. It was created using techniques detailed at glish.com/css/.
Back from a long holiday a collegue asked me to look at why a small c++ application was dying with SIGFPE on x86 boxes running Solaris 10. They had run dbx and truss and had worked out that it was taking a SIGFPE divide by zero trap on a idivl instruction deep in the flush of a i/o stream. The truss showed the fault as
Incurred fault #8, FLTIZDIV %pc = 0x0805065E
siginfo: SIGFPE FPE_INTDIV addr=0x0805065E
Received signal #8, SIGFPE [default]
siginfo: SIGFPE FPE_INTDIV addr=0x0805065E
So that would look like a divide by zero, dbx showed that the instruction was a idivl but the divisor register was not zero !
After a bit of looking at the AMD instruction documants we see that the idiv instruction can generate a "divide error" exception for two reasons - a divide by zero error and an integer overflow. The solaris kernel maps the "divide error" exception onto the FPE_INTDIV trap which truss reports but it could be caused by either cause. In this case we had an integer overlow as the result exceeded the capacity of a signed int. Now the folks who maintain the library that made the stream know to go look at their code.
I thought i was being clever with my algorithm for choosing new passwords, for reasons I can't remember this had "~" as the first character - BIG mistake. "~" is the escape character for lots of things like ssh and our service processor to host protocols.. good job I did not have "~#" or "~." as the first two characters! Something best avoided.
Turning on all kernel's TNF probes gathers you a big blob of data about what is going inside the kernel. Prior to Solaris 10 this is the only way to get accurate timing information for system calls. Recently, as in last night, I was trying to work out from one of these blobs why a write into a ufs filesystem might take a long time. I had the pid of the writing process so I could find all its threads, I could see one of them issue the write() system call, then I kept seeing the thread block and almost wake up before blocking again. It did this a number of times, obviously it was competing for a resource like a semaphore or a condvar or a mutex and not getting it. All tnfdump gives you is the address of the resource.
But if you use tnfdump -rx you see a bit more.
Here is what tnfdump gives you..
1995.449700 0.010800 10768 3 0x3002ec894a0 4 thread_block reason: 0x3002ec89632 stack:
0x63f8e8 : {
tnf_tag 0x22f8 thread_block
tnf_tag_arg 0x63f840
time_delta 0x4fd198
reason 0x3002ec89632
stack 0x63f900
}
0x63f900 : {
tnf_tag 0x2358 tnf_symbols
tnf_self_size 0x38
0 0x1011c800
1 0x100448cc
2 0x1007d5f8
3 0x1007d9e4
4 0x100adaa4
5 0x100adc68
}
So time for a modified tnfdump or another awk script to glue these things together so we can see why threads might block in the kernel.
so more on that when I have it working but until then if you have to gather kernel prex data, send in the raw output file from tnfxtract, a live dump or the nm of /dev/ksyms so we don't loose any information from the data.
Spending a lot of time writing test cases to try and reproduce system panics has led me to use an interesting(ish) methodology. You stare at the data structures in the dump, you stare at the code and see if you can work out how to get things into a similar state. Then comes the difficult bit, I've taken to working out what operations are possible from the userland code and then randomising them using lrand48().
So this weeks exercise has been to reproduce a panic in the poll() code that from code inspection is impossible. The per process file table (indexed by file descriptor) is per process but the poll structures are per lwp, so there is a linked list of interested threads attached to a file entry if any threads are polling on that file entry. In the dump there are 3 threads chained off one file entry. So we know that we have a multi threaded process performing poll() on a single file descriptor from several threads at once. We panic'ed in close as we traversed that list as one of the threads has been reused by a process that does no polling so its per thread poll structures were null. So now we know that it has threads exiting and threads closing the file that we are polling on.
So I wrote a threaded program that opened a net connection and then went into a loop, it randomly started a new thread, those threads then possibly polled on that connection, or possibly exited, or possibly closed the connection. The main thread dealt with all of this, re-opening the connection if it got closed, starting new threads as ones exited - all under the choice of lrand48().
Did this reproduce it ? No, so then I randomised the number and contents of the pollfd array passed to poll() and suddenly the machine paniced with an identical stack trace to the customer's machine - the power of lrand48()
The good news is that it is fixed in solaris 10 already...
Using the opensolaris source at opensolaris.org you can see the source code to sd.c, which builds the sd and ssd scsi/fibre target drivers.
The driver keeps a count of i/o's it has accepted in the un->un_ncmds_in_driver element of the un structure which is the softstate for that disk lun. It keeps the number of i/o's it has passed down to the HBA to transport in the un->un_ncmds_in_transport variable.
New I/O's passed into sd/ssd to the transport are passed straight down to the HBA until un_ncmds_in_transport exceeds the un_throttle for this lun or during error processing when we retry commands sequentually until one succeeeds.
When we have to queue the bufs we sort them onto a linked list anchored in un->un_waitq_headp and un->un_waitq_tailp.
But there is another place that i/o can be queued before it gets to the waitq and throttles - in the xbuf layer. un->un_xbuf_attr points to the xbuf control block which is of type __ddi_xbuf_attr . This has an unsorted waitq anchored in xa_headp and xa_tailp, by default the xbuf layer only allows 256 commands through to the normal sd waitq and throttles, the rest are queued in the xbuf and un_ncmds_in_driver is incremented. So it is possible when a large number of commands to be dropped on a sd lun, un_ncmds_in_driver is very large but the waitq has less than 256 commands on it - the rest are lurking on the xbuf's queue
These xbuf queued i/o are invisible to iostat's wait column - maybe they shouldn't be?
Here is something to think about...
If you are going to do a major change to a machine/application then it is worthwhile investing some time in getting some baseline measurements of the system's configuration/performance before you change anything. Exactly what is worth gathering depends on the machine but I see many escalations where we are forced to explain why after a major change the system appears to be slow but with no possibility to gather any data from before the changes.
An example of this is a customer moving to 64 bit oracle who suffers much slower i/o performance, luckily there are some snippets of sar data from before the change that allowed us to spot that the amount of i/o had greatly increased due to changes in the Oracle configuration
prepare for the worst and it won't happen.
I use this line in my .kshrc to debug ksh scripts
export PS4="ksh debug \$LINENO | "
enoexec ksh: ksh -x utds
ksh debug 33 | ksh debug 33 | uname -s
OS=SunOS
ksh debug 34 | [ SunOS = SunOS ]
ksh debug 36 | PSOPTS=-o ppid,pid,comm
ksh debug 49 | trap echo "$0: Interrupted "; exit 1 INT QUIT
ksh debug 51 | ksh debug 51 | basename utds
USAGE=Usage: utds { start | stop | restart }
ksh debug 54 | UTPRODINFO=/etc/opt/SUNWut/basedir/lib/utprodinfo
ksh debug 55 | [ -x /etc/opt/SUNWut/basedir/lib/utprodinfo ]
ksh debug 57 | ksh debug 57 | /etc/opt/SUNWut/basedir/lib/utprodinfo -p SUNWutdso BASEDIR
ksh debug 57 | 2> /dev/null
BASEDIR=/opt
ksh debug 60 | [ -z /opt ]
ksh debug 64 | PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/SUNWut/srds/lib:/opt/SUNWut/srds/sbin
ksh debug 65 | LD_LIBRARY_PATH=/opt/SUNWut/srds/lib:/home/timu/src/malloclib/sparc/libs
ksh debug 66 | export PATH LD_LIBRARY_PATH
ksh debug 67 | UTPUSHD=/opt/SUNWut/srds/lib/utpushd
ksh debug 68 | UTDSD=/opt/SUNWut/srds/lib/utdsd
ksh debug 77 | ExitCode=0
ksh debug 82 | [ ! -f /etc/opt/SUNWut/srds/current/utdsd.run ]
ksh debug 90 | [ -f /etc/opt/SUNWut/srds/current/utdsd.ini ]
ksh debug 92 | . /etc/opt/SUNWut/srds/current/utdsd.ini
utds[92]: /etc/opt/SUNWut/srds/current/utdsd.ini: not found
I have been drawn into an escalation involving the kernel dispatcher, this is the code that picks and schedules threads onto the available cpu's according to rules in the dispatch tables. It is by far the most complex code in the whole machine and IMHO it is the code that allows Solaris to scale so well.
In this escalation a load of threads including the clock are spinning waiting for another thread to release a dispatcher lock, sadly no one releases it and the machine hangs until the deadman or a cluster heartbeat stops the machine. Dispatcher locks are simple spin locks designed for easy use to protect modifications to the dispatching object when a thread is transitions from sleepq to runq to oncpu. As a thread moves from object to object the thread's t_lockp pointer is moved to point to the lock in the new object.
Now Solaris is quite complex internally especially when you considers doors, these are zero context switch function calls from a thread into another thread. Somewhere in all of the swtch/resume/door code we know that something is switching a thread t's t_lockp from one cpu lock to another cpu lock whilst at the same time ts_update_list() is doing thread_lock(t)/thread_unlock(t) so unlocks a different lock to the one it locked.
My job was made easier by some clever tracing code that a collegue of mine has developed that showed the lock and unlock of different locks, sadly it couldn't spot who manipulated t's t_lockp without having done it's own thread_lock(t) first. Code inspection has us looking at the door code and maybe the interupt pin/unpin code.
It would be good to extend the mechanism we have for statically checking locking sequences in storage drivers to deal with dispatcher locks, maybe that is something the opensolaris community could look at?
Procfs watchpoints are a little used but very powerful debugging tool.
They are documented in man -s4 proc .
They allow a program to be interrupted when it performs a memory
read/write/execute within a specified range of addresses. A simple bit
of code that detects writes to an integer could be based on this ..
--------------------------------------------
#include <sys/types.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/errno.h>
#include <ucontext.h>
#include <procfs.h>
#include <sys/uio.h>
struct thing {
char buffer[64];
int x;
char buffer1[512];
} thingy;
void
sub( int * x)
{
*x=getpid();
}
int
main(int argc, char * argv[])
{
iovec_t iov[2];
struct {
long cmd;
union {
long flags;
prwatch_t prwatch;
} arg;
} ctl;
int pfd;
pfd = open("/proc/self/ctl",O_WRONLY);
if (pfd == -1)
abort()
ctl.cmd = PCWATCH;
ctl.arg.prwatch.pr_wflags = WA_READ|WA_WRITE|WA_EXEC;
ctl.arg.prwatch.pr_vaddr = (caddr_t)&(thingy.x);
ctl.arg.prwatch.pr_size = sizeof(int);
printf("setting watchpoint at address 0x%p size %d\n",
ctl.arg.prwatch.pr_vaddr, ctl.arg.prwatch.pr_size);
(void)write(pfd, (char *)&ctl, sizeof (long)+sizeof (prwatch_t));
sub(&(thingy.x));
}
-------------------------------------------so compiling that into bar and running that under mdb produces..
hoofhearted ksh: mdb bar
> :r
setting watchpoint at address 0x8060de0 size 4
mdb: target stopped at:
sub+0x10: movl %edx,(%eax)
>
> $r
%cs = 0x003b %eax = 0x08060de0 bar`thingy+0x40
%ds = 0x0043 %ebx = 0xbfffb824
%ss = 0x0043 %ecx = 0xbff9da37 libc.so.1`getpid+7
%es = 0x0043 %edx = 0x00000603
%fs = 0x0000 %esi = 0x08047264
%gs = 0x01c3 %edi = 0x080473a0
%eip = 0x08050ab0 sub+0x10
%ebp = 0x08047208
%kesp = 0x00000000
%eflags = 0x00010202
id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
status=<of,df,IF,tf,sf,zf,af,pf,cf>
%esp = 0x08047200
%trapno = 0xe
%err = 0x6
>
So we took a SIGTRAP as soon as we tried to modify thingy.x. If the program had set the breakpoint on its own data then it would probably capture the signal, process it and return. If the breakpoint was set by a debugging process then the debugger would have registered an interest in that signal and could regain control at that point.
So I am using this in my debugging malloc library to watch the areas of the mmap that fall outside of the exact area requested in the call to malloc(). So far I have discovered that strlen has intimate knowledge of malloc and if the base address is aligned it uses a very clever algorithm to find the null terminating byte, but that causes it to use 32 bit load instructions to scan the string, so I have had to change the watchpoint at the end of the returned area to be a write watchpoint until we become 4 byte aligned and then to continue to the end of the page with a read/write watchpoint.
To avoid being accused of endianness I have finally got i86 and amd64 versions working -found a interesting little bug in the kernel whilst I was at it.
The procfs watchpoints work by clearing all permissions of any page that has any part of a watchpoint in it. Then when any access to the page happens we examine the avl tree of per process watch points to see if the accesed address is inside a watched area, if it is we send the SIGTRAP and setup to single step the process on return from the signal and set the page permissions back to normal. After the single instruction has accessed the memory we clear the page permissions and allow the process to run normally again. You can see why it is quite slow!
In the case of sparc the instructions size is 4 bytes and the data access size can be easily calculated from the instruction. On a x86 machine we have to decode the instruction thoroughly to find the instruction and data access size. The small bug I found caused that code to calculate the data access size for a pushl instruction in a 32 bit application running on an amd64 kernel as 8 bytes instead of the correct 4 bytes. So if the data being pushed onto the stack was the last 4 bytes before a watchpoint the kernel thought that you were accessing 8 bytes so triggered the watchpoint, took me all day to find it, I was sure it was my code.
Yes ! Finally after three whole days I have worked out the answer for the segv in malloc code. This program is a graphics application and it only segv's when run on a jfb framebuffer < big hint>. I sat down and read the malloc code - very clever, I should write it up to save others having the learning curve. We segv when we were carving up a big buffer area into two smaller ones, and importantly we were carving up a buffer called Bottom. This buffer contains all the unallocated space, when it runs out we call _morecore() which uses sbrk() to get more heap pages.
In this case early on we have a 19Mbyte malloc so the next call to _morecore() allocates a wee bit more than 0x01000000 bytes, this is important as it has a bit set in the top bye of the size. The size id kept in the 32 bits 8 bytes below the malloc/free pointer. Using dbx's watchi function I noticed that just as we segv the size of Bottom drops from 0x01001238 to 0x00001238 - maybe something is overwriting the bottom bye of the size.
Luckily I have the source to the standard malloc, so I turned it into a library and preloaded it -it still failed. So I used procfs's watchpoint functionality to put a write watchpoint across the size when the code recognised the address. Yes it took a SIGTRAP and it was whilst writing the null byte at the end of a strcpy. The classic sign of someone doing..
foo = malloc(strlen(string)); strcpy(foo, string);
This of course is wrong as it has not allocated room for the trailing NULL ( and it should have some error checking). This is a bit better...
foo = malloc(1 + strlen(string)); strcpy(foo, string);
So who was the culprit.. /usr/openwin/lib/GL/devhandlers/oglSUNWjfb.so.6 , time for a bug.
More worryingly why did no debugging malloc library not catch it? A combination of things, the space was not a multipleof 16 bytes so there was some padding at the end, I will get a simple testcase and log an rfe against libumem to get it to catch these issues.
Back to core dump running code
A collegue asked me to help debug a userland core file a couple of days ago. The stack showed the normal segv in a frame under realfree() that is so typical of a malloc arena corruption. The standard malloc keeps some housekeeping in the 8 bytes below the pointer that malloc returns, once you call free() those 8 bytes and the first few of the allocation are cast into an internal data structure and it is added to a b-tree sorted by size. It is not until another malloc/free causes a change in the tree that corruption is detected. As this can happen a long time after the free the program can segv long after the damage - another reason to use libumem or libwatchmalloc or ...
I thought it would be easy as there was a testcase, just load up libumem or my own debugging malloc library, run the testcase and point out the incorrect code. But it wouldn't fail with those. I interposed on malloc/free/calloc/valloc/memalign but I could see no obvious error like a double free. My current theory is that something is trampling over the structure after it has gone on the free list, time to become more of a dbx power user, or maybe write a mdb dcmd to verify the malloc arena at regular intervals.
Once that is done I can get back to putting the sparc emulator into my crash dump running program.
One off-beat program that comes in useful once or twice a year is a program I call Lazerus. Lazerus is fed a kernel crash dump and a configuration file and then starts executing the kernel code in the crash dump from a point defined in the configuration file. The registers are set up according to instructions in the configuration file. There is a SIGSEGV and a SIGBUS handler which use libkvm to page the correct pages from the crash dump into lazerus's 64 bit address space and then retry the execution. There are many things it can't do as it is running non- priviledged but it runs well enough to be able to watch code flow. I mainly use it to execute complex functions that have well defined actions that are too complex to do by hand, for example functions that calculate and follow a hash chain, where the hash function is based on some tunable and the hash chains are in the dump. It would take a long time to calculate exactly what hash chain was chosen and where it followed the chain to. With Lazerus you set up the registers like %g7 is the thread pointer, %i6 is the stack pointer and %o1..6 are the input arguments, set the pc and npc and then jump into the dump using setcontext().
So the first thing that happens is that a SIGSEGV is taken against the address in %pc, the handler mmap()s a 8k page of memory to an 8k aligned address that spans %pc, it then used kvm_read() to suck the contents of that page from the crash dump into the mmap()'ed space. When we return from the signal handler the kernel continues the program at %pc which now has the instructions copied from the kernel. Soon it will either drop of the end of the page and cause another page to be paged in, or it will branch/jump/call to a new address causing that page to come in. If it accesses data then we will fault against that address and pull it in, straight after pulling a page in we check to see if any of the memory patching directives in the configuration file should be applied to the page.
Version one of Lazerus used a complex method to detect the target of branches/jumps/calls and then used illegal instructions to jump to a handler to get access to the context after every instruction to allow single stepping.
Version two of Lazerus used a much simpler mechanism to perform single stepping. As soon as the 8k page of text was loaded procfs was used to put an execute watchpoint across the whole 8k page. The option to get the breakpoint signal after the instruction had executed was used so SIGTRAP was raised after every instruction allowing the context to be printed out. This worked much better than version one but some functions in the kernel caused an unexpected SIGSEGV, some investigation showed that the new libthread in solaris 9 and solaris 10 does a good job of ensuring that %g7 is set to the userland thread address causing any kernel function that uses CURRTHREAD to get the wrong data. Version two is much better but doesn't work.
Buried deep in my src/C directory is a program that emulates the more common sparc instructions. It has a software register file with 8 windows. It was used to investigate a problem with ld.so.1 which was almost impossible to debug until dtrace gave us the proc provider.
So the plan is to ditch the model of using a real sparc processor in userland to execute code from a crash dump and to use the software emulator ( there are several full function emulator available that are used to bringup new processors before silicon is ready but writing your own is always a good way to ensure you understand the instruction set) between each step and have it chew through the kernel code.
Todays goal, between working on escalations, was to get the code that processes the configuration file into decent shape. Over the years I have refined my option processing code, I tend not to use command lines as these can be forgotten but to put every thing in a configuration file so tidy code to process this file is essential. The current incarnation of option processing uses a structure to define the options, the type of data, the bounds and if necessary a function to call to affect the option, the actual code is only about 40 lines but the structure has all the work, hopefully it can be re-used.
Early next week I will get the kvm_read() code and context dumping code running, more then...
During the week I collected several escalations that required analysis of userland core files. In both cases I was supplied the core file and a showrev -p and asked for a quick analysis. So my choices were to either..
Point our jumpstart configuration scripts at the showrev-p output and install a machine so that I have all the right libraries in the right place. Machines are an expensive resource and it seems a waste to book and occupy a whole machine just for this.
Ask the customer for some more files from his machine. The binary that generated the core and then all the shared libraries listed in the output of pldd core. With those I can build a dbxrc that maps the loads of the libraries to use the customer's ones. Once I have these I can debug the core on any machine where dbx is available.
In solaris 10 coreadm gives you much better control over what parts of the faulting processes address space end up in the core file. The default will include the shared library segments so core files will be bigger but debugging them wont need all the other files.
All modern Sparc Sun machines use the PCI bus to connect peripherals to the system. A PCI bus is a physically addressed bus that can best be represented as an inverted tree. Each level of bus is connected to the next using pci bridge device. The bridge knows the phyical addresses of the devices below itself so it accepts transactions from the higher bus and passes them down. The root of the bus is the host-to-pci bus bridge.
This means that physical PCI bus addresses must be unique within a
tree of buses but can be duplicated between seperate trees. This means
that there must be a unique way of identifying which tree a pci command
is being sent to. Solaris does this by assigning segment of the SPARC
chips physical addressable space to each host-to-pci bridge.
If you use prtconf -vp
you can find the host-to-pci bus devices and all their properties.
On modern SPARC machine look for schizo devices, on slightly older ones
look for psycho nodes. You will find the ranges property , this is in 4
segments , one for configuration space, one for i/o space, one for 32
bit memory and one for 64 bit memory. Encoded in each is the base
address and the size.
ranges: base address size
config 00000000.00000000. 00000000.00000402.00000000. 00000000.01000000.
i/o 01000000.00000000. 00000000.00000402.01000000. 00000000.01000000.
mem32 02000000.00000000. 00000000.0000041c.00000000. 00000000.80000000.
mem64 03000000.00000000. 00000000.0000041c.00000000. 00000000.80000000
So when the PCI bus goes wrong the machine may panic with a PCI Bus
Error/BERR which causes the system to print out a fault address AFAR.
This is the SPARC physical address that caused the error, so to decide
on which bus the fault happenned match the AFAR to a segment supplied
by the ranges property of a host-to-pci bridge.
The next problem is to work out which device on that tree of buses was being accessed. So subtract the base of the host-to-pci bus segment we found above to get the PCI bus physical address that is unique on this tree of PCI buses. You then have to add the matching assigned-addresses property and the regs property then using the size from the assigned-addresses property you get a base and length to try and match.
type address sizeassigned-addresses: 82010010. 00000000.08000000. 00000000.01000000...
reg: 02010010. 00000000.00002000. 00000000.00800000...
sum 10 = mem32 00000000.08002000. 00000000.01000000
so this mem32 segment runs from PCI physical address 0x08002000 for length 0x01000000
It sounds easy but I have been writting code to automate this
for a couple of days now!