|
So a few posts ago I asked for some suggestions on improving observability in Solaris, specifically with respect to LSOF. I thought I'd summarize the responses, which fell into two basic groups:
- Socket and process visibility. Something along the lines of lsof -i or netstat -p on Linux.
- Per-process mpstat, vmstat, and iostat.
I'll defer the first suggestion for the moment. The second suggestion is straightforward, thanks to the mystical powers of DTrace. As you can see from my previous post, it's simple to aggregate I/O on a per-process basis. Thanks to the vminfo and sysinfo DTrace providers, we can do the same for most any interesting statistic. The problem with traditional kstats1 is that they present static state after the fact - you cannot tell why or when a counter was incremented. But for every kstat reported by vmstat and mpstat, a DTrace probe exists wherever it's incremented. Throw in some predicates and aggregations, and we're talking instant observability.
I envision two forms of these tools. The first, as suggested in previous comments, would be present prstat(1) style output, sorted according to the user's choice of statistic. This would be aimed as administrators trying to understand systemic problems. The second form would take a pid and show all the relevant statistics for just that process. This would be aimed at developers trying to understand their application's behavior.
Today, anyone can write D scripts to do this. But there's something to be said for having a canned tool to jumpstart analysis. It doesn't have to be too powerful; once you get beyond these basic questions you'll be needing to write custom D scripts anyway. I'm sure the DTrace team has given this far more thought than I have, but I thought I'd let you know that your comments aren't descending into some kind of black hole. Blogging provides a unique forum for customer conversations; somewhere between a face to face meeting (which tends to not scale well) and a newsgroup posting (which lacks organization and personal attention). Many thanks to those in Sun who pushed for this new forum, and those of you out there reading and taking advantage of it.
1 The statistics used by these tools are part of the kstat(1M) facility. The kernel provides any number of statistics from every different subsystem, which can be extracted through a library interface and processed by user applications.
In my previous blog post, it was mentioned that it would be great to have a prstat-like tool for showing the most I/O hungry processes. As the poster suggested, this is definitely possible with the new io provider for DTrace. This will be available in the next Solaris Express release; see Adam's DTrace schedule for more information.
For fun, I hacked together a quick DTrace script as a proof of concept. Five minutes later, I had the following script:
#!/usr/sbin/dtrace -s
#pragma D option quiet
BEGIN
{
printf("%-6s %-20s %s\n", "PID", "COMMAND", "BYTES/SEC");
printf("------ -------------------- ---------\n");
last = timestamp;
}
io:::start
{
@io[pid, execname] = sum(args[0]->b_bcount);
}
tick-5sec
{
trunc(@io, 10);
printf("\n");
normalize(@io, (timestamp - last) / 1000000000);
printa("%-6d %-20s %@d\n", @io);
trunc(@io, 0);
last = timestamp;
}
This is truly a rough cut, but it will show the top ten processes issuing I/O, summarized every 5 seconds. Here's a little output right as as I kicked off a new build:
# ./iotop.d
PID COMMAND BYTES/SEC
------ -------------------- ---------
216693 inetd 5376
100357 nfsd 6912
216644 nohup 7680
216689 make 8192
0 sched 14336
216644 nightly 20480
216710 sh 20992
216651 newtask 46336
216689 make.bin 141824
216710 java 1107712
216746 sh 7168
216793 make.bin 8192
216775 make.bin 9625
216781 make.bin 13926
216767 make.bin 14745
216720 ld 32768
216713 nightly 77004
216768 make.bin 78438
216740 make.bin 174899
216796 make.bin 193740
216893 make.bin 9011
216767 make.bin 9011
216829 make.bin 9830
216872 make.bin 9830
216841 make.bin 19046
216851 make.bin 21504
216907 make.bin 31129
216805 make.bin 54476
216844 make.bin 81920
216796 make.bin 117350
^C
#
In this case it was no surprise that 'make.bin' is doing the most I/O, but things could be more interesting on a larger machine.
While D scripts can answer questions quickly and effectively, you run out of rope pretty quickly when trying to write a general purpose utility. We will be looking into writing a new utility based off the C library interfaces1, where we can support multiple options and output formats, and massage the data into a more concise view. DTrace is still relatively new; in many ways it's like living your entire life inside a dome before finding the door to the outside. What would your first stop be? We're not sure ourselves2, so keep the suggestions coming!
1 The lockstat(1M) command is written using the DTrace lockstat provider, for example.
2 One thing that's for sure is Adam's work on userland static tracing and the plockstat utility. Stay tuned...
Over the years, we have been keeping an eye on the LSOF (LiSt Open Files) utility written by Vic Abell. It's a well respected utility that provides indispensible information that cannot be found any other way. We have tried to emulate some aspects of lsof, but we still fall short in several categories. As part of the RAS group, I'm curious to know what observability features our system is lacking. This is a huge open-ended question, so I'm trying to limit the scope by performing a direct comparison to lsof. If you're not familiar with lsof, then consider it a black box that will answer any questions about open files or connections on your system.
We have started to get better at this. The fuser(1M) and pfiles(1) commands are a decent start, especially since pfiles started having path information in Solaris 10. These two utilities still fail to address one major area of concern: systemic problems. We can tell you who has a given file open, and we can tell you which files a process has open, but we can't tell you who has NFS files open on your system or which processes are bound to a particular port. Running pfiles on every process in the system is not an acceptable solution, if only because it involves stopping every single process on your system. Dtrace can identify problems as they occur, but it can't give you a snapshot of the current state of the system.
I am not looking to clone lsof, but I do think we should be able to answer some of the questions that our customers are asking. This is not the first time this has come up within the Solaris group, but it is something I want to explore as we look beyond Solaris 10. I don't need to know all of lsof's features; I can read the manpages easily enough. So my question to you admins and developers out there is this: What specific questions does lsof allow you to answer that you otherwise cannot determine with the stock Solaris tools, and how does it make your life easier?
One of the most powerful but least understood aspects of the Solaris /proc implementation is what's known as the 'agent lwp'. The agent is a special thread that can be created on-demand by external processes. There are a few limitations: only one agent thread can exist in a process, the process must be fully stopped, and the agent cannot perform fork(2) or exec(2) calls. See proc(4) for all the gory details. So what's its purpose?
Consider the pfiles command. It's pretty easy to get the number of file descriptors for the process, and it's pretty easy to get their path information (in Solaris 10). But there's a lot of information there that can only be found through stat(2), fcntl(2), or getsockopt(3SOCKET). In this situation, we have generally three choices:
- Create a new system call. System calls are fast, but there aren't many of them, and they're generally reserved for something reasonably important. Not to mention the duplicated code and hairy locking problems.
- Expose the necessary information through /proc. This is marginally better than the above. We still have to write a good chunk of kernel code, but we don't have to dedicate a system call for it. On the other hand, we have to expose a lot of information through /proc, which means it's a public interface and will have to be supported for eternity. This is only done when we believe the information is useful to developers at large.
- Using the agent lwp, execute the necessary system call in the context of the controlled process.
For debugging utilities and various tools that are not performance critical, we typically opt for the third option above. Using the agent LWP and a borrowed stack, we do the following: First, we reserve enough stack space for all the arguments and throw in the necessary syscall intructions. We use the trace facilities of /proc to set the process running and wait for it to hit the syscall entry point. We then copy in our arguments, and wait for it to hit the syscall exit point. We then extract any altered values that we may need, clean up after ourselves, and get the return value of the system call.
If all of this sounds complicated, it's because it is. When you throw everything into the mix, it's about 450 lines of code to perform a basic system call, with many subtle factors to consider. To make our lives easier, we have created libproc, which includes a generic function to make any system call. libproc is extremely powerful, and provides many useful functions for dealing with ELF files and the often confusing semantics of /proc. Things like stepping over breakpoints and watchpoints can be extremely tricky when using the raw proc(4) interfaces. Unfortunately, the libproc APIs are private to Sun. Hopefully, one of my first tasks after the Solaris 10 crunch will be to clean up this library and present it to the world.
There are those among us who have other nefarious plans for the agent LWP. It's a powerful tool that I find interesting (and sometimes scary). Hopefully we can make it more accesible in the near future.
Those of you who have used a UNIX system before are probably familiar with the /proc filesystem. This directory provides a view of processes running on the system. Before getting into the gory details of the Solaris implementation (see proc(4) if you're curious), I thought I would go over some of the different variants over the years. You'll have to excuse any inaccuracies presented here; this is a rather quick blog entry that probably doesn't do the subject justice. Hopefully you'll be inspired to go investigate some of this on your own.
Eighth Edition UNIX
Tom Killian wrote the first implementation of /proc, explained in his paper1 published in 1984. It was designed to replace the venerable ptrace system call, which until then was used for primitive process tracing. Each process was a file in /proc, allowing the user to read and write directly to the file, rather than using ptrace's cumbersome single-byte transfers.
SVR4
The definitive /proc implementation, written by Roger Faulkner2 and Ron Gomes, explained in their paper3 published in 1991. This was a port of the Eighth Edition /proc, with some enhancements. The directory was still a flat directory, and each file supported read(), write(), and ioctl() interfaces. There were 37 ioctls total, including basic process control, signal/fault/syscall tracing, register manipulation, and status information. This created a powerful base for building tools such as ps without needing specialized system calls. Although useful, the system was not very user friendly, and was not particularly extensible. This system was brought into Solaris 2.0 with the move to a SVR4 base.
Solaris 2.6
The birth of the modern Solaris /proc, first conceived in 1992 and not fully implemented until 1996. This represented a massive restructuring of /proc; the most important change being that each pid now represented a directory. Each directory was populated with a multitude of files which removed the need for the ioctl interface. Process mappings, files, and objects can all be examined through calls to readdir() and read. Each LWP (thread) also has its own directory. The files are all binary files designed to be consumed by programs. The most interesting file is the /proc/<pid>/ctl file, which provides similar functionality of the old ioctl interfaces. The number of commands originally at 27, and they were much more powerful than their ioctl forbears. Very little has changed since this original implementation; only two new entries have been added to the directory, and only 4 new commands have been added. The tools built upon this interface, however, have changed dramatically.
4.4 BSD
The BSD kernel implements a version of procfs somewhere between the SVR4 version and the Solaris 2.6 version. Each process has its own directory, but there are only 8 entries in each directory, with the ability to access memory, registers, and current status. The control commands available are fairly primitive, allowing only for attach, detach, step, run, wait, and signal posting. In later derivatives (FreeBSD 4.10), the number of directory entries has expanded slightly to 12, though the control interfaces seem to have remained the same.
Linux
Linux takes a much different approach towards /proc. First of all, the Linux /proc contains a number of files and directories that don't directly relate to procceses. Some of these files are migrating to the /system directory in the 2.6 kernel, but /proc is still a dumping ground for all sorts of device and system-wide files. Secondly, all the files are all plaintext files; a major departure from the historical use of /proc. A good amount of information is available in the the /proc pid directory, but the majority of control is still done through interfaces such as ptrace. I will certainly spend some time to see how things like gdb and strace interact with processes.
We in Solaris designed /proc as a tool for developers to build innovative solutions, not an end-user interface. The Linux community believes that 'cat /proc/self/maps' is the best user interface, while we believe that pmap(1) is right answer. The reason for this is that mdb(1), truss(1), dtrace(1M) and a host of other tools all make use of this same information. It would be a waste of time to take binary information in the kernel, convert it to text, and then have the userland components all write their own (error prone) parsing routines to convert this information back into a custom binary form. Plus, we can change the options and output format of pmap without breaking other applications that depend on the contents of /proc. There are some very interesting ways in which we leverage this information which I'll cover in future posts.
1 T. J. Killian. Processes as Files. Proceedings of the USENIX Software Tools Users Group Summer Conference, pp 203-207, June 1984.
2 Roger Faulkner was then at Sun, and continues to work here in a one-man race to be the oldest kernel hacker on the planet. These days you can find him in Michigan, pounding away at the amd64 port.
3 R. Faulkner and R. Gomes. The Process File System and Process Model in UNIX System V. USENIX Conference Proceedings. Dallas, Texas. January 1991.
All you developers out there are probably well acquainted with corefiles. Every CS student has had a program try to dereference a NULL pointer at least once. Frequent use of assert(3c) causes your program to abort(3c) in exceptional circumstances. Unfortunately, too many developers know little else about corefiles. They are often used for nothing more than a stack backtrace or a quick session with dbx. We in the Solaris group take corefiles very seriously - they get the same amount of attention as crash dumps. Over the past few releases, we've added some great features to Solaris relating to corefiles that not everyone may be familiar with, including some really great stuff in Solaris 10. Here is a short list of some of the things that can make your life easier as a developer, especially when servicing problems from the field.
gcore
The gcore(1) command will generate a corefile from a running program - essentially a snapshot at that point in time. The process will continue on as if nothing has happened, crucial when an app is misbehaving in a non-fatal way and you don't want to resort to SIGABRT. Rather than trying to reproduce it or get access to the system while it is running, the customer can simply gcore the process and forward the corefile.
coreadm
This is a command that system administrators and developers alike should be familiar with. If you run a non-development server, processes should never coredump. Unless it's intentional (like sending SIGABRT or mucking about in /proc), every corefile produced is a bug. Admins can log all corefiles to a central location, so they know whom to blame when something goes wrong (usually us). Developers can have fine-grained control over the content of the corefile and where it gets saved. Having dozens of files named 'core' in every directory usually isn't the most helpful thing in the world.
corefile content
Starting in Solaris 10, we now have fine-grained control over the exact content of every corefile generated on the system (many thanks to Adam Leventhal for this). Read up on coreadm(1M) for all the gory details, but the most important thing is that we now have library text segments with the corefile. It used to be that if you got a core from a customer, you would need to find a matching version of every library they linked to in order to decipher what was going on. This made debugging complicated customer problems extremely difficult.
CTF data for libraries
We have supported a special form of debugging information known as CTF ("Compact Type Format") in the kernel since Solaris 9. We take the debugging information generated by the '-g' compiler flag, and strip out everything but the type information. It is stored in a compact format so it is suitable for shipping with production binaries. This information is enormously useful, so we added userland support for it in Solaris 10. MDB consumes this information, so you can do interesting things like ::print a socket structure from a core generated by your production app. Unfortunately, the tools used to convert this information from STABS are not publicly available yet, so you cannot add CTF data to your own application. We're working on it.
In the Solaris group, we take a great deal of pride in our debugging tools. Every tool we design (truss, DTrace, MDB, KMDB, ptools, libumem) helps reduce time to root cause from days to hours or minutes, in addition to solving previously unsolvable problems. Over the past week or two I've some conversations about the state of Linux debugging, especially kernel debugging. Since I haven't had much Linux experience, I thought I'd do a little research on our competitor. All I can say is that the Linux community doesn't seem to take the same pride in debugging tools as we do. In particular, I was shocked by the lack of post-mortem debugging tools available for the linux kernel. While this may not be immediately useful for you developers out there, I find it interesting, and later I will dig into post-mortem debugging for user processes.
Post mortem debugging begins with a 'dump'. For processes, this is a usually called a 'core' (most UNIX users have probably seen something like 'segmentation fault - core dumped' before). For kernels, it is called a 'crash dump'. When something bad happens the system will 'panic', and desperately try to stop the world and write out the current machine state while relying on as little kernel state as possible. The world of a panicking kernel is a very scary place; I won't go into how exactly this works. Suffice to say that after you reboot the system, you will end up with a file such as /var/crash/machine/{vmcore,unix}.11
These crashdumps are essential to us kernel engineers, and the failure of the linux community to recognize this2 is shocking. The reason that these dumps are so important is that they are often the only evidence we will ever have of the bug. In particular, we must be able to root-cause a problem from a single crash dump. If a production system panics for one of our Fortune 500 customers, we can't exactly go up to them and say "Hmmm, that's too bad. Can you run this instrumented kernel with these flags and make it happen again?" Especially since we've already cost our customer a lot of money3 by failing in the first place. We can try to reproduce it in development, but as Bryan points out, this can be an error-prone and often futile process. With post-mortem debugging, the customer simply sends us a crash dump and we do all our analysis on this file while their system continues to (hopefully) run happily in production.
This seemingly simple idea forms the foundation of Solaris debugging (really part of a trifecta now that DTrace and KMDB have arrived). On the Linux side, there is nothing built into the base kernel. When a Linux kernel 'oops' happens (their name for a panic), all you get to work with is a simple message, the current machine state, and a stack backtrace. As you will see, only the most trivial problems can be solved with this limited set of information. There is a patch to add this functionality, known as Linux Kernel Crash Dumps (LKCD). Surprising as it may seem, this patch has yet to make it into the base kernel. The politics around this are simply astounding. A good overview of why is covered here, with Linus's initial rejection here. The reasons seem largely political, but there is a definite statement that this doesn't belong in the base kernel - it should be a vendor-added "feature". The last patch available seems to be for kernel version 2.5.44, so the project may be falling into disrepair. Without crash dumps, your only recourse is to try and reproduce on an instrumented kernel. Even if you can reproduce it, this usually means inserting printf() statements: the lack of a coherent kernel debugger such as KMDB is a subject for a future post. My favorite quote is from this email, which states:
If it becomes apparent through empirical data that crash dumps are a useful tool, I'm sure that Linus will become far more amenable. Until then, lets let him handle all of his other work which needs to get done.
You need empirical evidence to prove that crash dumps are useful? Try looking at any of the tens of thousands of kernel bugs in our database. At the bottom of this post you'll find a random bug that I analyzed not that long ago. Don't worry if you don't have a clue what it means; it may not be comprehensible to anyone outside of the Solaris kernel group, but it's a nice show of MDB's power on a real-world bug.
Note that the line that glazes over "a little ::kgrep | ::whattype magic" is actually really interesting. ::kgrep will search for references to an address anywhere in the kernel. See Bryan Cantrill's AADEBUG paper for information on the gory details of ::typegraph and ::whattype. With their powers combined, I was able to find the lock address in the stack of the thread shown.
I've read a little bit of the LKCD documentation, and I'm not convinced that the bug below could have been diagnosed with their set of commands, and certainly it would have been a much more painful task. If you've made it this far through my post, you are probably totally confused and/or or screaming "but how can I benefit from this?" Well, MDB works just as well on user processes (and cores) as it does on kernel crash dumps. A single tool can debug kernel crash dumps, live kernels, process cores, and live processes. I look forward to exploring MDB in much more detail in future posts.
Bug excerpt:
A quick cursory glance shows the the following:
> ::walk thread | ::findstack ! munges
88 ################################## tp: 300072ebb40
cv_waituntil_sig+0x8c()
semop+0x548()
syscall_trap+0x88()
72 ################################## tp: 2a100017ce0
taskq_thread+0x110()
thread_start+4()
54 ################################## tp: 3000ee13ca0
rw_enter_sleep+0x144()
as_pageunlock+0x48()
default_physio+0x340()
pread+0x244()
syscall_trap+0x88()
We have 54 threads waiting for an address space lock. Not necessarily a
smoking gun, but a good place to start. All of these threads are hung up on
the following rwlock:
> 0x3000eda5358::print rwlock_impl_t
{
rw_wwwh = 0xb
}
So someone has a reader lock and we're all stuck waiting. Maybe we have a
recursive rw_enter(RW_READER) somewhere? This rwlock belongs to the following
address space:
> 3000eda5328::print -a struct as a_lock
{
3000eda5358 a_lock._opaque = [ 0xb ]
}
> 3000eda5328::as2proc | ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 2673 1 2673 2673 30 0x00004000 0000030011f0e030
oracle_mt_x04_fb
So who owns the current reader lock? With a little ::kgrep | ::whattype magic
the owner appears to be:
> 30013e05220::findstack
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
000002a1011a0a51 biowait+0x60()
000002a1011a0b01 ufs_getpage_miss+0x2f0()
000002a1011a0c01 ufs_getpage+0x6a8()
000002a1011a0d61 fop_getpage+0x48()
000002a1011a0e31 segvn_fault+0x834()
000002a1011a0fe1 as_fault+0x4a0()
000002a1011a10f1 pagefault+0xac()
000002a1011a11b1 trap+0xc14()
000002a1011a12f1 utl0+0x4c()
>
In this thread, we've taken the address space lock in order to fault in a page,
but UFS is stuck in biowait(). Poking around, we see there are a bunch of
threads similarly stuck:
> ::walk thread | ::findstack ! grep biowait | wc -l
28
So now something's amiss in the I/O system, which has had a cascading effect on
the VM system, and all hell breaks loose. The particular buf that we're
interested in is:
> 30006a5a048::print struct buf b_vp | ::vnode2path
/dev/dsk/../../devices/ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@1,0:b
So we're doing some I/O to one of the disks (that's a shocker). There are several
more suspicious stacks in the dump with traces like:
> 2a1009d9ce0::findstack
stack pointer for thread 2a1009d9ce0: 2a1009d85c1
000002a1009d8671 swtch+0x78()
000002a1009d8721 turnstile_block+0x5f8()
000002a1009d87d1 mutex_vector_enter+0x3dc()
000002a1009d8881 aio_done+0x1c4()
000002a1009d8951 volkiodone+0x364()
000002a1009d8ad1 volsiodone+0x544()
000002a1009d8bb1 vol_subdisksio_done+0x68()
000002a1009d8c71 volkcontext_process+0x29c()
000002a1009d8d41 voldiskiodone+0x368()
000002a1009d8e41 ssd_return_command+0x198()
000002a1009d8ef1 ssdintr+0x224()
000002a1009d8fa1 qlc_fast_fcp_post+0x188()
000002a1009d9051 qlc_isr+0x378()
000002a1009d9151 qlc_intr+0xd0()
000002a1009d9221 pci_intr_wrapper+0x9c()
000002a1009d92d1 intr_thread+0x130()
000002a100969171 disp_getwork+0xe8()
This thread is blocked on the following mutex:
> 000002a1009d8881+7ff::print struct frame fr_arg[4]
fr_arg[4] = 0x3001967c060
Which is the aio_mutex for this aio structure:
> 0x3001967c000::print -a aio_t aio_mutex
{
3001967c060 aio_mutex._opaque = [ 0x30018878c21 ]
}
The owner of this mutex is here:
> 0x30018878c20::findstack
stack pointer for thread 30018878c20: 2a10175cd51
[ 000002a10175cd51 turnstile_block+0x5f8() ]
000002a10175ce01 rw_enter_sleep+0x144()
000002a10175ceb1 as_pageunlock+0x48()
000002a10175cf71 aphysio_unlock+0x6c()
000002a10175d021 aio_cleanup_cleanupq+0x2c()
000002a10175d0d1 aio_cleanup+0xc4()
000002a10175d181 aio_cleanup_thread+0x9c()
000002a10175d231 kaioc+0xe0()
000002a10175d2f1 syscall_trap+0x88()
>
Now it seems like we've finally found the culprit. aio_cleanup_cleanupq() takes
the aio_mutex lock like so:
[... source code removed ...]
We then call aphysio_unlock(). This particular aio has AIO_PAGELOCKDONE set, so
we go to unlock the page:
[... source code removed ...]
For as_pageunlock(), we need the address space lock.
The address space lock is held by thread 30013e05220, which is stuck waiting for
I/O to complete. Now we're in an ugly deadlock, since the I/O can't complete
while the as lock is held.
1 Essentially, we write out all of kernel memory to a reserved place on disk. Usually this is swap space, but it can be configured through dumpadm(1M). When the system reboots, savecore(1M) extracts this information and puts it in a more permanent location.
2 There are many people in the Linux community, so obviously this blanket statement doesn't apply to everyone. Whatever the local politics, I can't see how this wasn't implemented in Linux 0.1.0. Although it did take us until Solaris 7 to enable savecore(1M) by default.
3 We have customers that lose more than $100,000 per minute that their system is down. There is a reason why Solaris has to be bulletproof.
One of the most visible features that I have integrated into Solaris 10 is the ability to store pathnames with each open file1. This allows new avenues of observability that were previously inaccessible. First off, we simply have the files as symbolic links in /proc/<pid>/path:
$ ls -l /proc/`pgrep Firebird`/path | cut -b 55-
0 -> /devices/pseudo/mm@0:null
1 -> /home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0
10 -> /usr/local/MozillaFirebird/chrome/comm.jar
11 -> /usr/local/MozillaFirebird/chrome/en-US.jar
12 -> /usr/local/MozillaFirebird/chrome/embed-sample.jar
13 -> /usr/local/MozillaFirebird/chrome/pipnss.jar
14 -> /usr/local/MozillaFirebird/chrome/pippki.jar
15 -> /usr/local/MozillaFirebird/chrome/US.jar
16 -> /usr/local/MozillaFirebird/chrome/en-unix.jar
17 -> /usr/local/MozillaFirebird/chrome/classic.jar
18 -> /usr/local/MozillaFirebird/chrome/toolkit.jar
19 -> /usr/local/MozillaFirebird/chrome/browser.jar
2 -> /home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0
20
21
22 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_MAP_
23 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_001_
24 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_002_
25 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/Cache/_CACHE_003_
26
27 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/formhistory.dat
28 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/history.dat
29 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/cert8.db
3
30 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/key3.db
4 -> /var/run/name_service_door
5 -> /home/eschrock/.phoenix/default/7pkwqbju.slt/XUL.mfasl
6
7
8
9
a.out -> /usr/local/MozillaFirebird/MozillaFirebird-bin
cwd -> /home/eschrock
root -> /
ufs.102.0.11082 -> /usr/lib/iconv/646%UTF-16LE.so
ufs.102.0.11521 -> /usr/lib/iconv/UTF-16LE%646.so
[ ... output elided ... ]
$
As usual, mozilla firebird has lots of interesting stuff open. You may notice that some of the file descriptors have no path information. This is likely because they refer to a socket or FIFO (there is a small chance they refer to a file that has since been moved). The pfiles(1) command has been modified to use this information, so you can now see the path with the rest of the goodies:
$ pfiles `pgrep Firebird`
286670: /usr/local/MozillaFirebird/MozillaFirebird-bin
Current rlimit: 512 file descriptors
0: S_IFCHR mode:0666 dev:200,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_RDONLY|O_LARGEFILE
/devices/pseudo/mm@0:null
1: S_IFREG mode:0644 dev:210,1281 ino:346 uid:138660 gid:10 size:4164
O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
/home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0
2: S_IFREG mode:0644 dev:210,1281 ino:346 uid:138660 gid:10 size:4164
O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
/home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0
3: S_IFIFO mode:0666 dev:209,0 ino:9 uid:0 gid:1 size:0
O_RDWR|O_NONBLOCK FD_CLOEXEC
4: S_IFDOOR mode:0444 dev:209,0 ino:52 uid:0 gid:0 size:0
O_RDONLY|O_LARGEFILE FD_CLOEXEC door to nscd[100253]
/var/run/name_service_door
5: S_IFREG mode:0644 dev:210,1281 ino:744 uid:138660 gid:10 size:747398
O_RDONLY|O_LARGEFILE
/home/eschrock/.phoenix/default/7pkwqbju.slt/XUL.mfasl
6: S_IFIFO mode:0000 dev:203,0 ino:119094 uid:138660 gid:10 size:0
O_RDWR|O_NONBLOCK
7: S_IFIFO mode:0000 dev:203,0 ino:119094 uid:138660 gid:10 size:0
O_RDWR|O_NONBLOCK
[ ... output elided ... ]
$
This should be enough to get most savvy sysadmins drooling. But wait, there's more!. This feature allowed the new DTrace io provider (integrated into build 60, aka Beta 5, aka SX 07/04) to get path name information for arbitrary files in the system. This allows you to do neat stuff like:
# cat iohog.d
#!/usr/sbin/dtrace -s
io:::start
{
@[execname, args[2]->fi_pathname] = sum(args[0]->b_bcount);
}
# ./iohog.d
^C
sched /home/eschrock/.dt/sessionlogs/machine_DISPLAY=:0 4096
xlp /var/adm/utmpx 4096
fsflush /export/iso/solaris_4.iso 73728
sched <none> 82432
cp <none> 114688
fsflush <none> 177152
cp /export/iso/solaris_4.iso 238936064
cp /export/iso/solaris_1.iso 239910912
#
For years we've had the iostat(1M) utility. It's great to know that someone is hammering away on sd0, but that's not really the question you want answered. What you really want to know is who is hammering away on your disks. With the DTrace io provider, we've taken it one step further by giving you the means to answer why someone is hammering away on your disks. All of a sudden one of the most opaque problems is now completely transparent. So head on over and check it out (while the io provider is not available in Solaris Express quite yet, the documentation for it is available on the DTrace page).
1 For the curious: Solaris implements a Virtual File System (VFS) layer, which includes the notion of a vnode to represent an abitrary file. The filesystem-dependent part is stored in a format private to the filesystem implementation (think of it in terms of inheritence if it helps). To illustrate with crude ASCII art:
USERLAND KERNEL VFS KERNEL FS
fd ----+----> file_t -----+----> vnode_t ------> inode_t /
| | prnode_t /
fd ----+ | etc
|
fd ---------> file_t -----+
We store a (char *) pointer at the end of the vnode_t when we go to look up the file, and now we have path information for all the open files in the kernel (even those implicitly mapped into process address space, without an associated file_t). There are some subtleties with hard links and moving files around, but it works perfectly 99% of the time, which is all we can hope for in this case.
Before I start looking at some of problems we're addressing in Solaris, I want to step back and examine one of the more fundamental problems I've been seeing in the industry. In order to develop more powerful software quickly, we insert layers of abstraction to distance ourselves from the actual implmentation (let someone else worry about it). There is nothing inherently wrong with this; no one is going to argue that you should write your business critical web service in assembly language instead of Java using J2EE. The problem comes from the disturbing trend that programmers are increasingly less knowledgeable about the layers upon which they build.
Most people can sit down and learn how to program Java or C, given enough time. The difference between an average programmer and a gifted programmer is the ability to truly understand the levels above and below where one works. For the majority of us1, this means understanding two things: our development platform and our customers. While understanding customer needs is a difficult task, a more tragic problem is the failure of programmers to understand their immediate development environment.
If you are a C programmer, it is crucial that you understand how virtual memory works, what exactly happens when you write to a file descriptor, and how threads are implemented. You should understand what a compiler does, how the dynamic linker works, and how the assembly code the compiler generates really works. If you are a Java programmer, you need to understand how garbage collection works, how java bytecodes are interpreted, and what JIT compiling really does. Just because you don't need to know the OS works doesn't mean you shouldn't. Knowing your environment encourages good software practice, as well as making you more effective at solving problems when they do occur.
Unfortunately, everything in the world is not somebody else's fault. As these new layers are added, things tend to become less and less observable. Not to mention that poor documentation can ruin an otherwise great tool. You may understand how a large number of cross calls can be the product of a misbehaving application, but if you can't determine where they're coming from, what's the point? We in the Solaris group develop tools to provide useful layers of abstraction, as well as tools that rip the hood off2 so you can see what's really happening inside.
Before I start examining some of these tools, I just wanted to point out that there is a large human factor involved that is out of our hands. The most powerful tool in the world can be useless in the hands of someone without a basic understanding of their system. Hopefully by providing these tools we can simultaneously expose the inner workings while sparking desire to learn about these inner workings. So take some some time to read a good book once in a while.
1For us kernel developers, the landscape is a little different. We have to work with a multitude of difference hardware (development platforms) in order to provide an effectively limitless number of solutions for our customers. I like to think that kernel engineers are pound-for-pound the best group of programmers around because of this, but maybe that's just my ego talking.
2Scott is in the habit of talking about how we sell cars, not auto parts. Does that mean we also provide the Jaws of Life to save you after you crash your new Enzo?
So begins my first blog post ever.
I have been a Solaris Kernel Engineer for 10 months now after graduating from my alma mater. Since I joined so late in the Solaris 10 development cycle, I have not had the pleasure of working on one of the larger S10 projects such as DTrace, Zones (N1 Grid Containers), FMA (Predictive Self Healing), or ZFS. But this has given me the unique opportunity to attack bits and pieces of Solaris from all directions. In particular, I have spent more than a few lonely nights with mdb, procfs, and the ptools. I've enjoyed growing up in this playground built by Mike Shapiro, Adam Leventhal, Roger Faulkner, and those who came before me. More recently, I have been drafted into service for the AMD64 (opteron) army, selflessly sacrificing my free time for the good of our porting effort.
From here, I will most likely continue to post about Solaris development as well as general software principles. You'll likely see a focus on software observability, debugging, and complexity. This comes with the territory, as you can see from Bryan's blog. It is not a coincidence that we kernel engineers share similar views and goals. It is an essential part of the philosophy that makes Solaris what it is today: a robust, reliable, manageable, serviceable, and observable operating system.
|
|