Reflections on OS integration Eric Schrock's Weblog
Musings about Fishworks, Operating Systems, and the software that runs on them.

Wednesday Jun 30, 2004

So a few posts ago I asked for some suggestions on improving observability in Solaris, specifically with respect to lsof. I thought I'd summarize the responses, which fell into two basic groups:

  1. Socket and process visibility. Something along the lines of lsof -i or netstat -p on Linux.
  2. Per-process mpstat, vmstat, and iostat.

I'll defer the first suggestion for the moment. The second suggestion is straightforward, thanks to the mystical powers of DTrace. As you can see from my previous post, it's simple to aggregate I/O on a per-process basis. Thanks to the vminfo and sysinfo DTrace providers, we can do the same for most any interesting statistic. The problem with traditional kstats [1] is that they present static state after the fact - you cannot tell why or when a counter was incremented. But for every kstat reported by vmstat and mpstat, a DTrace probe exists wherever it's incremented. Throw in some predicates and aggregations, and we're talking instant observability.
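A minimal sketch of the idea, assuming we care about page-ins: the choice of the vminfo pgpgin probe and the column labels are mine for illustration, and any other vminfo or sysinfo probe could be swapped in the same way. For these providers, arg0 is the amount by which the underlying kstat is being incremented, so summing it per process gives a per-process version of the system-wide counter.

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* arg0 is the amount by which the pgpgin kstat is being incremented. */
vminfo:::pgpgin
{
        @pages[pid, execname] = sum(arg0);
}

/* On exit (Ctrl-C), keep the ten largest totals and print them. */
END
{
        trunc(@pages, 10);
        printf("%-6s  %-20s  %s\n", "PID", "COMMAND", "PAGES IN");
        printa("%-6d  %-20s  %@d\n", @pages);
}
```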

I envision two forms of these tools. The first, as suggested in previous comments, would present prstat(1M)-style output, sorted according to the user's choice of statistic. This would be aimed at administrators trying to understand systemic problems. The second form would take a pid and show all the relevant statistics for just that process. This would be aimed at developers trying to understand their application's behavior.
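The second form might look roughly like the sketch below. This is only an illustration under my own assumptions: the pid is passed as the first script argument ($1), and every vminfo and sysinfo probe is enabled at once, which is convenient but not free. The predicate restricts the aggregation to the one process of interest.

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* $1 is the pid given on the command line; count only that process. */
vminfo:::,
sysinfo:::
/pid == $1/
{
        @stats[probeprov, probename] = sum(arg0);
}

/* On exit (Ctrl-C), print the total for every statistic that fired. */
END
{
        printf("%-10s  %-16s  %s\n", "PROVIDER", "STATISTIC", "TOTAL");
        printa("%-10s  %-16s  %@d\n", @stats);
}
```

Saved under a hypothetical name such as procstat.d, it would be run as `dtrace -s procstat.d 1234`, where 1234 is the pid to watch.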

Today, anyone can write D scripts to do this. But there's something to be said for having a canned tool to jump-start analysis. It doesn't have to be too powerful; once you get beyond these basic questions you'll need to write custom D scripts anyway. I'm sure the DTrace team has given this far more thought than I have, but I thought I'd let you know that your comments aren't descending into some kind of black hole. Blogging provides a unique forum for customer conversations, somewhere between a face-to-face meeting (which tends not to scale well) and a newsgroup posting (which lacks organization and personal attention). Many thanks to those in Sun who pushed for this new forum, and those of you out there reading and taking advantage of it.


[1] The statistics used by these tools are part of the kstat(1M) facility. The kernel provides any number of statistics from every different subsystem, which can be extracted through a library interface and processed by user applications.

In my previous blog post, it was mentioned that it would be great to have a prstat-like tool for showing the most I/O hungry processes. As the poster suggested, this is definitely possible with the new io provider for DTrace. This will be available in the next Solaris Express release; see Adam's DTrace schedule for more information.

For fun, I hacked together a quick DTrace script as a proof of concept. Five minutes later, I had the following script:

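#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Print the column headers once and record when the first interval starts. */
BEGIN
{
        printf("%-6s  %-20s  %s\n", "PID", "COMMAND", "BYTES/SEC");
        printf("------  --------------------  ---------\n");
        last = timestamp;
}

/* Sum the size of each I/O request, keyed by the process on CPU at issue. */
io:::start
{
        @io[pid, execname] = sum(args[0]->b_bcount);
}

/* Every 5 seconds, print the top ten processes and reset the totals. */
tick-5sec
{
        trunc(@io, 10);
        printf("\n");
        /* Divide by the elapsed time in whole seconds to get bytes/sec. */
        normalize(@io, (timestamp - last) / 1000000000);
        printa("%-6d  %-20s  %@d\n", @io);
        trunc(@io, 0);
        last = timestamp;
}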

This is truly a rough cut, but it will show the top ten processes issuing I/O, summarized every 5 seconds. Here's a little output right as I kicked off a new build:

# ./iotop.d
PID     COMMAND               BYTES/SEC
------  --------------------  ---------

216693  inetd                 5376
100357  nfsd                  6912
216644  nohup                 7680
216689  make                  8192
0       sched                 14336
216644  nightly               20480
216710  sh                    20992
216651  newtask               46336
216689  make.bin              141824
216710  java                  1107712

216746  sh                    7168
216793  make.bin              8192
216775  make.bin              9625
216781  make.bin              13926
216767  make.bin              14745
216720  ld                    32768
216713  nightly               77004
216768  make.bin              78438
216740  make.bin              174899
216796  make.bin              193740

216893  make.bin              9011
216767  make.bin              9011
216829  make.bin              9830
216872  make.bin              9830
216841  make.bin              19046
216851  make.bin              21504
216907  make.bin              31129
216805  make.bin              54476
216844  make.bin              81920
216796  make.bin              117350
^C

#

In this case it was no surprise that 'make.bin' is doing the most I/O, but things could be more interesting on a larger machine.

While D scripts can answer questions quickly and effectively, you run out of rope pretty quickly when trying to write a general purpose utility. We will be looking into writing a new utility based on the C library interfaces [1], where we can support multiple options and output formats, and massage the data into a more concise view. DTrace is still relatively new; in many ways it's like living your entire life inside a dome before finding the door to the outside. What would your first stop be? We're not sure ourselves [2], so keep the suggestions coming!


[1] The lockstat(1M) command is written using the DTrace lockstat provider, for example.

[2] One thing that's for sure is Adam's work on userland static tracing and the plockstat utility. Stay tuned...