inside the sausage factory Adam Leventhal's Weblog
Adam Leventhal, Fishworks engineer

Wednesday Nov 19, 2008

The Sun Storage 7410 is our expandable storage appliance that can be hooked up to anywhere from one and twelve JBODs with 24 1TB disks. With all those disks we provide the several different options for how to arrange them into your storage pool: double-parity RAID-Z, wide-strip double-parity RAID-Z, mirror, striped, and single-parity RAID-Z with narrow stripes. Each of these options has a different mix of availability, performance, and capacity that are described both in the UI and in the installation documentation. With the wide array of supported configurations, it can be hard to know how much usable space each will support.

To address this, I wrote a python script that presents a hypothetical hardware configuration to an appliance and reports back the available options. We use the logic on the appliance itself to ensure that the results are completely accurate as the same algorithms would be applied as when then the physical pallet of hardware shows up. This, of course, requires you to have an appliance available to query — fortunately, you can run a virtual instance of the appliance on your laptop.

You can download the sizecalc.py here; you'll need python installed on the system where you run it. Note that the script uses XML-RPC to interact with the appliance, and consequently it relies on unstable interfaces that are subject to change. Others are welcome to interact with the appliance at the XML-RPC layer, but note that it's unstable and unsupported. If you're interested in scripting the appliance, take a look at Bryan's recent post. Feel free to post comments here if you have questions, but there's no support for the script, implied, explicit, unofficial or otherwise.

Running the script by itself produces a usage help message:

$ ./sizecalc.py
usage: ./sizecalc.py [ -h <half jbod count> ] <appliance name or address>
    <root password> <jbod count>
Remember that you need a Sun Storage 7000 appliance (even a virtual one) to execute the capacity calculation. In this case, I'll specify a physical appliance running in our lab, and I'll start with a single JBOD (note that I've redacted the root password, but of course you'll need to type in the actual root password for your appliance):
$ ./sizecalc.py catfish ***** 1
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      11       2            22                18
raidz2 wide    False      23       1            23                21
mirror         False       2       2            22                11
stripe         False       0       0            24                24
raidz1         False       4       4            20                15
Note that with only one JBOD no configurations support NSPF (No Single Point of Failure) since that one JBOD is always a single point of failure. If we go up to three JBODs, we'll see that we have a few more options:
$ ./sizecalc.py catfish ***** 3
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      13       7            65                55
raidz2          True       6       6            66                44
raidz2 wide    False      34       4            68                64
raidz2 wide     True       6       6            66                44
mirror         False       2       4            68                34
mirror          True       2       4            68                34
stripe         False       0       0            72                72
raidz1         False       4       4            68                51
In this case we have to give up a bunch of capacity in order to attain NSPF. Now let's look at the largest configuration we support today with twelve JBODs:
$ ./sizecalc.py catfish ***** 12
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      14       8           280               240
raidz2          True      14       8           280               240
raidz2 wide    False      47       6           282               270
raidz2 wide     True      20       8           280               252
mirror         False       2       4           284               142
mirror          True       2       4           284               142
stripe         False       0       0           288               288
raidz1         False       4       4           284               213
raidz1          True       4       4           284               213

The size calculator also allows you to model a system with Logzilla devices, write-optimized flash devices that form a key part of the Hybrid Storage Pool. After you specify the number of JBODs in the configuration, you can include a list of how many Logzillas are in each JBOD. For example, the following invocation models twelve JBODs with four Logzillas in the first 2 JBODs:

$ ./sizecalc.py catfish ***** 12 4 4
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      13       7           273               231
raidz2          True      13       7           273               231
raidz2 wide    False      55       5           275               265
raidz2 wide     True      23       4           276               252
mirror         False       2       4           276               138
mirror          True       2       4           276               138
stripe         False       0       0           280               280
raidz1         False       4       4           276               207
raidz1          True       4       4           276               207

A very common area of confusion has been how to size Sun Storage 7410 systems, and the relationship between the physical storage and the delivered capacity. I hope that this little tool will help to answer those questions. A side benefit should be still more interest in the virtual version of the appliance — a subject I've been meaning to post about so stay tuned.

Update December 14, 2008: A couple of folks requested that the script allow for modeling half-JBOD allocations because the 7410 allows you to split JBODs between heads in a cluster. To accommodate this, I've added a -h option that takes as its parameter the number of half JBODs. For example:

$ ./sizecalc.py -h 12 192.168.18.134 ***** 0
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      14       4           140               120
raidz2          True      14       4           140               120
raidz2 wide    False      35       4           140               132
raidz2 wide     True      20       4           140               126
mirror         False       2       4           140                70
mirror          True       2       4           140                70
stripe         False       0       0           144               144
raidz1         False       4       4           140               105
raidz1          True       4       4           140               105

Update February 4, 2009: Ryan Matthews and I collaborated on a new version of the size calculator that now lists the raw space available in TB (decimal as quoted by drive manufacturers for example) as well as the usable space in TiB (binary as reported by many system tools). The latter also takes account of the sliver (1/64th) reserved by ZFS:

$ ./sizecalc.py 192.168.18.134 ***** 12
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
raidz2       False     14      8          280         240.00         214.87
raidz2        True     14      8          280         240.00         214.87
raidz2 wide  False     47      6          282         270.00         241.73
raidz2 wide   True     20      8          280         252.00         225.61
mirror       False      2      4          284         142.00         127.13
mirror        True      2      4          284         142.00         127.13
stripe       False      0      0          288         288.00         257.84
raidz1       False      4      4          284         213.00         190.70
raidz1        True      4      4          284         213.00         190.70

Update June 17, 2009: Ryan Matthews with help from has again revised the size calculator to model both adding expansion JBODs and to account for the now expandable Sun Storage 7210. Take a look at Ryan's post for usage information. Here's an example of the output:

$ ./sizecalc.py 172.16.131.131 *** 1 h1 add 1 h add 1 
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      5           42          21.00          18.80
raidz1       False      4     11           36          27.00          24.17
raidz2       False  10-11      4           43          35.00          31.33
raidz2 wide  False  10-23      3           44          38.00          34.02
stripe       False      0      0           47          47.00          42.08

Update September 16, 2009: Ryan Matthews updated the size calculator for the 2009.Q3 release. The update includes the new triple-parity RAID wide stripe and three-way mirror profiles:

$ ./sizecalc.py boga *** 4
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      4           92          46.00          41.18
mirror        True      2      4           92          46.00          41.18
mirror3      False      3      6           90          30.00          26.86
mirror3       True      3      6           90          30.00          26.86
raidz1       False      4      4           92          69.00          61.77
raidz1        True      4      4           92          69.00          61.77
raidz2       False     13      5           91          77.00          68.94
raidz2        True      8      8           88          66.00          59.09
raidz2 wide  False     46      4           92          88.00          78.78
raidz2 wide   True      8      8           88          66.00          59.09
raidz3 wide  False     46      4           92          86.00          76.99
raidz3 wide   True     11      8           88          64.00          57.30
stripe       False      0      0           96          96.00          85.95

** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.

Monday Nov 10, 2008

The Sun Storage 7000 Series launches today, and with it Sun has the world's first complete product that seamlessly adds flash into the storage hierarchy in what we call the Hybrid Storage Pool. The HSP represents a departure from convention, and a new way of thinking designing a storage system. I've written before about the principles of the HSP, but now that it has been formally announced I can focus on the specifics of the Sun Storage 7000 Series and how it implements the HSP.

Sun Storage 7410: The Cadillac of HSPs

The best example of the HSP in the 7000 Series is the 7410. This product combines a head unit (or two for high availability) with as many as 12 J4400 JBODs. By itself, this is a pretty vanilla box: big, economical, 7200 RPM drives don't win any races, and the maximum of 128GB of DRAM is certainly a lot, but some workloads will be too big to fit in that cache. With flash, however, this box turns into quite the speed demon.

Logzilla

The write performance of 7200 RPM drive isn't terrific. The appalling thing is that the next best solution — 15K RPM drives — aren't really that much better: a factor of two or three at best. To blow the doors off, the Sun Storage 7410 allows up to four write-optimized flash drives per JBOD each of which is capable of handling 10,000 writes per second. We call this flash device Logzilla.

Logzilla is a flash-based SSD that contains a pretty big DRAM cache backed by a supercapacitor so that the cache can effectively be treated as nonvolatile. We use Logzilla as a ZFS intent log device so that synchronous writes are directed to Logzilla and clients incur only that 100μs latency. This may sound a lot like how NVRAM is used to accelerate storage devices, and it is, but there are some important advantages of Logzilla. The first is capacity: most NVRAM maxes out at 4GB. That might seem like enough, but I've talked to enough customers to realize that it really isn't and that performance cliff is an awful long way down. Logzilla is an 18GB device which is big enough to hold the necessary data while ZFS syncs it out to disk even running full tilt. The second problem with NVRAM scalability: once you've stretched your NVRAM to its limit there's not much you can do. If your system supports it (and most don't) you can add another PCI card, but those slots tend to be valuable resources for NICs and HBAs, and even then there's necessarily a pretty small number to which you could conceivably scale. Logzilla is an SSD sitting in a SAS JBOD so it's easy to plug more devices into ZFS and use them as a growing pool of intent log devices.

Readzilla

The standard practice in storage systems is to use the available DRAM as a read cache for data that is likely to be frequently accessed, and the 7000 Series does the same. In fact, it can do quite a better job of it because, unlike most storage systems which stop at 64GB of cache, the 7410 has up to 256GB of DRAM to use as a read cache. As I mentioned before, that's still not going to be enough to cache the entire working set for a lot of use cases. This is where we at Fishworks came up with the innovative solution of using flash as a massive read cache. The 7410 can accomodate up to six 100GB, read-optimized, flash SSDs; accordingly, we call this device Readzilla.

With Readzilla, a maximum 7410 configuration can have 256GB of DRAM providing sub-μs latency to cached data and 600GB worth of Readzilla servicing read requests in around 50-100μs. Forgive me for stating the obvious: that's 856GB of cache &mdash. That may not suffice to cache all workloads, but it's certainly getting there. As with Logzilla, a wonderful property of Readzilla is its scalability. You can change the number of Readzilla devices to match your workload. Further, you can choose the right combination of DRAM and Readzilla to provide the requisite service times with the appopriate cost and power use. Readzilla is cheaper and less power-hungry than DRAM so applications that don't need the blazing speed of DRAM can prefer the more economical flash cache. It's a flexible solution that can be adapted to specific needs.

Putting It All Together

We started with DRAM and 7200 RPM disks, and by adding Logzilla and Readzilla the Sun Storage 7410 also has great write and read IOPS. Further, you can design the specific system you need with just the right balance of write IOPS, read IOPS, throughput, capacity, power-use, and cost. Once you have a system, the Hybrid Storage Pool lets you solve problems with targeted solutions. Need capacity? Add disk. Out of read IOPS? Toss in another Readzilla or two. Write bogging down? Another Logzilla will net you another 10,000 write IOPS. In the old model, of course, all problems were simple because the solution was always the same: buy more fast drives. The HSP in the 7410 lets you address the specific problem you're having without paying for a solution to three other problems that you don't have.

Of course, this means that administrators need to better understand the performance limiters, and fortunately the Sun Storage 7000 Series has a great answer to that in Analytics. Pop over to Bryan's blog where he talks all about that feature of the Fishworks software stack and how to use it to find performance problems on the 7000 Series. If you want to read more details about Hybrid Storage Pools and how exactly all this works, take a look my article on the subject in CACM, as well as this post about the L2ARC (the magic behind using Readzilla) and a nice marketing pitch on HSPs.

Monday Oct 20, 2008

I've written about Hybrid Storage Pools (HSPs) here several times as well as in an article that appeared in the ACM's Queue and CACM publications. Now the folks in Sun marketing on the occasion of our joint SSD announcement with Intel have distilled that down to a four page glossy, and they've done a terrific job. I suggest taking a look.

The concept behind the HSP is a simple one: combine disk, flash, and DRAM into a single coherent and seamless data store that makes optimal use of each component and its economic niche. The mechanics of how this happens required innovation from the Fishworks and ZFS groups to integrate flash as a new tier in storage hierarchy for use in our forthcoming line of storage products. The impact of the HSP is pure economics: it delivers superior capacity and performance for a lower cost and smaller power footprint. That's the marketing pitch; if you want to wade into the details, check out the links above.

Saturday Oct 04, 2008

Back in January, I ranted about Apple's ham-handed breakage in their DTrace port. After some injured feelings and teary embraces, Apple cleaned things up a bit, but some nagging issues remained as I wrote:

For the Apple folks: I'd argue that revealing the name of otherwise untraceable processes is no more transparent than what Activity Monitor provides — could I have that please?

It would be very un-Apple to — you know — communicate future development plans, but in 10.5.5, DTrace has seen another improvement. Previously when using DTrace to observe the system at large, iTunes and other paranoid apps would be hidden; now they're showing up on the radar:

# dtrace -n 'profile-1999{ @[execname] = count(); }'
dtrace: description 'profile-1999' matched 1 probe
^C

  loginwindow                                                       2
  fseventsd                                                         3
  kdcmond                                                           5
  socketfilterfw                                                    5
  distnoted                                                         7
  mds                                                               8
  dtrace                                                           12
  punchin-helper                                                   12
  Dock                                                             20
  Mail                                                             25
  Terminal                                                         26
  SystemUIServer                                                   28
  Finder                                                           42
  Activity Monito                                                  49
  pmTool                                                           67
  WindowServer                                                    184
  iTunes                                                         1482
  kernel_task                                                    4030

And of course, you can use generally available probes to observe only those touchy apps with a predicate:

# dtrace -n 'syscall:::entry/execname == "iTunes"/{ @[probefunc] = count(); }'
dtrace: description 'syscall:::entry' matched 427 probes
^C

...
  pwrite                                                           13
  read                                                             13
  stat64                                                           13
  open_nocancel                                                    14
  getuid                                                           22
  getdirentries                                                    26
  pread                                                            29
  stat                                                             32
  gettimeofday                                                     34
  open                                                             36
  close                                                            37
  geteuid                                                          65
  getattrlist                                                     199
  munmap                                                          328
  mmap                                                            338

Predictably, the details of iTunes are still obscured:

# dtrace -n pid42896:::entry
...
dtrace: error on enabled probe ID 225607 (ID 69364: pid42896:libSystem.B.dylib:pthread_mutex_unlock:entry):
  invalid user access in action #1
dtrace: error on enabled probe ID 225546 (ID 69425: pid42896:libSystem.B.dylib:spin_lock:entry):
  invalid user access in action #1
dtrace: 1005103 drops on CPU 1

... which is fine by me; I've got code of my own I should be investigating. While I'm loath to point it out, an astute reader and savvy DTrace user will note that Apple may have left the door open an inch wider than they had anticipated. Anyone care to post some D code that makes use of that inch? I'll post an update as a comment in a week or two if no one sees it.

Update: There were some good ideas in the comments. Here's the start of a script that can let you follow the flow of control of a thread in an "untraceable" process:

#!/usr/sbin/dtrace -s

#pragma D option quiet

pid$target:libSystem.B.dylib::entry,
pid$target:libSystem.B.dylib::return
{
	trace("this program is already traceable\n");
	exit(0);
}

ERROR
/self->level < 0 || self->level > 40/
{
	self->level = 0;
}

ERROR
{
	this->p = ((dtrace_state_t *)arg0)->dts_ecbs[arg1 - 1]->dte_probe;
	this->mod = this->p->dtpr_mod;
	this->func = this->p->dtpr_func;
	this->entry = ("entry" == stringof(this->p->dtpr_name));
}

ERROR
/this->entry/
{
	printf("%*s-> %s:%s\n", self->level * 2, "",
	    stringof(this->mod), stringof(this->func));
	self->level++;
}

ERROR
/!this->entry/
{
	self->level--;
	printf("%*s<- %s:%s\n", self->level * 2, "",
	    stringof(this->mod), stringof(this->func));
}

Monday Aug 11, 2008

The latest edition of Communications of the ACM includes a panel discussion between "seven world-class storage experts". The primary topic was flash memory and how it impacts the world of storage. The most interesting comment came from Steve Kleiman, Senior Vice President and Chief Scientist at Netapp:

My theory is that whether it’s flash, phase-change memory, or something else, there is a new place in the memory hierarchy. There was a big blank space for decades that is now filled and a lot of things that need to be rethought. There are many implications to this, and we’re just beginning to see the tip of the iceberg.

The statement itself isn't earth-shattering — it would be immodest to say so as I reached the same conclusion in my own CACM article last month — with price trends and performance characteristics, it's obvious that flash has become relevant; those running the numbers as Steve Kleiman has will come to the same conclusion about how it might integrate into a system. What's interesting is that the person at Netapp "responsible for setting future technology directions for the company" has thrown his weight behind the idea. I look forward to seeing how this is manifested in Netapp's future offerings. Will it look something like the Hybrid Storage Pool (HSP) that we've developed with ZFS? Or might it integrate flash more explicitly into the virtual memory system in ONTAP, Netapp's embedded operating system? Soon enough we should start seeing products in the market that validate our expectations for flash and its impact to enterprise storage.

Wednesday Jul 23, 2008

I've written recently about the hybrid storage pool (HSP), using ZFS to augment the conventional storage stack with flash memory. The resulting system improve performance, cost, density, capacity, power dissipation — pretty much evey axis of importance.

An important component of the HSP is something called the second level adaptive replacement cache (L2ARC). This allows ZFS to use flash as a caching tier that falls between RAM and disk in the storage hierarchy, and permits huge working sets to be serviced with latencies under 100us. My colleague, Brendan Gregg, implemented the L2ARC, and has written a great summary of how the L2ARC works and some concrete results. Using the L2ARC, Brendan was able to achieve a 730% performance improvement over 7200RPM drives. Compare that with 15K RPM drives which will improve performance by at most 100-200%, while costing more, using more power, and delivering less total capacity than Brendan's configuration. Score one for the hybrid storage pool!

Tuesday Jul 01, 2008

As I mentioned in my previous post, I wrote an article about the hybrid storage pool (HSP); that article appears in the recently released July issue of Communications of the ACM. You can find it here. In the article, I talk about a novel way of augmenting the traditional storage stack with flash memory as a new level in the hierarchy between DRAM and disk, as well as the ways in which we've adapted ZFS and optimized it for use with flash.

So what's the impact of the HSP? Very simply, the article demonstrates that, considering the axes of cost, throughput, capacity, IOPS and power-efficiency, HSPs can match and exceed what's possible with either drives or flash alone. Further, an HSP can be built or modified to address specific goals independently. For example, it's common to use 15K RPM drives to get high IOPS; unfortunately, they're expensive, power-hungry, and offer only a modest improvement. It's possible to build an HSP that can match the necessary IOPS count at a much lower cost both in terms of the initial investment and the power and cooling costs. As another example, people are starting to consider all-flash solutions to get very high IOPS with low power consumption. Using flash as primary storage means that some capacity will be lost to redundancy. An HSP can provide the same IOPS, but use conventional disks to provide redundancy yielding a significantly lower cost.

My hope — perhaps risibly naive — is that HSPs will mean the eventual death of the 15K RPM drive. If it also puts to bed the notion of flash as general purpose mass storage, well, I'd be happy to see that as well.

Wednesday Jun 11, 2008

Jonathan had a terrific post yesterday that does an excellent job of presenting Sun's strategy for flash for the next few years. With my colleagues at Fishworks, an advanced product development team, I've spent more than a year working with flash and figuring out ways to integrate flash into ZFS, the storage hierarchy, and our future storage products — a fact to which John Fowler, EVP of storage, alluded recently. Flash opens surprising new vistas; it's exciting to see Sun leading in this field, and it's frankly exciting to be part of it.

Jonathan's post sketches out some of the basic ideas on how we're going to be integrating flash into ZFS to create what we call hybrid storage pools that combine flash with conventional (cheap) disks to create an aggregate that's cost-effective, power-efficient, and high-performing by capitalizing on the strengths of the component technologies (not unlike a hybrid car). We presented some early results at IDF which has already been getting a bit of buzz. Next month I have an article in Communications of the ACM that provides many more details on what exactly a hybrid pool is and how exactly it works. I've pulled out some excerpts from that article and included them below as a teaser and will be sure to post an update when the full article is available in print and online.

While its prospects are tantalizing, the challenge is to find uses for flash that strike the right balance of cost and performance. Flash should be viewed not as a replacement for existing storage, but rather as a means to enhance it. Conventional storage systems mix dynamic memory (DRAM) and hard drives; flash is interesting because it falls in a sweet spot between those two components for both cost and performance in that flash is significantly cheaper and denser than DRAM and also significantly faster than disk. Flash accordingly can augment the system to form a new tier in the storage hierarchy – perhaps the most significant new tier since the introduction of the disk drive with RAMAC in 1956.

...

A brute force solution to improve latency is to simply spin the platters faster to reduce rotational latency, using 15k RPM drives rather than 10k RPM or 7,200 RPM drives. This will improve both read and write latency, but only by a factor of two or so. ...

...

ZFS provides for the use of a separate intent-log device, a slog in ZFS jargon, to which synchronous writes can be quickly written and acknowledged to the client before the data is written to the storage pool. The slog is used only for small transactions while large transactions use the main storage pool – it's tough to beat the raw throughput of large numbers of disks. The flash-based log device would be ideally suited for a ZFS slog. ... Using such a device with ZFS in a test system, latencies measure in the range of 80-100µs which approaches the performance of NVRAM while having many other benefits. ...

...

By combining the use of flash as an intent-log to reduce write latency with flash as a cache to reduce read latency, we can create a system that performs far better and consumes less power than other system of similar cost. It's now possible to construct systems with a precise mix of write-optimized flash, flash for caching, DRAM, and cheap disks designed specifically to achieve the right balance of cost and performance for any given workload with data automatically handled by the appropriate level of the hierarchy. ... Most generally, this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management.

Updated July, 1: I've posted the link to the article in my subsequent blog post.

Saturday Jun 07, 2008

Back in January, I posted about a problem with Apple's port of DTrace to Mac OS X. The heart of the issue is that their port would silently drop data such that certain experiments would be quietly invalid. Unfortunately, most reactions seized on a headline paraphrasing a line of the post — albeit with the critical negation omitted (the subject and language were, perhaps, too baroque to expect the press to read every excruciating word). The good news is that Apple has (quietly) fixed the problem in Mac OS X 10.5.3.

One issue was that timer based probes wouldn't fire if certain applications were actively executing (e.g. iTunes). This was evident both by counting periodic probe firings, and by the absence of certain applications when profiling. Apple chose to solve this problem by allowing the probes to fire while denying any inspection of untraceable processes (and generating a verbose error in that case). This script which should count 1000 firings per virtual CPU gave sporadic results on earlier revisions of Mac OS X 10.5:

profile-1000
{
	@ = count();
}

tick-1s
{
	printa(@);
	clear(@);
}

On 10.5.3, the output is exactly what one would expect on a 2-core CPU (1,000 executions per core):

  1  22697                         :tick-1s 
             2000

  1  22697                         :tick-1s 
             2000

On previous revisions, profiling to see what applications were spending the most time on CPU would silently omit certain applications. Now, while we can't actually peer into those apps, we can infer the presence of stealthy apps when we encounter an error:

profile-199
{
	@[execname] = count();
}

ERROR
{
	@["=stealth app="] = count();
}

Running this DTrace script will generate a lot of errors as we try to evaluate the execname variable for secret applications, but at the end we'll end up with a table like this:

  Adium                                                             1
  GrowlHelperApp                                                    1
  iCal                                                              1
  kdcmond                                                           1
  loginwindow                                                       1
  Mail                                                              2
  Activity Monito                                                   3
  ntpd                                                              3
  pmTool                                                            6
  mlb-nexdef-auto                                                  12
  Terminal                                                         14
  =stealth app=                                                    29
  WindowServer                                                     34
  kernel_task                                                     307
  Safari                                                          571

A big thank you to Apple for making progress on this issue; the situation is now much improved and considerably more palatable. That said, there are a couple of problems. The first is squarely the fault of team DTrace: we should probably have a mode where errors aren't printed particularly if the script is already handling them explicitly using an ERROR probe as in the script above. For the Apple folks: I'd argue that revealing the name of otherwise untraceable processes is no more transparent than what Activity Monitor provides — could I have that please? Also, I'm not sure if this has always been true, but the ustack() action doesn't seem to work from the profile action so simple profiling scripts like this one produce a bunch of errors and no output:

profile-199
/execname == "Safari"/
{
	@[ustack()] = count();
}

But to reiterate: thank you thank you thank you, Steve, James, Tom, and the rest of the DTrace folks at Apple. It's great to see these issues being addressed. The whole DTrace community appreciates it.

Monday May 05, 2008

This originally was going to be a post-mortem on dtrace.conf, but so much time has passed, that I doubt it qualifies anymore. Back in March, we held the first ever DTrace (un)conference, and I hope I speak for all involved when I declare it a terrific success. And our t-shirts (logo pictured) were, frankly, bomb. Here are some fairly random impressions from the day:

Notes on the demographics at dtrace.conf: Macs were the most prevalent laptops by quite a wide margin, and a ton of demos were done under VMware for the Mac. There were a handful of dvorak users who far outnumbered the Esperanto speakers (there were none) despite apparently similarly rationales. There were, by a wide margin, more live demonstrations that I'd seen during a day of technical talks; there were probably fewer individual slides than demos -- exactly what we had in mind.

My favorite session brought the authors of the three DTrace ports to the front of the room to talk about porting, and answer questions (mostly from the DTrace team). I was excited that they agreed to work together on a wiki and on a DTrace porting project. Both would be great for new ports and for building a repository that could integrate all the ports into a single repository. I just have to see if I can get them to follow through now several weeks removed from the DTrace love-in...

Also particularly interesting were a demonstration of a DTrace-enabled Adobe Air prototype and the very clever mechanism behind the Java group's plan for native Java static probes (JSDT). Essentially, they're using the same technique as normal USDT, but dynamically generating the tracing description structures and sending them down to the kernel (slick).

The most interesting discussion resulted from Keith's presentation of vprobes -- a DTrace... um... inspired facility in VMware. While it is necessary to place a unified tracing mechanism at the lowest level of software abstraction (in DTrace's case, the kernel), it may also make sense to embed collaborating tracing frameworks at other levels of the stack. For example, the JVM could include a micro-DTrace which communicated with DTrace in the kernel as needed. This would both improve enabled performance (not a primary focus of DTrace), and allow for better domain-specific instrumentation and expression. I'll be interested to see how vprobes executes on this idea.

Requests from the DTrace community:

  • more providers ala the recent nfs and proposed ip providers
  • consistency between providers (kudos to those sending their providers to the DTrace discussion list for review)
  • better compatibility with the ports -- several people observed that while they love the port to Leopard, Apple's spurious exclusion of the -G option created tricky conflicts

Ben was kind enough to video the entire day. We should have the footage publicly available in about a week. Thanks to all who participated; several recent projects have already gotten me excited for dtrace.conf(09).

Thursday Apr 10, 2008

It was a good run, but Jarod and I didn't make the cut for JavaOne this year...

2005

In 2005, Jarod came up with what he described as a jacked up way to use DTrace to get inside Java. This became the basis of the Java provider (first dvm for the 1.4.2 and 1.5 JVMs and now the hotspot provider for Java 6). That year, I got to stand up on stage at the keynote with John Loiacono and present DTrace for Java for the first time (to 10,000 people -- I was nervous). John was then the EVP of software at Sun. Shortly after that, he parlayed our keynote success into a sweet gig at Adobe (I was considered for the job, but ultimately rejected, they said, because their door frames couldn't accommodate my fro -- legal action is pending).

That year we also started the DTrace challenge. The premise was that if we chained up Jarod in the exhibition hall, developers could bring him their applications and he could use DTrace to find a performance win -- or he'd fork over a free iPod. In three years Jarod has given out one iPod and that one deserves a Bondsian asterisk.

After the excitement of the keynote, and the frenetic pace of the exhibition hall (and a haircut), Jarod and I anticipated at least fair interest in our talk, but we expected the numbers to be down a bit because we were presenting in the afternoon on the last day of the conference. We got to the room 15 minutes early to set up, skirting what we thought must have been the line for lunch, or free beer, or something, but turned out to be the line for our talk. Damn. It turns out that in addition to the 1,000 in the room, there was an overflow room with another 500-1,000 people. That first DTrace for Java talk had only the most basic features like tracing method entry and return, memory allocation, and Java stack backtraces -- but we already knew we were off to a good start.

2006

No keynote, but the DTrace challenge was on again and our talk reprised its primo slot on the last day of the conference after lunch (yes, that's sarcasm). That year the Java group took the step of including DTrace support in the JVM itself. It was also possible to dynamically turn instrumentation of the JVM off and on as opposed to the start-time option of the year before. In addition to our talk, there was a DTrace hands-on lab that was quite popular and got people some DTrace experience after watching what it can do in the hands of someone like Jarod.

2007

The DTrace talk in 2007 (again, last day of the conference after lunch) was actually one of my favorite demos I've given because I had never seen the technology we were presenting before. Shortly before JavaOne started, Lev Serebryakov from the Java group had built a way of embedding static probes in a Java program. While this isn't required to trace Java code, it does mean that developers can expose the higher level semantics of their programs to users and developers through DTrace. Jarod hacked up an example in his hotel room about 20 minutes before we presented, and amazingly it all went off without a hitch. How money is that?

JSDT -- as the Java Statically Defined Tracing is called -- is in development for the next version of the JVM, and is the next step for DTrace support of dynamic languages. Java was the first dynamic language that we first considered for use with DTrace, and it's quite a tough environment to support due to the incredible sophistication of the JVM. That support has lead the way for other dynamic languages such as Ruby, Perl, and Python which all now have built-in DTrace providers.

2008

For DTrace and Java, this is not the end. It is not even the beginning of the end. Jarod and I are out, but Jon, Simon, Angelo, Raghavan, Amit, and others are in. At JavaOne 2008 next month there will be a talk, a BOF, and a hands-on lab about DTrace for Java and it's not even all Java: there's some php and JavaScript mixed in and both also have their own DTrace providers. I've enjoyed speaking at JavaOne these past three years, and while it's good to pass the torch, I'll miss doing it again this year. If I have the time, and can get past security I'll try to sneak into Jon and Simon's talk -- though it will be a departure from tradition for a DTrace talk to fall on a day other than the last.

Monday Apr 07, 2008

I was having a conversation with an OpenBSD user and developer the other day, and he mentioned some ongoing work in the community to consolidate support for RAID controllers. The problem, he was saying, was that each controller had a different administrative model and utility -- but all I could think was that the real problem was the presence of a RAID controller in the first place! As far as I'm concerned, ZFS and RAID-Z have obviated the need for hardware RAID controllers.

ZFS users seem to love RAID-Z, but a frustratingly frequent request is to be able to expand the width of a RAID-Z stripe. While the ZFS community may care about solving this problem, it's not the highest priority for Sun's customers and, therefore, for the ZFS team. It's common for a home user to want to increase his total storage capacity by a disk or two at a time, but enterprise customers typically want to grow by multiple terabytes at once so adding on a new RAID-Z stripe isn't an issue. When the request has come up on the ZFS discussion list, we have, perhaps unhelpfully, pointed out that the code is all open source and ready for that contribution. Partly, it's because we don't have time to do it ourselves, but also because it's a tricky problem and we weren't sure how to solve it.

Jeff Bonwick did a great job explaining how RAID-Z works, so I won't go into it too much here, but the structure of RAID-Z makes it a bit trickier to expand than other RAID implementations. On a typical RAID with N+M disks, N data sectors will be written with M parity sectors. Those N data sectors may contain unrelated data so adding modifying data on just one disk involves reading the data off that disk and updating both those data and the parity data. Expanding a RAID stripe in such a scheme is as simple as adding a new disk and updating the parity (if necessary). With RAID-Z, blocks are never rewritten in place, and there may be multiple logical RAID stripes (and multiple parity sectors) in a given row; we therefore can't expand the stripe nearly as easily.

A couple of weeks ago, I had lunch with Matt Ahrens to come up with a mechanism for expanding RAID-Z stripes -- we were both tired of having to deflect reasonable requests from users -- and, lo and behold, we figured out a viable technique that shouldn't be very tricky to implement. While Sun still has no plans to allocate resources to the problem, this roadmap should lend credence to the suggestion that someone in the community might work on the problem.

The rest of this post will discuss the implementation of expandable RAID-Z; it's not intended for casual users of ZFS, and there are no alchemic secrets buried in the details. It would probably be useful to familiarize yourself with the basic structure of ZFS, space maps (totally cool by the way), and the code for RAID-Z.

Dynamic Geometry

ZFS uses vdevs -- virtual devices -- to store data. A vdev may correspond to a disk or a file, or it may be an aggregate such as a mirror or RAID-Z. Currently the RAID-Z vdev determines the stripe width from the number of child vdevs. To allow for RAID-Z expansion, the geometry would need to be a more dynamic property. The storage pool code that uses the vdev would need to determine the geometry for the current block and then pass that as a parameter to the various vdev functions.

There are two ways to record the geometry. The simplest is to use the GRID bits (an 8 bit field) in the DVA (Device Virtual Address) which have already been set aside, but are currently unused. In this case, the vdev would need to have a new callback to set the contents of the GRID bits, and then a parameter to several of its other functions to pass in the GRID bits to indicate the geometry of the vdev when the block was written. An alternative approach suggested by Jeff and Bill Moore is something they call time-dependent geometry. The basic idea is that we store a record each time the geometry of a vdev is modified and then use the creation time for a block to infer the geometry to pass to the vdev. This has the advantage of conserving precious bits in the fixed-width DVA (though at 128 bits its still quite big), but it is quite a bit more complex since it would require essentially new metadata hanging off each RAID-Z vdev.

Metaslab Folding

When the user requests a RAID-Z vdev be expanded (via an existing or new zpool(1M) command-line option) we'll apply a new fold operation to the space map for each metaslab. This transformation will take into account the space we're about to add with the new devices. Each range [a, b] under a fold from width n to width m will become

[ m * (a / n) + (a % n), m * (b / n) + b % n ]

The alternative would have been to account for m - n free blocks at the end of every stripe, but that would have been overly onerous both in terms of processing and in terms of bookkeeping. For space maps that are resident, we can simply perform the operation on the AVL tree by iterating over each node and applying the necessary transformation. For space maps which aren't in core, we can do something rather clever: by taking advantage of the log structure, we can simply append a new type of space map entry that indicates that this operation should be applied. Today we have allocated, free, and debug; this would add fold as an additional operation. We'd apply that fold operation to each of the 200 or so space maps for the given vdev. Alternatively, using the idea of time-dependent geometry above, we could simply append a marker to the space map and access the geometry from that repository.

Normally, we only rewrite the space map if the on-disk, log-structure is twice as large as necessary. I'd argue that the fold operation should always trigger a rewrite since processing it always requires a O(n) operation, but that's really an ancillary central point.

vdev Update

At the same time as the previous operation, the vdev metadata will need to be updated to reflect the additional device. This is mostly just bookkeeping, and a matter of chasing down the relevant code paths to modify and augment.

Scrub

With the steps above, we're actually done for some definition since new data will spread be written in stripes that include the newly added device. The problem is that extant data will still be stored in the old geometry and most of the capacity of the new device will be inaccessible. The solution to this is to scrub the data reading off every block and rewriting it to a new location. Currently this isn't possible on ZFS, but Matt and Mark Maybee have been working on something they call block pointer rewrite which is needed to solve a variety of other problems and nicely completes this solution as well.

That's It

After Matt and I had finished thinking this through, I think we were both pleased by the relative simplicity of the solution. That's not to say that implementing it is going to be easy -- there's still plenty of gaps to fill in -- but the basic algorithm is sound. A nice property that falls out is that in addition to changing the number of data disks, it would also be possible to use the same mechanism to add an additional parity disk to go from single- to double-parity RAID-Z -- another common request.

So I can now extend a slightly more welcoming invitation to the ZFS community to engage on this problem and contribute in a very concrete way. I've posted some diffs which I used sketch out some ideas; that might be a useful place to start. If anyone would like to create a project on OpenSolaris.org to host any ongoing work, I'd be happy to help set that up.

Thursday Mar 13, 2008

The other day, there was an interesting post on the DTrace mailing list asking how to derive a process name from a pid. This really ought to be a built-in feature of D, but it isn't (at least not yet). I hacked up a solution to the user's problem by cribbing the algorithm from mdb's ::pid2proc function whose source code you can find here. The basic idea is that you need to look up the pid in pidhash to get a chain of struct pid that you need to walk until you find the pid in question. This in turn gives you an index into procdir which is an array of pointers to proc structures. To find out more about these structures, poke around the source code or mdb -k which is what I did.

The code isn't exactly gorgeous, but it gets the job done. It's a good example of probe-local variables (also somewhat misleadingly called clause-local variables), and demonstrates how you can use them to communicate values between clauses associated with a given probe during a given firing. You can try it out by running dtrace -c <your-command> -s <this-script>.

BEGIN
{
       this->pidp = `pidhash[$target & (`pid_hashsz - 1)];
       this->pidname = "-error-";
}

/* Repeat this clause to accommodate longer hash chains. */
BEGIN
/this->pidp->pid_id != $target && this->pidp->pid_link != 0/
{
       this->pidp = this->pidp->pid_link;
}

BEGIN
/this->pidp->pid_id != $target && this->pidp->pid_link == 0/
{
       this->pidname = "-no such process-";
}

BEGIN
/this->pidp->pid_id != $target && this->pidp->pid_link != 0/
{
       this->pidname = "-hash chain too long-";
}

BEGIN
/this->pidp->pid_id == $target/
{
       /* Workaround for bug 6465277 */
       this->slot = (*(uint32_t *)this->pidp) >> 8;

       /* AHA! We finally have the proc_t. */
       this->procp = `procdir[this->slot].pe_proc;

       /* For this example, we'll grab the process name to print. */
       this->pidname = this->procp->p_user.u_comm;
}

BEGIN
{
       printf("%d %s", $target, this->pidname);
}

Note that the second clause is the bit that walks the hash chain. You can repeat this clause as many times as you think will be needed to traverse the hash chain -- I really don't have any guidance here, but I imagine that a few times should suffice. Alternatively, you could construct a tick probe that steps along the hash chain to avoid a fixed limit. DTrace attempts to keep easy things easy and make difficult things possible. As evidenced by this example, possible doesn't necessarily correlate with beautiful.

Friday Jan 18, 2008

As has been thoroughly recorded, Apple has included DTrace in Mac OS X. I've been using it as often as I have the opportunity, and it's a joy to be able to use the fruits of our labor on another operating system. But I hit a rather surprising case recently which led me to discover a serious problem with Apple's implementation.

A common trick with DTrace is to use a tick probe to report data periodically. For example, the following script reports the ten most frequently accessed files every 10 seconds:

io:::start
{
	@[args[2]->fi_pathname] = count();
}

tick-10s
{
	trunc(@, 10);
	printa(@);
	trunc(@, 0);
}

This was running fine, but it seemed as though sometimes (particularly with certain apps in the background) it would occasionally skip one of the ten second iterations. Odd. So I wrote the following script to see what was going on:

profile-1000
{
	@ = count();
}

tick-1s
{
	printa(@);
	clear(@);
}

What this will do is fire a probe at 1000hz on all (logical) CPUs. Running this on a dual-core machine we'd expect to see it print out 2000 each time. Instead I saw this:

  0  22369                         :tick-1s 
             1803

  0  22369                         :tick-1s 
             1736

  0  22369                         :tick-1s 
             1641

  0  22369                         :tick-1s 
             3323

  0  22369                         :tick-1s 
             1704

  0  22369                         :tick-1s 
             1732

  0  22369                         :tick-1s 
             1697

  0  22369                         :tick-1s 
             5154

Kind of bizarre. The missing tick-1s probes explain the values over 2000, but weirder were the values so far under 2000. To explore a bit more I performed another DTrace experiment to see what applications were running when the profile probe fired:

# dtrace -n profile-997'{ @[execname] = count(); }'
dtrace: description 'profile-997' matched 1 probe
^C

  Finder                                                            1
  configd                                                           1
  DirectoryServic                                                   2
  GrowlHelperApp                                                    2
  llipd                                                             2
  launchd                                                           3
  mDNSResponder                                                     3
  fseventsd                                                         4
  mds                                                               4
  lsd                                                               5
  ntpd                                                              6
  kdcmond                                                           7
  SystemUIServer                                                    8
  dtrace                                                            8
  loginwindow                                                       9
  pvsnatd                                                          21
  Dock                                                             41
  Activity Monito                                                  45
  pmTool                                                           52
  Google Notifier                                                  60
  Terminal                                                        153
  WindowServer                                                    238
  Safari                                                         1361
  kernel_task                                                    4247

While there's nothing suspicious about the output in itself, it was strange because I was listening to music at the time. With iTunes. Where was iTunes?
I ran the first experiment again and caused iTunes to do more work which yielded these results:

  0  22369                         :tick-1s 
             3856

  0  22369                         :tick-1s 
             1281

  0  22369                         :tick-1s 
             4770

  0  22369                         :tick-1s 
             2271

So what was iTunes doing? To answer that I again turned to DTrace and used the following enabling to see what functions were being called most frequently by iTunes (whose process ID was 332):

# dtrace -n 'pid332:::entry{ @[probefunc] = count(); }'
dtrace: description 'pid332:::entry' matched 264630 probes

I let it run for a while, made iTunes do some work, and the result when I stopped the script? Nothing. The expensive DTrace invocation clearly caused iTunes to do a lot more work, but DTrace was giving me no output.
Which started me thinking... did they? Surely not. They wouldn't disable DTrace for certain applications.

But that's exactly what Apple's done with their DTrace implementation. The notion of true systemic tracing was a bit too egalitarian for their classist sensibilities so they added this glob of lard into dtrace_probe() -- the heart of DTrace:

#if defined(__APPLE__)
        /*
         * If the thread on which this probe has fired belongs to a process marked P_LNOATTACH
         * then this enabling is not permitted to observe it. Move along, nothing to see here.
         */
        if (ISSET(current_proc()->p_lflag, P_LNOATTACH)) {
            continue;
        }
#endif /* __APPLE__ */

Wow. So Apple is explicitly preventing DTrace from examining or recording data for processes which don't permit tracing. This is antithetical to the notion of systemic tracing, antithetical to the goals of DTrace, and antithetical to the spirit of open source. I'm sure this was inserted under pressure from ISVs, but that makes the pill no easier to swallow. To say that Apple has crippled DTrace on Mac OS X would be a bit alarmist, but they've certainly undermined its efficacy and, in doing do, unintentionally damaged some of its most basic functionality. To users of Mac OS X and of DTrace: Apple has done a service by porting DTrace, but let's convince them to go one step further and port it properly.

Saturday Oct 27, 2007

It's been more than a year since I first saw DTrace on Mac OS X, and now it's at last generally available to the public. Not only did Apple port DTrace, but they've also included a bunch of USDT providers. Perl, Python, Ruby -- they all ship in Leopard with built-in DTrace probes that allow developers to observe function calls, object allocation, and other points of interest from the perspective of that dynamic language. Apple did make some odd choices (e.g. no Java provider, spurious modifications to the publicly available providers, a different build process), but on the whole it's very impressive.

Perhaps it was too much to hope for, but with Apple's obvious affection for DTrace I thought they might include USDT probes for Safari. Specifically, probes in the JavaScript interpreter would empower developers in the same way they enabled Ruby, Perl, and Python developers. Fortunately, the folks at the Mozilla Foundation have already done the heavy lifting for Firefox -- it was just a matter of compiling Firefox on Mac OS X 10.5 with DTrace enabled:

There were some minor modifications I had to make to the Firefox build process to get everything working, but it wasn't too tricky. I'll try to get a patch submitted this week, and then Firefox will have the same probes on Mac OS X that it does -- thanks to Brendan's early efforts -- on Solaris. JavaScript developers take note: this is good news.