Adam Leventhal's Weblog
inside the sausage factory
Monday May 05, 2008
dtrace.conf post-post-mortem
This originally was going to be a post-mortem on dtrace.conf, but so much time has passed that I doubt it qualifies anymore. Back in March, we held the first ever DTrace (un)conference, and I hope I speak for all involved when I declare it a terrific success. And our t-shirts (logo pictured) were, frankly, bomb. Here are some fairly random impressions from the day:
Notes on the demographics at dtrace.conf: Macs were the most prevalent laptops by quite a wide margin, and a ton of demos were done under VMware for the Mac. There were a handful of dvorak users who far outnumbered the Esperanto speakers (there were none) despite apparently similar rationales. There were, by a wide margin, more live demonstrations than I'd ever seen during a day of technical talks; there were probably fewer individual slides than demos -- exactly what we had in mind.
My favorite session brought the authors of the three DTrace ports to the front of the room to talk about porting, and answer questions (mostly from the DTrace team). I was excited that they agreed to work together on a wiki and on a DTrace porting project. Both would be great for new ports and for building a single repository that could integrate all the ports. I just have to see if I can get them to follow through now several weeks removed from the DTrace love-in...
Also particularly interesting were a demonstration of a DTrace-enabled Adobe Air prototype and the very clever mechanism behind the Java group's plan for native Java static probes (JSDT). Essentially, they're using the same technique as normal USDT, but dynamically generating the tracing description structures and sending them down to the kernel (slick).
The most interesting discussion resulted from Keith's presentation of vprobes -- a DTrace... um... inspired facility in VMware. While it is necessary to place a unified tracing mechanism at the lowest level of software abstraction (in DTrace's case, the kernel), it may also make sense to embed collaborating tracing frameworks at other levels of the stack. For example, the JVM could include a micro-DTrace which communicated with DTrace in the kernel as needed. This would both improve enabled performance (not a primary focus of DTrace), and allow for better domain-specific instrumentation and expression. I'll be interested to see how vprobes executes on this idea.
Requests from the DTrace community:
- more providers à la the recent nfs and proposed ip providers
- consistency between providers (kudos to those sending their providers to the DTrace discussion list for review)
- better compatibility with the ports -- several people observed that while they love the port to Leopard, Apple's spurious exclusion of the -G option created tricky conflicts
Ben was kind enough to video the entire day. We should have the footage publicly available in about a week. Thanks to all who participated; several recent projects have already gotten me excited for dtrace.conf(09).
(2008-05-05 00:06:45.0/2008-05-05 00:06:10.0)
Trackback: http://blogs.sun.com/ahl/entry/dtrace_conf_post_post_mortem

Thursday April 10, 2008
DTrace and JavaOne: The End of the Beginning
It was a good run, but Jarod and I didn't make the cut for JavaOne this year...
2005
In 2005, Jarod came up with what he described as a jacked up way to use DTrace to get inside Java. This became the basis of the Java provider (first dvm for the 1.4.2 and 1.5 JVMs and now the hotspot provider for Java 6). That year, I got to stand up on stage at the keynote with John Loiacono and present DTrace for Java for the first time (to 10,000 people -- I was nervous). John was then the EVP of software at Sun. Shortly after that, he parlayed our keynote success into a sweet gig at Adobe (I was considered for the job, but ultimately rejected, they said, because their door frames couldn't accommodate my fro -- legal action is pending).
That year we also started the DTrace challenge. The premise was that if we chained up Jarod in the exhibition hall, developers could bring him their applications and he could use DTrace to find a performance win -- or he'd fork over a free iPod. In three years Jarod has given out one iPod and that one deserves a Bondsian asterisk.
After the excitement of the keynote, and the frenetic pace of the exhibition hall (and a haircut), Jarod and I anticipated at least fair interest in our talk, but we expected the numbers to be down a bit because we were presenting in the afternoon on the last day of the conference. We got to the room 15 minutes early to set up, skirting what we thought must have been the line for lunch, or free beer, or something, but turned out to be the line for our talk. Damn. It turns out that in addition to the 1,000 in the room, there was an overflow room with another 500-1,000 people. That first DTrace for Java talk had only the most basic features like tracing method entry and return, memory allocation, and Java stack backtraces -- but we already knew we were off to a good start.
2006
No keynote, but the DTrace challenge was on again and our talk reprised its primo slot on the last day of the conference after lunch (yes, that's sarcasm). That year the Java group took the step of including DTrace support in the JVM itself. It was also possible to dynamically turn instrumentation of the JVM off and on as opposed to the start-time option of the year before. In addition to our talk, there was a DTrace hands-on lab that was quite popular and got people some DTrace experience after watching what it can do in the hands of someone like Jarod.
2007
The DTrace talk in 2007 (again, last day of the conference after lunch) was actually one of my favorite demos I've given because I had never seen the technology we were presenting before. Shortly before JavaOne started, Lev Serebryakov from the Java group had built a way of embedding static probes in a Java program. While this isn't required to trace Java code, it does mean that developers can expose the higher level semantics of their programs to users and developers through DTrace. Jarod hacked up an example in his hotel room about 20 minutes before we presented, and amazingly it all went off without a hitch. How money is that?
JSDT -- as the Java Statically Defined Tracing is called -- is in development for the next version of the JVM, and is the next step for DTrace support of dynamic languages. Java was the first dynamic language that we considered for use with DTrace, and it's quite a tough environment to support due to the incredible sophistication of the JVM. That support has led the way for other dynamic languages such as Ruby, Perl, and Python which all now have built-in DTrace providers.
2008
For DTrace and Java, this is not the end. It is not even the beginning of the end. Jarod and I are out, but Jon, Simon, Angelo, Raghavan, Amit, and others are in. At JavaOne 2008 next month there will be a talk, a BOF, and a hands-on lab about DTrace for Java, and it's not even all Java: there's some PHP and JavaScript mixed in, and both also have their own DTrace providers. I've enjoyed speaking at JavaOne these past three years, and while it's good to pass the torch, I'll miss doing it again this year. If I have the time, and can get past security, I'll try to sneak into Jon and Simon's talk -- though it will be a departure from tradition for a DTrace talk to fall on a day other than the last.
(2008-04-10 10:52:38.0/2008-04-10 09:00:00.0)
Trackback: http://blogs.sun.com/ahl/entry/dtrace_at_javaone_no_more

Monday April 07, 2008
I was having a conversation with an OpenBSD user and developer the other day, and he mentioned some ongoing work in the community to consolidate support for RAID controllers. The problem, he was saying, was that each controller had a different administrative model and utility -- but all I could think was that the real problem was the presence of a RAID controller in the first place! As far as I'm concerned, ZFS and RAID-Z have obviated the need for hardware RAID controllers.
ZFS users seem to love RAID-Z, but a frustratingly frequent request is to be able to expand the width of a RAID-Z stripe. While the ZFS community may care about solving this problem, it's not the highest priority for Sun's customers and, therefore, for the ZFS team. It's common for a home user to want to increase his total storage capacity by a disk or two at a time, but enterprise customers typically want to grow by multiple terabytes at once so adding on a new RAID-Z stripe isn't an issue. When the request has come up on the ZFS discussion list, we have, perhaps unhelpfully, pointed out that the code is all open source and ready for that contribution. Partly, it's because we don't have time to do it ourselves, but also because it's a tricky problem and we weren't sure how to solve it.
Jeff Bonwick did a great job explaining how RAID-Z works, so I won't go into it too much here, but the structure of RAID-Z makes it a bit trickier to expand than other RAID implementations. On a typical RAID with N+M disks, N data sectors will be written with M parity sectors. Those N data sectors may contain unrelated data so modifying data on just one disk involves reading the data off that disk and updating both those data and the parity data. Expanding a RAID stripe in such a scheme is as simple as adding a new disk and updating the parity (if necessary). With RAID-Z, blocks are never rewritten in place, and there may be multiple logical RAID stripes (and multiple parity sectors) in a given row; we therefore can't expand the stripe nearly as easily.
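To make the conventional read-modify-write cycle concrete, here is a minimal sketch in Python. It assumes a single XOR parity sector (M = 1, RAID-5 style); the post doesn't say which parity function is in play, and the names make_parity and update_sector are hypothetical, chosen only for illustration.

```python
from functools import reduce


def make_parity(data_sectors):
    """XOR equal-length data sectors together into one parity sector."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*data_sectors))


def update_sector(data_sectors, parity, index, new_data):
    """Overwrite one data sector in place and patch the parity to match.

    Only the old contents of that one sector and the old parity are read;
    the other N - 1 data sectors are never touched.
    """
    old_data = data_sectors[index]
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(parity, old_data, new_data))
    data_sectors[index] = new_data
    return new_parity


# A stripe with N = 3 data sectors of 4 bytes each (sizes chosen for illustration).
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = make_parity(stripe)
parity = update_sector(stripe, parity, 1, b"\xff\x00\xff\x00")

# Widening the stripe with a zero-filled disk leaves XOR parity unchanged,
# which is one reading of the "if necessary" caveat for this parity function.
widened = stripe + [bytes(4)]
```

Because the update happens in place, the stripe geometry never has to be recorded per block; that is exactly the property RAID-Z gives up by never rewriting blocks in place.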
A couple of weeks ago, I had lunch with Matt Ahrens to come up with a mechanism for expanding RAID-Z stripes -- we were both tired of having to deflect reasonable requests from users -- and, lo and behold, we figured out a viable technique that shouldn't be very tricky to implement. While Sun still has no plans to allocate resources to the problem, this roadmap should lend credence to the suggestion that someone in the community might work on the problem.
The rest of this post will discuss the implementation of expandable RAID-Z; it's not intended for casual users of ZFS, and there are no alchemic secrets buried in the details. It would probably be useful to familiarize yourself with the basic structure of ZFS, space maps (totally cool by the way), and the code for RAID-Z.
Dynamic Geometry
ZFS uses vdevs -- virtual devices -- to store data. A vdev may correspond to a disk or a file, or it may be an aggregate such as a mirror or RAID-Z. Currently the RAID-Z vdev determines the stripe width from the number of child vdevs. To allow for RAID-Z expansion, the geometry would need to be a more dynamic property. The storage pool code that uses the vdev would need to determine the geometry for the current block and then pass that as a parameter to the various vdev functions.
There are two ways to record the geometry. The simplest is to use the GRID bits (an 8 bit field) in the DVA (Device Virtual Address) which have already been set aside, but are currently unused. In this case, the vdev would need to have a new callback to set the contents of the GRID bits, and then a parameter to several of its other functions to pass in the GRID bits to indicate the geometry of the vdev when the block was written. An alternative approach suggested by Jeff and Bill Moore is something they call time-dependent geometry. The basic idea is that we store a record each time the geometry of a vdev is modified and then use the creation time for a block to infer the geometry to pass to the vdev. This has the advantage of conserving precious bits in the fixed-width DVA (though at 128 bits it's still quite big), but it is quite a bit more complex since it would require essentially new metadata hanging off each RAID-Z vdev.
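The time-dependent geometry idea can be sketched as a sorted list of change records searched by a block's creation time. This is only a rough model in Python: the GeometryHistory class, its method names, and the use of plain integers for time are all assumptions for illustration, not anything from the ZFS source.

```python
import bisect


class GeometryHistory:
    """Per-vdev record of stripe-width changes, ordered by time (hypothetical)."""

    def __init__(self, initial_width):
        self._times = [0]                # time at which each width took effect
        self._widths = [initial_width]   # stripe width from that time onward

    def record_change(self, time, new_width):
        """Append a record when the vdev's geometry is modified."""
        if time <= self._times[-1]:
            raise ValueError("geometry changes must be recorded in increasing order")
        self._times.append(time)
        self._widths.append(new_width)

    def width_at(self, birth_time):
        """Infer the geometry a block was written with from its creation time."""
        i = bisect.bisect_right(self._times, birth_time) - 1
        if i < 0:
            raise ValueError("block predates the vdev")
        return self._widths[i]


history = GeometryHistory(initial_width=4)
history.record_change(time=1000, new_width=5)

old_block_width = history.width_at(999)    # written before the expansion
new_block_width = history.width_at(1000)   # written at or after the expansion
```

The lookup costs a binary search per block rather than a few bits in every DVA, which is the trade-off described above: no bits consumed, but new metadata to store and consult for each RAID-Z vdev.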
Metaslab Folding
When the user requests a RAID-Z vdev be expanded (via an existing or new zpool(1M) command-line option) we'll apply a new fold operation to the space map for each metaslab. This transformation will take into account the space we're about to add with the new devices. Each range [a, b] under a fold from width n to width m will become
[ m * (a / n) + (a % n), m * (b / n) + (b % n) ]
The alternative would have been to account for m - n free blocks at the end of every stripe, but that would have been overly onerous both in terms of processing and in terms of bookkeeping. For space maps that are resident, we can simply perform the operation on the AVL tree by iterating over each node and applying the necessary transformation. For space maps which aren't in core, we can do something rather clever: by taking advantage of the log structure, we can simply append a new type of space map entry that indicates that this operation should be applied. Today we have allocated, free, and debug; this would add fold as an additional operation. We'd apply that fold operation to each of the 200 or so space maps for the given vdev. Alternatively, using the idea of time-dependent geometry above, we could simply append a marker to the space map and access the geometry from that repository.
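For the in-core case, the fold amounts to applying the formula above to both endpoints of every range. A minimal sketch in Python, assuming integer (truncating) division as in C and a space map represented as a plain list of (a, b) pairs rather than an AVL tree; the function names are hypothetical.

```python
def fold_offset(offset, n, m):
    """Map an offset laid out at stripe width n to its position at width m."""
    row, column = divmod(offset, n)   # same row and column, wider rows
    return m * row + column


def fold_range(a, b, n, m):
    """Apply the fold to both endpoints of the range [a, b]."""
    return fold_offset(a, n, m), fold_offset(b, n, m)


def fold_space_map(ranges, n, m):
    """Fold every range in an in-core space map (modeled here as a list)."""
    return [fold_range(a, b, n, m) for a, b in ranges]


# Expanding from 4 to 5 devices (numbers chosen for illustration).
ranges = [(0, 3), (6, 13)]
folded = fold_space_map(ranges, n=4, m=5)
```

Since the mapping is monotonic in the offset, ranges keep their relative order and cannot start overlapping, so a single pass over the nodes is enough.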
Normally, we only rewrite the space map if the on-disk log structure is twice as large as necessary. I'd argue that the fold operation should always trigger a rewrite since processing it always requires an O(n) operation, but that's really an ancillary point.
vdev Update
At the same time as the previous operation, the vdev metadata will need to be updated to reflect the additional device. This is mostly just bookkeeping, and a matter of chasing down the relevant code paths to modify and augment.
Scrub
With the steps above, we're actually done, for some definition of done, since new data will be written in stripes that include the newly added device. The problem is that extant data will still be stored in the old geometry and most of the capacity of the new device will be inaccessible. The solution to this is to scrub the data, reading off every block and rewriting it to a new location. Currently this isn't possible on ZFS, but Matt and Mark Maybee have been working on something they call block pointer rewrite which is needed to solve a variety of other problems and nicely completes this solution as well.
That's It
After Matt and I had finished thinking this through, I think we were both pleased by the relative simplicity of the solution. That's not to say that implementing it is going to be easy -- there are still plenty of gaps to fill in -- but the basic algorithm is sound. A nice property that falls out is that in addition to changing the number of data disks, it would also be possible to use the same mechanism to add an additional parity disk to go from single- to double-parity RAID-Z -- another common request.
So I can now extend a slightly more welcoming invitation to the ZFS community to engage on this problem and contribute in a very concrete way. I've posted some diffs which I used to sketch out some ideas; that might be a useful place to start. If anyone would like to create a project on OpenSolaris.org to host any ongoing work, I'd be happy to help set that up.
(2008-04-08 13:41:33.0/2008-04-07 21:59:03.0)
Trackback: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Thursday March 13, 2008
The other day, there was an interesting post on the DTrace mailing list asking how to derive a process name from a pid. This really ought to be a built-in feature of D, but it isn't (at least not yet). I hacked up a solution to the user's problem by cribbing the algorithm from mdb's ::pid2proc function whose source code you can find here. The basic idea is that you need to look up the pid in pidhash to get a chain of struct pid that you need to walk until you find the pid in question. This in turn gives you an index into procdir which is an array of pointers to proc structures. To find out more about these structures, poke around the source code or mdb -k which is what I did.
The code isn't exactly gorgeous, but it gets the job done. It's a good example of probe-local variables (also somewhat misleadingly called clause-local variables), and demonstrates how you can use them to communicate values between clauses associated with a given probe during a given firing. You can try it out by running dtrace -c <your-command> -s <this-script>.
BEGIN
{
    this->pidp = `pidhash[$target & (`pid_hashsz - 1)];
    this->pidname = "-error-";
}

/* Repeat this clause to accommodate longer hash chains. */
BEGIN
/this->pidp->pid_id != $target && this->pidp->pid_link != 0/
{
    this->pidp = this->pidp->pid_link;
}

BEGIN
/this->pidp->pid_id != $target && this->pidp->pid_link == 0/
{
    this->pidname = "-no such process-";
}

BEGIN
/this->pidp->pid_id != $target && this->pidp->pid_link != 0/
{
    this->pidname = "-hash chain too long-";
}

BEGIN
/this->pidp->pid_id == $target/
{
    /* Workaround for bug 6465277 */
    this->slot = (*(uint32_t *)this->pidp) >> 8;

    /* AHA! We finally have the proc_t. */
    this->procp = `procdir[this->slot].pe_proc;

    /* For this example, we'll grab the process name to print. */
    this->pidname = this->procp->p_user.u_comm;
}

BEGIN
{
    printf("%d %s", $target, this->pidname);
}
Note that the second clause is the bit that walks the hash chain. You can repeat this clause as many times as you think will be needed to traverse the hash chain -- I really don't have any guidance here, but I imagine that a few times should suffice. Alternatively, you could construct a tick probe that steps along the hash chain to avoid a fixed limit. DTrace attempts to keep easy things easy and make difficult things possible. As evidenced by this example, possible doesn't necessarily correlate with beautiful.
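For comparison, the lookup that the unrolled clauses approximate is easy to state in a language with loops. The following Python model is only a sketch: the Pid and Proc classes are hypothetical stand-ins for the kernel's structures, the slot field stands for the value the script extracts with its shift, and the hash table size is assumed to be a power of two (which the mask in the first clause implies).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proc:
    u_comm: str                        # process name, as in p_user.u_comm


@dataclass
class Pid:
    pid_id: int                        # the process ID itself
    slot: int                          # index into procdir
    pid_link: Optional["Pid"] = None   # next entry on the same hash chain


def pid2proc(pid, pidhash, procdir):
    """Walk the hash chain for pid and return its Proc, or None if absent."""
    pidp = pidhash[pid & (len(pidhash) - 1)]
    while pidp is not None and pidp.pid_id != pid:
        pidp = pidp.pid_link           # the step the repeated clause performs
    if pidp is None:
        return None                    # "-no such process-"
    return procdir[pidp.slot]


# Pids 5 and 9 collide in bucket 1 of a four-entry table (made-up data).
procdir = [Proc("init"), Proc("sshd"), Proc("bash")]
pidhash = [None, Pid(9, 2, Pid(5, 1)), None, None]

found = pid2proc(5, pidhash, procdir)
missing = pid2proc(13, pidhash, procdir)
```

D has no loops, so each trip through the while loop here corresponds to one copy of the second clause in the script, and the "-hash chain too long-" case is what happens when the copies run out first.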
(2008-03-13 01:11:05.0/2008-03-13 01:11:05.0)
Trackback: http://blogs.sun.com/ahl/entry/pid2proc_for_dtrace

Friday January 18, 2008
Mac OS X and the missing probes
As has been thoroughly recorded, Apple has included DTrace in Mac OS X. I've been using it as often as I have the opportunity, and it's a joy to be able to use the fruits of our labor on another operating system. But I hit a rather surprising case recently which led me to discover a serious problem with Apple's implementation.
A common trick with DTrace is to use a tick probe to report data periodically. For example, the following script reports the ten most frequently accessed files every 10 seconds:
io:::start
{
    @[args[2]->fi_pathname] = count();
}

tick-10s
{
    trunc(@, 10);
    printa(@);
    trunc(@, 0);
}
This was running fine, but it seemed as though sometimes (particularly with certain apps in the background) it would occasionally skip one of the ten second iterations. Odd. So I wrote the following script to see what was going on:
profile-1000
{
    @ = count();
}

tick-1s
{
    printa(@);
    clear(@);
}
What this will do is fire a probe at 1000hz on all (logical) CPUs. Running this on a dual-core machine we'd expect to see it print out 2000 each time. Instead I saw this:
0 22369 :tick-1s
1803
0 22369 :tick-1s
1736
0 22369 :tick-1s
1641
0 22369 :tick-1s
3323
0 22369 :tick-1s
1704
0 22369 :tick-1s
1732
0 22369 :tick-1s
1697
0 22369 :tick-1s
5154
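One way to read these numbers is to compare each count against the 2000 firings per second expected from a 1000hz probe on two CPUs. The Python sketch below does that arithmetic under a loud assumption: each printed count is taken to cover the smallest whole number of seconds that could have produced it. The analyze helper is hypothetical, and the real windows may have been longer.

```python
HZ = 1000                 # profile-1000 fires 1000 times per second per CPU
NCPUS = 2                 # the dual-core machine from the post
PER_SECOND = HZ * NCPUS   # expected firings per one-second window


def analyze(count):
    """Return (assumed seconds covered, firings unaccounted for)."""
    seconds = max(1, -(-count // PER_SECOND))   # ceiling division
    shortfall = seconds * PER_SECOND - count
    return seconds, shortfall


observed = [1803, 1736, 1641, 3323, 1704, 1732, 1697, 5154]
for count in observed:
    seconds, shortfall = analyze(count)
    print(f"{count:5d}: ~{seconds}s window, {shortfall} firings unaccounted for")
```

Under that assumption every window comes up short, including the ones that absorbed a skipped tick-1s, so missing ticks and missing profile firings look like two symptoms of the same thing.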
Kind of bizarre. The missing tick-1s probes explain the values over 2000, but weirder were the values so far under 2000. To explore a bit more I performed another DTrace experiment to see what applications were running when the profile probe fired:
# dtrace -n profile-997'{ @[execname] = count(); }'
dtrace: description 'profile-997' matched 1 probe
^C
Finder 1
configd 1
DirectoryServic 2
GrowlHelperApp 2
llipd 2
launchd 3
mDNSResponder 3
fseventsd 4
mds 4
lsd 5
ntpd 6
kdcmond 7
SystemUIServer 8
dtrace 8
loginwindow 9
pvsnatd 21
Dock 41
Activity Monito 45
pmTool 52
Google Notifier 60
Terminal 153
WindowServer 238
Safari 1361
kernel_task 4247
While there's nothing suspicious about the output in itself, it was strange because I was listening to music at the time. With iTunes. Where was iTunes?
I ran the first experiment again and caused iTunes to do more work which yielded these results:
0 22369 :tick-1s
3856
0 22369 :tick-1s
1281
0 22369 :tick-1s
4770
0 22369 :tick-1s
2271
So what was iTunes doing? To answer that I again turned to DTrace and used the following enabling to see what functions were being called most frequently by iTunes (whose process ID was 332):
# dtrace -n 'pid332:::entry{ @[probefunc] = count(); }'
dtrace: description 'pid332:::entry' matched 264630 probes
I let it run for a while, made iTunes do some work, and the result when I stopped the script? Nothing. The expensive DTrace invocation clearly caused iTunes to do a lot more work, but DTrace was giving me no output.
Which started me thinking... did they? Surely not. They wouldn't disable DTrace for certain applications.
But that's exactly what Apple's done with their DTrace implementation. The notion of true systemic tracing was a bit too egalitarian for their classist sensibilities so they added this glob of lard into dtrace_probe() -- the heart of DTrace:
#if defined(__APPLE__)
    /*
     * If the thread on which this probe has fired belongs to a process marked P_LNOATTACH
     * then this enabling is not permitted to observe it. Move along, nothing to see here.
     */
    if (ISSET(current_proc()->p_lflag, P_LNOATTACH)) {
        continue;
    }
#endif /* __APPLE__ */
Wow. So Apple is explicitly preventing DTrace from examining or recording data for processes which don't permit tracing. This is antithetical to the notion of systemic tracing, antithetical to the goals of DTrace, and antithetical to the spirit of open source. I'm sure this was inserted under pressure from ISVs, but that makes the pill no easier to swallow. To say that Apple has crippled DTrace on Mac OS X would be a bit alarmist, but they've certainly undermined its efficacy and, in doing so, unintentionally damaged some of its most basic functionality. To users of Mac OS X and of DTrace: Apple has done a service by porting DTrace, but let's convince them to go one step further and port it properly.
(2008-01-18 23:54:56.0/2008-01-18 23:49:35.0)
Trackback: http://blogs.sun.com/ahl/entry/mac_os_x_and_the

Saturday October 27, 2007
It's been more than a year since I first saw DTrace on Mac OS X, and now it's at last generally available to the public. Not only did Apple port DTrace, but they've also included a bunch of USDT providers. Perl, Python, Ruby -- they all ship in Leopard with built-in DTrace probes that allow developers to observe function calls, object allocation, and other points of interest from the perspective of that dynamic language. Apple did make some odd choices (e.g. no Java provider, spurious modifications to the publicly available providers, a different build process), but on the whole it's very impressive.
Perhaps it was too much to hope for, but with Apple's obvious affection for DTrace I thought they might include USDT probes for Safari. Specifically, probes in the JavaScript interpreter would empower developers in the same way they enabled Ruby, Perl, and Python developers. Fortunately, the folks at the Mozilla Foundation have already done the heavy lifting for Firefox -- it was just a matter of compiling Firefox on Mac OS X 10.5 with DTrace enabled:
There were some minor modifications I had to make to the Firefox build process to get everything working, but it wasn't too tricky. I'll try to get a patch submitted this week, and then Firefox will have the same probes on Mac OS X that it does -- thanks to Brendan's early efforts -- on Solaris. JavaScript developers take note: this is good news.
(2007-10-27 11:00:46.0/2007-10-27 10:46:50.0)
Trackback: http://blogs.sun.com/ahl/entry/dtrace_firefox_leopard

Monday August 06, 2007
What-If Machine: DTrace Port

What if there were a port of DTrace to Linux?
What if there were a port of DTrace to Linux: could such a thing be done without violating either the GPL or CDDL? Read on before you jump right to the comments section to add your two cents.
In my last post, I discussed an attempt to create a DTrace knockoff in Linux, and suggested that a port might be possible. Naively, I hoped that comments would examine the heart of my argument, bemoan the apparent NIH in the Linux knockoff, regret the misappropriation of slideware, and maybe discuss some technical details -- anything but dwell on licensing issues.
For this post, I welcome the debate. Open source licenses are important, and the choice can have a profound impact on the success of the software and the community. But conversations comparing the excruciating minutiae of one license and another are exhausting, and usually become pointless in a hurry. Having a concrete subject might lead to a productive conversation.
DTrace Port Details
Just for the sake of discussion, let's say that Google decides to port DTrace to Linux (everyone loves Google, right?). This isn't so far-fetched: Google uses Linux internally, maybe they're using SystemTap, maybe they're not happy with it, but they definitely (probably) care about dynamic tracing (just like all good system administrators and developers should). So suppose some engineers at Google take the following (purely hypothetical) steps:
Kernel Hooks
DTrace has a little bit of functionality that lives in the core kernel. The code to deal with invalid memory accesses, some glue between the kernel's dynamic linker and some of the DTrace instrumentation providers, and some simple, low-level routines cover the bulk of it. My guess is there are about 1500 lines of code all told: not trivial, but hardly insurmountable. Google implements these facilities in a manner designed to allow the results to be licensed under the GPL. For example, I think it would be sufficient for someone to draft a specification and for someone else to implement it so long as the person implementing it hadn't seen the CDDL version. Google then posts the patch publicly.
DTrace Kernel Modules
The other DTrace kernel components are divided into several loadable kernel modules. There's the main DTrace module and then the instrumentation provider modules that connect to the core framework through an internal interface. These constitute the vast majority of the in-kernel DTrace code. Google modifies these to use slightly different interfaces (e.g. mutex_enter() becomes mutex_lock()); the final result is a collection of kernel modules still licensed under the CDDL. Of course, Google posts any modifications to CDDL files.
DTrace Libraries and Commands
It wouldn't happen for free, but the DTrace user-land components could just be directly ported. I don't believe there are any legal issues here.
So let's say that this is Google's DTrace port: their own hacked up kernel, some kernel modules operating under a non-GPL license, and some user-land components (also under a non-GPL license, but, again, I don't think that matters). Now some questions:
1. Legal To Run?
If Google assembled such a system, would it be legal to run on a development desktop machine? It seems to violate the GPL no more than, say, the nVidia drivers (which are presumably also running on that same desktop). What if Google installed the port on a customer-facing machine? Are there any additional legal complications there? My vote: legit.
2. Legal To Distribute?
Google distributes the Linux kernel patch (so that others can construct an identical kernel), and elsewhere they distribute the Linux-ready DTrace modules (in binary or source form): would that violate either license? It seems that it would potentially violate the GPL if a full system with both components were distributed together, but distributed individually it would certainly be fine. My vote: legit, but straying into a bit of a gray area.
3. Patch Accepted?
I'm really just putting this here for completeness. Google then submits the changes to the Linux kernel and tries to get them accepted upstream. There seems to be a precedent for the Linux kernel not accepting code that's there merely to support non-GPL kernel modules, so I doubt this would fly. My vote: not gonna happen.
4. No Source?
What if Google didn't supply the source code to either component, and didn't distribute any of it externally? My vote: legal, but morally bankrupt.
You Make The Call
So what do you think? Note that I'm not asking if it would be "good", and I'm not concluding that this would obviate the need for direct support for a native dynamic tracing framework in the Linux kernel. What I want to know is whether or not this DTrace port to Linux would be legal (and why). If not, what would happen to poor Google (e.g. would FSF ninjas storm the Googleplex)?
If you care to comment, please include some brief statement about your legal expertise. I for one am not a lawyer, have no legal background, have read both the GPL and CDDL and have a basic understanding of both, but claim to be an authority in neither. If you don't include some information with regard to that, I may delete your comment.
(2007-08-06 15:11:11.0/2007-08-06 06:00:00.0)
Trackback: http://blogs.sun.com/ahl/entry/what_if_machine_dtrace_port

Thursday August 02, 2007
Update 8/6/2007: Those of you interested in this entry may also want to check out my next entry on the legality of a hypothetical port of DTrace to Linux.

Tools We Wish We Had -- OSCON 7/26/2007
Last week at OSCON someone set up a whiteboard with the heading "Tools We Wish We Had". People added entries (wiki-style); this one in particular caught my eye:
dtrace for Linux
or something similar
(LIKE SYSTEMTAP?)
- jdub
(NO, LIKE dtrace)
- VLAD
(like systemtap, but not crap)
DTrace
So what exactly were they asking for? DTrace is the tool developers and sysadmins have always needed -- whether they knew it or not -- but weren't able to express in words let alone code. Most simply (and least humbly) DTrace lets you express a question about nearly any aspect of the system and get the answer in a simple and concise form. And -- this is important -- you can do it safely on machines running in production as well as in development. With DTrace, you can look at the highest level software such as Ruby (as was the subject of my talk at OSCON) through all the layers of the software stack down to the lowest level kernel facilities such as I/O and scheduling. This systemic scope, production focus, and arbitrary flexibility are completely new, and provide literally unprecedented observability into complex software systems. We're scientists, we're detectives -- DTrace lets us form hypotheses, and prove or disprove them in an instant until we've come to an understanding of the problem, until we've solved the crime. Of course anyone using Linux would love a tool like that -- especially because DTrace is already available on Mac OS X, Solaris, and FreeBSD.
SystemTap
So is SystemTap like DTrace? To understand SystemTap, it's worth touching on the history of DTrace: Bryan cut the first code for DTrace in October of 2001; Mike tagged in moments later, and I joined up after a bit. In September of 2003 we integrated DTrace into Solaris 10 which first became available to customers in November of 2003 and formally shipped and was open-sourced in January of 2005. Almost instantly we started to see the impact in the field. In terms of performance, Solaris has strong points and weak points; with DTrace we were suddenly able to understand where those bottlenecks were on customer systems and beat out other vendors by improving our performance -- not in weeks or months, but literally in a few hours. Now, I'm not saying that DTrace was the silver bullet by which all enemies were slain -- that's clearly not the case -- but it was turning some heads and winning some deals.
Now, this bit involves some hearsay and conjecture[1], but apparently some managers of significance at Red Hat, IBM, and Intel started to take note. "We've got to do something about this DTrace," one of them undoubtedly said with a snarl (as an underling dragged off the fresh corpse of an unlucky messenger). SystemTap was a direct reaction to the results we were achieving with DTrace -- not to DTrace as an innovative technology.
When the project started in January of 2005, early discussion by the SystemTap team referred to "inspiration" that they derived from DTrace. They had a mandate to come up with an equivalent, so I assumed that they had spent the time to truly understand DTrace: to come up with an equivalent for DTrace -- or really to duplicate any technology -- the first step is to understand what it is completely. From day one, DTrace was designed to be used on mission critical systems, to always be safe, to induce no overhead when not in use, to allow for arbitrary data gathering, and to have systemic scope from the kernel to user-land and on up the stack into higher level languages. Those fundamental constraints led to some important, and non-obvious design decisions (e.g. our own language "D", a micro virtual machine, conservative probe point selection).
Instead of taking the time to understand DTrace, and instead of using it and scouring the documentation, SystemTap charged ahead, completely missing the boat on safety with an architecture which is nearly impossible to secure (e.g. running a SystemTap script drops in a generated kernel module). Truly systemic scope remains an elusive goal as they're only toe-deep in user-land (forget about Ruby, Java, Python, etc). And innovations in DTrace such as scalable data aggregation and speculative tracing are replicated poorly if at all. By failing to examine DTrace, and by rushing to have some sort of response, SystemTap isn't like DTrace: it's a knockoff.
Amusingly, in an apparent attempt to salvage their self-respect, the SystemTap team later renounced their inspiration. Despite frequent mentions of DTrace in their early meetings and email, it turns out, DTrace didn't actually inspire them much at all:
CVSROOT: /cvs/systemtap
Module name: src
Changes by: kenistoj@sourceware.org 2006-11-02 23:03:09
Modified files:
. : stap.1.in
Log message:
Removed refs to dtrace, to which we were giving undue credit in terms of
"inspiration."

you're not my real dad! <slam>
Bad Artists Copy...
So uninspired was the SystemTap team by DTrace, that they don't even advocate its use according to a presentation on profiling applications ("Tools that we avoid - dtrace [sic]"). In that same presentation there's an example of a SystemTap-based tool called udpstat.stp:
$ udpstat.stp
UDP_out UDP_outErr UDP_in UDP_inErr UDP_noPort
0 0 0 0 0
0 0 0 0 0
4 0 0 0 0
5 0 0 0 0
5 0 0 0 0
... whose output was likely "inspired" by udpstat.d -- part of the DTraceToolkit by Brendan Gregg:
# udpstat.d
UDP_out UDP_outErr UDP_in UDP_inErr UDP_noPort
0 0 0 0 1
0 0 0 0 2
0 0 0 0 0
1165 0 2 0 0
In another act of imitation reminiscent of liberal teenage borrowing from Wikipedia, take a look at Eugene Teo's slides from Red Hat Summit 2007 as compared with Brendan's DTrace Topics Intro wiki (the former apparently being generated by applying a sed script to the latter). For example:
What isn’t SystemTap
- SystemTap isn’t sentient; requires user thinking process
- SystemTap isn’t a replacement for any existing tools
What isn't DTrace
- DTrace isn't a replacement for kstat or SMNP
- kstat already provides inexpensive long term monitoring.
- DTrace isn't sentient, it needs to borrow your brain to do the thinking
- DTrace isn't “dTrace”
... Great Artists Steal
While some have chosen the knockoff route, others have taken the time to analyze what DTrace does, understood the need, and decided that the best DTrace equivalent would be... DTrace. As with the rest of Solaris, DTrace is open source so developers and customers are excited about porting. Just a few days ago there were a couple of interesting blog posts (here and here) by users of ONTAP, NetApp's appliance OS, asking not for a DTrace equivalent, but for a port of DTrace itself.
DTrace is already available in the developer builds of Mac OS X 10.5, and there's a functional port for FreeBSD. I don't think it's a stretch to say that DTrace itself is becoming the measuring stick -- if not the standard. Why reinvent the wheel when you can port it?
Time For Standards
At the end of my talk last week someone asked if there was a port of DTrace to Linux (not entirely surprising since OSCON has a big Linux user contingent). I told him to ask the Linux bigwigs (several of them were also at the conference); after all, we didn't do the port to Mac OS X, and we didn't do the port to FreeBSD. We did extend our help to those developers, and they, in turn, helped DTrace by growing the community and through direct contributions[2].
We love to see DTrace on other operating systems, and we're happy to help.
So to the pretenders: enough already with the knockoffs. Your users want DTrace, you obviously want what DTrace offers, and the entire DTrace team and community are eager to help. I'm sure there's been some FUD about license incompatibilities, but it's certainly Sun's position (as stated by Sun's CEO Jonathan Schwartz at OSCON 2005) that such a port wouldn't violate the OpenSolaris license. And even closed-source kernel components are tolerated from the likes of Symantec (nee Veritas) and nVidia. Linux has been a champion of standards, eschewing proprietary solutions for free and open standards. DTrace might not yet be a standard, but a DTrace knockoff never will be.
[1] ... those are kinds of evidence
[2] including posts on the DTrace discussion forum comprehensible only to me and James
(2007-08-06 12:55:53.0/2007-08-02 11:36:16.0)
Tuesday July 31, 2007
DTrace for Ruby at OSCON 2007
I just got back from OSCON, a conference on Open Source that O'Reilly hosts in Portland annually. The conference offered some interesting content and side-shows with some notable highlights (more on those in the next few days). Brendan and I gave a presentation on how a crew from Sun dropped in on Twitter to help them use DTrace to discover some nasty performance problems.
Here's the presentation along with the D scripts and load generators we used for the talk.
(2007-08-01 12:44:49.0/2007-07-31 10:53:59.0)
Thursday July 05, 2007
Robert Scoble was kind enough to interview us last week for the ScobleShow. Robert pretty much let us riff continuously for half an hour -- we clearly haven't been getting to talk about DTrace enough lately. I thought he would trim it down a bit, but like a scene from Hard Boiled, it's all there.
This picture captures my favorite moment (around 16:23) during the interview as the three of us try to formulate a connection between DTrace and green computing...
... and this is as good a time as any to plug the talk Brendan and I will be giving at the end of the month at OSCON. We'll be talking about how team DTrace was able to solve some nasty Ruby scalability problems at Twitter; it's in the Ruby track, but the principles apply for analysis of performance problems in all languages.
(2007-07-31 10:54:25.0/2007-07-05 14:13:47.0)
Wednesday July 04, 2007
iSCSI DTrace provider and more to come
People often ask about the future direction of DTrace, and while we have some stuff planned for the core infrastructure, the future is really about extending DTrace's scope into every language, protocol, and application with new providers -- and this development is being done by many different members of the DTrace community. An important goal of this new work is to have consistent providers that work predictably. To that end, Brendan and I have started to sketch out an array of providers so that we can build a consistent model.
In that vein, I recently integrated a provider for our iSCSI target into Solaris Nevada (build 69, and it should be in a Solaris 10 update, but don't ask me which one). It's a USDT provider so the process ID is appended to the name; you can use * to avoid typing the PID of the iSCSI target daemon. Here are the probes with their arguments (some of the names are obvious; for others you might need to refer to the iSCSI spec):
| probe name | args[0] | args[1] | args[2] |
|---|---|---|---|
| iscsi*:::async-send | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::login-command | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::login-response | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::logout-command | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::logout-response | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::data-receive | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::data-request | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::data-send | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::nop-receive | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::nop-send | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::scsi-command | conninfo_t * | iscsiinfo_t * | iscsicmd_t * |
| iscsi*:::scsi-response | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::task-command | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::task-response | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::text-command | conninfo_t * | iscsiinfo_t * | - |
| iscsi*:::text-response | conninfo_t * | iscsiinfo_t * | - |
The argument structures are defined as follows:
typedef struct conninfo {
    string ci_local;        /* local host address */
    string ci_remote;       /* remote host address */
    string ci_protocol;     /* protocol (ipv4, ipv6, etc) */
} conninfo_t;

typedef struct iscsiinfo {
    string ii_target;       /* target iqn */
    string ii_initiator;    /* initiator iqn */
    uint64_t ii_lun;        /* target logical unit number */
    uint32_t ii_itt;        /* initiator task tag */
    uint32_t ii_ttt;        /* target transfer tag */
    uint32_t ii_cmdsn;      /* command sequence number */
    uint32_t ii_statsn;     /* status sequence number */
    uint32_t ii_datasn;     /* data sequence number */
    uint32_t ii_datalen;    /* length of data payload */
    uint32_t ii_flags;      /* probe-specific flags */
} iscsiinfo_t;

typedef struct iscsicmd {
    uint64_t ic_len;        /* CDB length */
    uint8_t *ic_cdb;        /* CDB data */
} iscsicmd_t;
Note that the arguments go from most generic (the connection for the application protocol) to most specific. As an aside, we'd like future protocol providers to make use of the conninfo_t so that one could write a simple script to see a table of frequent consumers for all protocols:
iscsi*:::,
http*:::,
cifs:::
{
    @[args[0]->ci_remote] = count();
}
With the iSCSI provider you can quickly see which LUNs are most active:
iscsi*:::scsi-command
{
    @[args[1]->ii_target] = count();
}
or the volume of data transmitted:
iscsi*:::data-send
{
    @ = sum(args[1]->ii_datalen);
}
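The two argument structures can also be combined in a single aggregation. The following is a sketch rather than one of Brendan's scripts: it uses only fields from the conninfo_t and iscsiinfo_t definitions above, the aggregation name @bytes is arbitrary, and it assumes ii_lun is populated for the data-send probe (which I haven't confirmed here).

```
/* Bytes sent, broken down by remote host, target IQN, and LUN. */
iscsi*:::data-send
{
	@bytes[args[0]->ci_remote, args[1]->ii_target, args[1]->ii_lun] =
	    sum(args[1]->ii_datalen);
}
```

Keying on the remote address first keeps the output grouped by consumer, in the same spirit as the cross-protocol conninfo_t script above.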
Brendan has been working on a bunch of iSCSI scripts -- those are great for getting started examining iSCSI.
(2007-07-04 08:35:30.0/2007-07-04 00:41:32.0)
Monday May 28, 2007
This year, Jarod Jenson and I gave an updated version of our DTrace for Java (technology-based applications) talk:
The biggest new feature that we demonstrated is the forthcoming Java Statically-Defined Tracing (JSDT) which will allow developers to embed stable probes in their code as we can do today in the kernel with SDT probes and in C and C++ applications with USDT probes. While you can already trace Java applications (and C and C++ applications), static probes let the developer embed stable and semantically rich points of instrumentation that allow the user to examine the application without needing to understand its implementation. The Java version of this is so new I had literally never seen it until Jarod gave a demonstration during our talk. The basic idea is that you can define a new probe by constructing a USDTProbe instance specifying the provider, function, probe name, and argument signature:
sun.dtrace.USDTProbe myprobe = new sun.dtrace.USDTProbe("myprovider", "myfunc", "myprobe", "ssl");
To fire the probe, you invoke the Call() method on the instance, and pass in the appropriate arguments.
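On the consuming side, a probe defined this way would presumably be enabled like any other USDT probe. The D sketch below is an assumption, not something shown in the talk: it supposes that the JSDT prototype publishes "myprovider" with the process ID appended to the provider name, as USDT providers do, and it deliberately ignores the probe's arguments since the meaning of the "ssl" signature isn't spelled out here.

```
/*
 * Hypothetical: count firings of the example probe per JVM process.
 * Assumes the provider appears as myprovider<pid>, USDT-style.
 */
myprovider*:::myprobe
{
	@fired[pid] = count();
}
```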
Attendance was great, and we talked to a lot of people who had attended last year and had been getting mileage out of DTrace for Java. Next year, we're hoping to give the updated version of this talk on Tuesday (rather than Friday for once) and invite people to bring in their applications for a tune-up; we'll present the results in a case study-focussed talk on Friday.
(2007-05-28 22:39:18.0/2007-05-28 22:36:42.0)
Thursday March 22, 2007
The Texas Ranger himself, Jarod Jenson, has written a nice article about using the new DTrace probes in Java SE 6. If that's up your alley, you should come to the talk Jarod and I will be giving at JavaOne in May. We'll be talking about some of the new features in Java SE 6 and potentially previewing some new features slated for Java SE 7. This will be our third year at JavaOne -- it's great to see how much progress we're making each year.
(2007-03-23 00:59:42.0/2007-03-22 23:59:42.0)
Monday March 19, 2007
Ian Murdock has left the Linux Foundation to lead the operating systems
strategy here at Sun. The last few years have seen some exciting changes
at Sun: releasing Solaris 10 (which includes several truly revolutionary
technologies), embracing x86, leading on x64, and taking Solaris open source.
That a luminary of the Linux world was enticed by the changes we've made
and the technologies we're creating is a huge vote of confidence. From my
(admittedly biased) view, OpenSolaris has been breaking away from the pack with
technologies like DTrace, ZFS, Zones, SMF, FMA and others. I'm looking forward
to the contributions Ian will bring from his experience with Debian, and to
the wake-up call this may be to Linux devotees: OpenSolaris is breaking away.
(2007-03-19 11:33:51.0/2007-03-19 10:24:02.0)
Wednesday January 31, 2007
The other day I posted about a prototype I had created that adds a gzip compression algorithm to ZFS. ZFS already allows administrators to choose to compress filesystems using the LZJB compression algorithm. This prototype introduced a more effective -- albeit more computationally expensive -- alternative based on zlib.
As an arbitrary measure, I used tar(1) to create and expand archives of an ON (Solaris kernel) source tree on ZFS filesystems compressed with lzjb and gzip algorithms as well as on an uncompressed ZFS filesystem for reference:
Thanks for the feedback. I was curious if people would find this interesting and they do. As a result, I've decided to polish this wad up and integrate it into Solaris. I like Robert Milkowski's recommendation of options for different gzip levels, so I'll be implementing that. I'll also upgrade the kernel's version of zlib from 1.1.4 to 1.2.3 (the latest) for some compression performance improvements. I've decided (with some hand-wringing) to succumb to the requests for me to make these code modifications available. This is not production quality. If anything goes wrong it's completely your problem/fault -- don't make me regret this. Without further disclaimer:
pdf
patch
In reply to some of the comments:
UX-admin
One could choose between lzjb for day-to-day use, or bzip2 for heavily compressed, "archival" file systems (as we all know, bzip2 beats the living daylights out of gzip in terms of compression about 95-98% of the time).
It may be that bzip2 is a better algorithm, but we already have (and need) zlib in the kernel, and I'm loath to add another algorithm.
ivanvdb25
Hi, I was just wondering if the gzip compression has been enabled, does it give problems when an ZFS volume is created on an X86 system and afterwards imported on a Sun Sparc?
That isn't a problem. Data can be moved from one architecture to another (and I'll be verifying that before I putback).
dennis
Are there any documents somewhere explaining the hooks of zfs and how to add features like this to zfs? Would be useful for developers who want to add features like filesystem-based encryption to it. Thanks for your great work!
There aren't any documents exactly like that, but there's plenty of documentation in the code itself -- that's how I figured it out, and it wasn't too bad. The ZFS source tour will probably be helpful for figuring out the big picture.
Update 3/22/2007: This work was integrated into build 62 of onnv.
(2007-04-20 16:32:52.0/2007-01-31 22:30:07.0)