Views on software from Bryan Cantrill's deck chair
The Observation Deck
Wednesday Jun 30, 2004
Just got back from the Solaris BOF here at USENIX '04 -- it was great to see so many people there! Hopefully it was useful to you; at the very least, you learned about a beautiful piece of Solaris documentation -- not to mention how I feel about compiler folks. You shouldn't have any trouble finding more information on DTrace, Zones, or Predictive Self-Healing; information on ZFS and the service management facility will follow once they are publicly available (which should be quite soon). Thanks for coming tonight -- and thanks for all of the kind words about the technology in Solaris 10 in general, and about DTrace in particular. It's tremendously rewarding for us to see you so excited about these technologies!
Tuesday Jun 29, 2004
So the DTrace team is currently at USENIX '04, where yesterday we presented our paper on DTrace. The presentation went quite well -- though it's a bit difficult to jam so much content in 25 minutes! The reception to the work was very positive, and even the questions largely praised DTrace. The only wrinkle in the whole operation came with the last question, from an employee of IBM. To paraphrase:
"DTrace seems great, I imagine many people would be interested in this, etc. etc. When are you going to port it to Linux?"
I'm afraid that my answer was probably perceived as a bit politically incorrect; it was roughly:
"Look: we believe in choice; we believe that people should pick the best operating system for the task at hand. We've been busting butt for the last three years on DTrace to make Solaris the best operating system for many different tasks. Enough said..."
To be clear: this is not an attack on Linux. But there is a fundamental disagreement out there: many seem to believe that the operating system is a "commodity" -- that all operating systems are basically the same. We disagree: we believe that the operating system is a nexus of innovation. And we believe that we're proving that with Solaris 10 technologies like DTrace, Zones, ZFS, etc. etc. You certainly don't have to agree with us; we believe that you always have the right to choose Linux and that there may be many great reasons to do so. But please stop asking if we're going to port these features to Linux -- if you want to take advantage of our OS innovation, run Solaris.
Sunday Jun 20, 2004
It normally doesn't -- and it probably shouldn't. But Chris Preimesberger, the author of the story on Solaris 10 that I mentioned on Friday, has apparently read (and commented on) my blog entry -- and he corrected the story! The details of the paragraph are now correct, and I'm certainly happier.
This whole episode has given me insight into the unique tribulations of being a technical reporter. It must take a thick skin to be dealing with people like me all of the time: fast talkers who bombard with technical detail, and then expect absolute accuracy in whatever stories emerge. Of course, from my perspective, the problem is that the readership assumes absolute accuracy -- and if the technical details are incorrect, they will naturally blame the technology (or worse, the technologist) instead of questioning the accuracy of the reportage. Anyway, Chris: thanks for correcting the story; it's much appreciated.
Friday Jun 18, 2004
Earlier, I lamented the fact that a press roundtable on three key technology areas in Solaris 10 (DTrace, Zones and ZFS) had yielded only stories about open source -- a topic which we explicitly didn't talk about. Fortunately, there is now a new story by one of the attendees of the roundtable that focuses on the three technology areas.
And even better, the larger points about DTrace are certainly correct, e.g.:
"DTrace, which uses more than 30,000 data monitoring points in the kernel alone, lets administrators see their entire system in a new way, revealing systemic problems that were previously invisible and fixing performance issues that used to go unresolved."
And the example that the article is trying to cite has an absolute basis in fact -- it's discussed in depth in Section 9 of our upcoming USENIX paper. But that said, the details of the specific example are incredibly wrong. (So wrong, in fact, that they're just odd; what does "a wild-card desktop applet that had somehow gotten channeled into the central system" even mean?) Perhaps the terms used are so opaque that readers will come away confused, but with the right overall impression -- but given that readers at LWN.net went so far as to accuse me of being a pointy-haired boss based on the C++ misquote, I can only imagine what I'll be accused of being now...
Thursday Jun 17, 2004
Several years ago, Salon.com had a contest for the motto for Silicon Valley. Maurice Herlihy [1] won with the slogan "Quality is Job 1.1." Maurice's slogan is certainly clever (and disconcertingly accurate at times), but one of the honorable mentions actually struck me as being truer to Silicon Valley: Eli Neiburger's "God bless the early adopters." If you have ever developed a revolutionary technology -- one that requires people to change the way they think at some level -- you know how unbelievably true this is. For it is the Early Adopter who puts up with tremendous pain to get their hands on a technology, goes through the tedium of constantly communicating the technology's shortcomings to its inventors, endures the slow march towards something usable, and through it all somehow finds the energy to talk enthusiastically about the nascent technology at every opportunity. The Early Adopters are something of a riddle to me, but they're so incredibly important to birthing new technology that I almost view it as uncouth to dissect what makes them tick. So "God bless the early adopter," indeed. There is no better slogan for Silicon Valley; you were robbed, Eli.
I bring all of this up because one of the great DTrace Early Adopters, Jon Haslam, has joined the Sun blogmania. Jon is a canonical Early Adopter in that he remained a terrific advocate for the technology, even when it was in a painfully unfinished state. We sometimes don't understand what makes Jon tick, but DTrace certainly wouldn't be what it is without him; God bless him...
[1] Maurice was actually a professor of mine at school; his course on lock- and wait-free synchronization was one of the highlights of my education. The course was a seminar, and one week the low quality of that week's paper led me to decry the generally woeful state of academic computer science: "Maurice," I whined, "95% of it is crap!" "Bryan," he replied, "95% of everything is crap." I conceded the point...
DTrace developer Adam Leventhal has joined the Sun blogging mayhem. Now if we can only convince Mike to start a blog (a feat that can only be compared to getting B.A. Baracus to fly), Team DTrace will be at maximal blogging power...
Wednesday Jun 16, 2004
So another article showed up covering the same press meeting that I discussed earlier. Again, despite the fact that it was less than one tenth of one percent of the content of the roundtable, the headline is open source. On the one hand, I feel slightly vindicated in that this one at least quoted me a little more accurately. And then there's this:
"But I'm also sure we'll be revisiting a few comments in the code here and there -- I just thought of a particularly disparaging one I might have left in having to do with C++ unions," Cantrill said with a laugh.
This was actually in response to a question -- someone asked (lightheartedly, I had thought) if there were any inappropriate comments in Solaris that would have to be cleaned up. I responded -- indeed with a laugh -- that there were some profane comments I could think of that had to do with unions in C. (Note: C, not C++ -- not that C++ unions don't deserve the same coarse words, only that C++ has bigger problems than just unions.) Needless to say, I was quite surprised to see this off-hand comment show up -- and the reason for the disparaging comment probably deserves a little context. The comment in question is in code that I wrote for a loadable module in mdb, the Solaris modular debugger. This particular code is part of postmortem object type identification, a mechanism for identifying arbitrary memory objects from a system crash dump. I actually wrote a paper on this, and presented it at AADEBUG 2003 in September. Anyway, if you read the paper, you'll see why I was feeling malice towards unions when I wrote the comment: the presence of unions makes type identification needlessly difficult. So that's the explanation for the disparaging comment about unions, for whatever that's worth. And I'm still holding out some hope that we'll see an article on the actual technical content that we presented, and not the open sourcing of Solaris that we refused to talk about...
Thanks to those of you who joined us for the Expert Exchange this morning. Adam and I were typing furiously trying to keep up with all of your questions! In fact, we were typing so furiously that about a third of the way into the session, the "y" key on my laptop stopped working. No capital Y, no lower-case y, no y's whatsoever! Panicky, I tried about four sentences with a y in the X cutbuffer (pasting when I needed a y), but I discovered that -- like many of the letters that you wouldn't pick first on Wheel of Fortune -- y's frequency is still sufficiently high to make living without it nearly impossible. (Three y's in that last clause alone!) So I took out my car keys, pried off my "y" key, took off the hard plastic mechanism underneath it, and discovered that I could still get a y if I prodded the soft plastic underbelly of what used to be my "y" key. I picked up midsentence of the answer I was working on, and (fortunately) I made it through the rest of the chat without losing any other letters...
Anyway, hopefully you got a chance to ask that nagging question you had on DTrace. If you didn't, don't worry -- just head over to the DTrace forum and post it over there. Having said that, it looks like someone just did; off to the forum...
So several of us spoke with analysts and members of the press yesterday on Solaris 10. The idea was that it would be a deep-dive on three of the major technology areas in Solaris 10: DTrace, Zones, and ZFS (a.k.a. the "Dynamic File System"). Of course, at the outset, the press was really only interested in our (pre-)announcements about open sourcing Solaris. We had to spend two minutes at the beginning of the meeting saying yes, we were committed to it, and no, they weren't going to get any additional information out of us. But I guess I shouldn't be too surprised that one of the stories to come out of that roundtable headlined with open source. From the story, you would think that we spent the entire time talking about open source -- in reality, we spent the entire time talking about the three technology areas, and the first several minutes explaining that we explicitly weren't talking about open source. Oh well...
And for the record, I didn't say "technically, it is not a problem to do this", I said "this is not a technical problem." To me, these have different connotations. I am also attributed in that article with "[w]e're engineers and we've written the cleanest code and we can't wait to share it with the world." While this expresses my sentiments accurately, my phraseology got a bit mangled. (For starters, I don't generally speak in run-on sentences!) What I said is more like: "We're engineers; we obviously understand the value of having the source code. We believe that we have some of the cleanest code anywhere, and we're looking forward to showing it to the world." (And hey: we do have some of the cleanest code anywhere -- but somehow I don't think that we're going to see a story about the beautiful ASCII art block comments in /usr/include/sys/dtrace_impl.h...)
Tuesday Jun 15, 2004
Adam Leventhal, Mike Shapiro and I will be on-line tomorrow (Wednesday) at 10a Pacific for an Expert Exchange on DTrace. This is your opportunity to have your questions answered on-line by the people who wrote the code. (Which is to say, us.) Hope to see you (virtually) tomorrow!
Monday Jun 14, 2004
I am an engineer in Solaris Kernel Development here at
But first, some prehistory to let you know where I'm coming from. (And apologies in advance for the length.) Above all else, I believe that software should:
foo(int arg)
{
        if (tracing_enabled)
                trace(FOO_ENTRY, arg);
        ...
This boils down to instructions that look something like this (using a RISC-ish proto instruction set):
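        set     tracing_enabled, %o0    ! address of the flag
        ld      [%o0], %o1              ! load its current value
        cmp     %o1, 0                  ! is tracing off?
        be      go_around               ! yes (the common case): skip the call
        set     FOO_ENTRY, %o0          ! first argument to trace()
        mov     %i0, %o1                ! second argument: arg
        call    trace
        ...
go_around:
        !
        ! The rest of the function foo()...
        !
        ...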
That is, it boils down to a load, a compare and a branch. This slows down execution (loads hurt -- especially if they're not in the cache), causes a branch-taken in the common path, increases instruction cache footprint, etc. Not a big deal if foo() is the only function in which you do this, but start putting this in every function, and you'll find that you have a system that is too slow to ship -- it has suffered the infamous "death of a thousand cuts."
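For anyone who wants to poke at this pattern directly, here is a minimal, compilable sketch of the always-compiled-in check. It is only an illustration under stated assumptions: the body of trace(), the value of FOO_ENTRY, the trace_count counter, and the doubling done by foo() are all hypothetical stand-ins, since the fragment above elides them.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical probe identifier; only the name comes from the fragment above. */
enum { FOO_ENTRY = 1 };

/* Global switch: loaded and tested on every call to foo(), even when off. */
volatile int tracing_enabled = 0;

/* Count of records emitted so far (illustrative bookkeeping only). */
unsigned long trace_count = 0;

/* Stand-in for a real trace facility: log the probe and its argument. */
void
trace(int probe, int arg)
{
        assert(probe == FOO_ENTRY);     /* this sketch defines one probe only */
        trace_count++;
        (void) fprintf(stderr, "probe %d: arg=%d\n", probe, arg);
}

/* The instrumented function; doubling is a placeholder for real work. */
int
foo(int arg)
{
        if (tracing_enabled)            /* the load, compare and branch */
                trace(FOO_ENTRY, arg);

        return (arg * 2);
}
```

With tracing_enabled left at 0, foo() still pays for the check on every call but emits nothing; setting tracing_enabled to 1 makes each subsequent call log one record to stderr. The flag is declared volatile here so that the compiler cannot hoist the load out of the path, which keeps the sketch faithful to the instruction sequence shown above.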
(Yes, if you're lucky or if the compiler supports a hint to indicate that trace() isn't called in the common case, the sense of the branch may change such that the branch will be not-taken in the common case -- which is better, but this still hurts too much to do it everywhere.)
We can't leave this kind of code in our optimized production code, so what do we do? Many do something like this:
#ifdef DEBUG
#define TRACE(tok, arg) if (tracing_enabled) trace((tok), (arg))
#else
#define TRACE(tok, arg)
#endif
This is better -- at least the production version isn't hindered and we still have debug support in a DEBUG version. But now we have a new problem. We now have essentially two versions of our software: the slow one that we can see, and the fast one that we can't. So what do we do when we see a performance problem in production? Well, we might try to run the DEBUG version (or worse, one with custom instrumentation) in production. But that requires downtime, and additional risk -- and usually doesn't fly. So what do we do? We try to reproduce this in development on the DEBUG version that we can see. (This is not "we" as in Sun, by the way, this is "we" as in humanity -- you and me and everyone.)
And reproducing performance problems is bad, bad news: when you're reproducing performance problems, you're reproducing symptoms. (Naturally, because the symptoms are all you've got; if you knew the root-cause you would be watching Knight Rider reruns instead of horsing around at work.) And why is reproducing symptoms such bad news? Because disjoint problems can manifest the same symptoms.
To borrow a medical analogy: let's say that you discover that you're running a fever in production. So you take your development or test environment, and you try to make it look closer and closer to your production environment until you have a fever. Maybe you add more artificial load, more hardware, more users, whatever. Finally, you see the fever in your development environment.
You get all of the developers in the room, and they start throwing instrumented binaries on the development machine. Maybe you think you've got an OS issue, so you have Solaris engineers throwing on new kernels -- or maybe you have your ISVs giving you instrumented binaries of their products. Finally, after a huge amount of time and escalation and more time and frustration, you discover the problem: the fever is due to influenza. Okay, this isn't the end of the world: if the production environment stays off its feet, drinks fluids and gets some rest for the next few weeks, it should be fine. But here's the problem: it was influenza in the development environment -- that much was correct. But it's not influenza in the production environment. In the production environment, the problem was cerebral malaria. No amount of rest is going to help -- our diagnosis is completely wrong. [2]
It may strike you as a glib analogy, but it's an accurate one for the experiences of many. Just think: how many times have you found "a" problem without finding "the" problem?
Okay, so we're clearly down a blind alley of sorts. We actually need to start over with a clean sheet of paper -- we need to change the model. We need to be able to ship an optimized kernel and optimized apps, and when we want to be able to see the software, we need to be able to dynamically instrument it to answer our question. And when we're done answering our question, we want the system to go back to being completely optimized. And we want to do all this in production environments. This means that it must be absolutely safe -- there must be no way to crash the system through user error.
And this, in essence, is what we've done with DTrace. DTrace is a new facility in Solaris 10 for the dynamic instrumentation of production systems. DTrace is available today via Solaris Express. It has been available since November, and many people have already used it to diagnose real problems. You can read some of their thoughts in the DTrace feature story that ran on sun.com late last month.
[1] I would have hyperlinked to Mike and Adam's blogs, but they don't (yet) exist. I would expect Adam to have a blog shortly, but given that Mike doesn't yet have a cell phone, it might be a longer wait. Then again, Mike bought a TiVo the first weekend they were on sale at the Palo Alto Fry's back in 1998 -- so you never know when he's going to adopt a technology.
[2] Lest you think medical science has figured this one out: I encourage you to contract cerebral malaria and present at your local emergency department -- and observe just how many weeks you spend bouncing around the health care system before some clever doc finally cracks the case...