|
In case you missed it, Matt Ahrens has a great post talking about ZFS. It's nice to see this conversation finally beginning; we have been itching to see this out in the public for far too long. Every step we take (Solaris Express, blogs.sun.com, OpenSolaris) brings us closer to our customers and the Solaris community. I'm looking forward to more from the ZFS team (and I promise not to steal their thunder).
I've talked a lot in the past about kernel debugging, usually comparing the Solaris tools against what's available in Linux and other Operating Systems. I thought that I'd walk through a concrete example of kernel debugging, first with our debugger KMDB (which replaced the venerable kadb in build 61 of Solaris 10), and then with the available Linux kernel debuggers (KDB and KGDB). I'll also examine port-mortem analysis tools for Linux.
For today's lesson, I've picked a problem where a port-mortem or live kernel debugger is absolutely necessary. Those of you who have done multithreaded programming in the past have most likely had to deal with a deadlock here or there. In your garden variety deadlock, one thread grabs lock A and then lock B, while another thread grabs lock B and then lock A. If they both get their first lock but not their second, they'll both be blocked waiting for locks which will never be released. I've decided to spice it up a bit by looking at reader-writer locks. With an rwlock, you can have any number of readers, but only one writer. This makes problems much more difficult to debug. With a mutex, the current owner is stored as part of the data structure. But with a rwlock, the only indication you have is the number of active readers.
In our fantasy land, we have the following situation: commands are getting stuck in the kernel. Forget kill -9; if you're stuck waiting on a kernel synchronization primitive (wihtout accepting interruption by signals), you're in trouble. The timeline for our threads looks something like this:
| thread 1 | thread 2 |
rw_enter(rwlock, RW_READER)
<stuck somewhere> |
rw_enter(rwlock, RW_WRITER)
<blocked> |
Loading KMDB
Something's ill on our system. We know that our 'ps' command is hanging for some reason. So we enter KMDB:
# mdb -K
Loaded modules: [ ufs unix krtld nca lofs genunix ip usba specfs nfs random sctp ]
[0]>
KMDB is ready to be loaded on every system running Solaris 10 (build 61 or later). You can also boot -k which will load kmdb in the background. If you do this, you can send a break (via a tip line or L1/F1-A) to get into KMDB.
Find our troubled thread
In this example, we know that 'ps' is hanging. Finding the troubled stack is easy:
[0]> ::pgrep ps | ::walk thread | ::findstack
rw_enter_sleep+0x144()
as_pageunlock+0x48()
default_physio+0x340()
pread+0x244()
syscall_trap+0x88()
Get the address of the rwlock
This takes a little bit of work because we don't have the necessary debugging information in the kernel to restore arguments from arbitrary positions within a function. In this example we're at rw_enter_sleep+0x144. Using ::findstack -v or $C we can get the frame pointer for each frame. At this point, we have to examine the disassembly around each call site to find out where our first parameter came from. We've been kicking around some ideas to do this automatically, but I'll owe Matt another nickel if I mention it out loud... At the moment, you need to do something similar to the following (on SPARC - x86 is similar):
[0] > as_pageunlock+48::dis
as_pageunlock+0x20: mov 3, %i4
as_pageunlock+0x24: call -0x2318
as_pageunlock+0x28: restore
as_pageunlock+0x2c: add %l5, 0x30, %l0
as_pageunlock+0x30: sethi %hi(0x1069400), %i5
as_pageunlock+0x34: sethi %hi(0x1069400), %i1
as_pageunlock+0x38: ldx [%i5 + 0x1e8], %l6
as_pageunlock+0x3c: mov %l0, %o0
as_pageunlock+0x40: ldx [%i1 + 0x1e0], %i0
as_pageunlock+0x44: add %i2, %i3, %i3
as_pageunlock+0x48: call -0x143754
as_pageunlock+0x4c: mov 1, %o1
as_pageunlock+0x50: mov %l5, %o0
as_pageunlock+0x54: mov %i2, %o1
as_pageunlock+0x58: call -0x2cdc
as_pageunlock+0x5c: clr %o2
as_pageunlock+0x60: add %i3, %i0, %l7
as_pageunlock+0x64: and %i2, %l6, %l3
as_pageunlock+0x68: ldx [%o0 + 0x38], %i2
as_pageunlock+0x6c: and %l7, %l6, %l4
as_pageunlock+0x70: mov %l3, %o1
[0]> 000002a101344ee1+7ff::print struct frame
{
fr_local = [ 0x3000eda5358, 0, 0x1, 0xfd1ff, 0x30018c3dec8, 0xb, 0x8, 0x1 ]
fr_arg = [ 0x3df8515a500, 0xb, 0x2, 0x300068202d8, 0x3000eda5358, 0x1 ]
fr_savfp = 0x2a101344f91
fr_savpc = 0x1186134
fr_argd = [ 0, 0x810, 0x102, 0, 0x3001503a1a0, 0x2a101345930 ]
fr_argx = [ 0x300021541d8 ]
}
[0]> 0x3000eda5358::rwlock
ADDR OWNER/COUNT FLAGS WAITERS
3000eda5358 READERS=1 B011 30018878f20 (R)
|| 30018c86000 (R)
WRITE_WANTED -------+| 30018c86300 (R)
HAS_WAITERS --------+ 30018c86600 (R)
30018c86900 (R)
30018c86f00 (R)
30018c87200 (R)
[0]>
Find who owns the reader lock
Here's another place where things get tough. Since we don't record readers as part of the rwlock_t (only the count), we can't just look at memory and see who owns this. We could just dump the stack of every thread and try to guess who's the owner, but on a large server this quickly becomes impractical. This is where you need ::kgrep, which will find references to a given pointer within the kernel. From there, we pipe it to ::whatis (or more recently, ::whattype), which can tell us in which stacks the value is referenced. This dramatically cuts down our list of choices. After some guess-and-check (but significantly less than looking at every thread in the system), we find the real culprit of the crime.
[0]> ::kgrep 30006820000 | ::whatis
2a100527730 is in thread 30013e04920's stack
2a100527770 is in thread 30013e04920's stack
2a100527780 is in thread 30013e04920's stack
2a100ecf730 is in thread 3000f962300's stack
2a100ecf770 is in thread 3000f962300's stack
2a100ecf780 is in thread 3000f962300's stack
2a101171730 is in thread 3001506db60's stack
2a101171770 is in thread 3001506db60's stack
[...]
[0]> 30013e05220::findstack
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
000002a1011a0a51 biowait+0x60()
000002a1011a0b01 ufs_getpage_miss+0x2f0()
000002a1011a0c01 ufs_getpage+0x6a8()
000002a1011a0d61 fop_getpage+0x48()
000002a1011a0e31 segvn_fault+0x834()
000002a1011a0fe1 as_fault+0x4a0()
000002a1011a10f1 pagefault+0xac()
000002a1011a11b1 trap+0xc14()
000002a1011a12f1 utl0+0x4c()
[0]>
Looking at this now, it makes me want a new MDB command, ::whatthread, which would find all threads with stacks containing a given pointer. With MDB's modular design and stable API, it's easy to connect ::kgrep and ::whatis into a self-contained command. But that's a RFE for another day.
Figure out what our culprit is doing
From here, you're pretty much own your own. In the case outlined about, we're stucking waiting for I/O while we have a read lock on the address space. The gory details are beyond a single blog post, but what's been covered is more than enough for a crash-course in kernel debugging. The best part about this bug is that you could never figure this out with custom instrumentation or 'classic' debugging techniques. This is a race condition of the worst kind - all the printf()s in the world wouldn't have helped you here. And if you need to reboot in order to load your custom kernel, you'll probably never see the problem again.
Post mortem debugging and Linux
On Solaris, the post-mortem case is exactly the same as the live KMDB case. Simply reboot -d (or initiate a sync from KMDB or the OK prompt if it's really hung) and then run mdb on the resulting crash dump. You can also examine the kernel while the system is still running with mdb -k, but some commands (such as ::kgrep) will fail because memory is actively changing.
In upcoming posts, I'll wade through documentation on linux kernel debuggers, trying to figure out how they would solve this same problem. Note that this could be "solved" by wading through the stack traces of every thread on the system, and picking out the troublemaker thread by eye. This requires intimate knowledge of the source code (all of it), is prone to error, and extremely tedious. It's painful on a simple desktop, and quickly becomes impossible for large servers with thousands of threads. Also, keep in mind this is just a demonstration of kernel debugging - the worst bugs are much more complicated.
In a departure from all my previous blog posts, I thought I'd try my hand at a personal entry. Yesterday, the Olympic Track and Field competitions began, with the U.S. taking Silver in the Men's shot put. What was supposed to be a sweep ended up in disaster: Cantwell never qualified at the trials, Godina fouled his first two and didn't make it to the finals, and Nelson fouled all but his first throw. If you're wondering how I know all of this, it's becuase I'm a track nut. I've been running track and field since sophmore year of high school, and by this point know almost all the men's world records, and a fair number of the women's.
Back in high school, I ran everything; most notably the 110 hurdles, 300 hurdles, and triple jump. In college I was a walk-on, and focused solely on the triple jump (with a few 4x400 legs sprinkled here and there). The Olympic triple jump preliminaries start tomorrow. Despite the craziness at the U.S. Olympic trials, I'd put my money on Christian Olsson, with Jadel Gregario and Kenta Bell rounding out the top three.
Back in college, I was a fairly mediocre Division-I athlete, managing to jump 14.61 (47'11.5") at the Ivy League championships my sophmore year. In contrast, Olsson's best is 17.83 (58'6") and the world record is a whopping 18.29 (60'0"). When you have a moment, try measuring out 20 yards and imagine traversing the distance in 3 steps.
For your amusement, I'll leave you with some track pictures, courtesy of Dan Grossman, a great friend of Brown Track and Field. Some of these are certainly less flattering than others, thanks to the speedsuit and my unique facial expressions.
Harvard 2001
Sean Thomas and myself
Possibly one of my two good jumps sophmore year
Warming up at Heps 2001
Me not getting my feet out in front of me
More jumping at Yale
Indoor Heps 2002 (with bleached hair)
More indoor heps
Ridiculously cold meet at Harvard
Heps champsionships @Navy 2002
A very bad jump at Heps
By this point, I hope you've enjoyed my humiliation. Next post I'll get back to some real issues, including kernel debugging and the joys of KMDB.
I just ran across this interview with Linus. If you've read my blog, you know I've talked before about OS innovation, particularly with regards to Linux and Solaris. So I found this particular Linus quote very interesting:
There's innovation in Linux. There are some really good technical features that I'm proud of. There are capabilities in Linux that aren't in other operating systems. A lot of them are about performance. They're internal ways of doing things in a very efficient manner. In a kernel, you're trying to hide the hard work from the application, rather than exposing the complexity.
This seems to fly in the face of his previous statements, where he claimed that there's no innovation left at the OS level. Some people have commented on my previous blog entry that Linux cannot innovate simply because of its size: with so many hardware platforms to support, it becomes impossible to integrate large projects. What you're left with is "technical features" to improve performance, which (in general) are nothing more than algorithm refinement and clever tricks1.
I also don't buy the "supporting lots of platforms means no chance for innovation" stance. Most of the Solaris 10 innovations (Zones, Greenline, Least Privileges) are completely hardware independent. Those projects which are hardware dependent develop a platform-neutral infrastructure, and provide modules only for those platforms which they suppot (FMA being a good example of this). Testing is hard. It's clear that Linux testing often amounts to little more than "it compiles". Large projects will always break things; even with our stringent testing requirements in Solaris, there are always plenty of "follow on" bugfixes to any project. If Linux (or Linus) is afraid of radical change, then there will be no chance for innnovation2.
As we look towards OpenSolaris, I'm intrigued by this comment by Linus:
I am a dictator, but it's the right kind of dictatorship. I can't really do anything that screws people over. The benevolence is built in. I can't be nasty. If my baser instincts took hold, they wouldn't trust me, and they wouldn't work with me anymore. I'm not so much a leader, I'm more of a shepherd. Now all the kernel developers will read that and say, "He's comparing us to sheep." It's more like herding cats.
Stephen has previously questioned the importance of a single leader for an open source project. In the course of development for OpenSolaris, we've looked at many open source models, ranging from the "benevolent dictatorship" to the community model (and perhaps briefly, the pornocracy). I, for one, question Linus' scalability. Too many projects like crash dumps, kernel debugging, and dynamic tracing have been rejected by Linus largely because Linux debugging is "as good as it needs to be." Crash dumps and kernel tracing are "vendor features," because real kernel developers should understand the code well enough to debug a problem from a stack trace and register dump. I love the Solaris debugging tools, but I know for a fact that there are huge gaps in our capabilities, some of which I hope to fill in upcoming releases. I would never think of Solaris debugging as perfect, or beyond reproach.
So it's time for Space Ghosts's "something.... to think about....":
Is it beneficial to have one man, or one corporation, have the final say in the development of an open source project?
1ZFS is an example of truly innovative performance improvements. Once you chuck the volume manager, all sorts of avenues open up that nobody's ever though about before.
2To be clear, I'm not an ardently anti-Linux. But I do believe that technology should speak for itself, irregardless of philosophical or "religious" issues. Obviously, I'm part of the Solaris kernel group, and I believe that Solaris 10 has significant innovations that no other OS has. But I also have faith in open source and the open development model. If you know of Linux features that are truly innovative (and part of the core operating system), please leave a comment and I'll gladly acknowledge them.
In the footnote a few days ago, I commented on the fact that the history of Solaris debugging could rougly be divded into three 'eras'. As someone interested in UNIX history, I decided to dig through the Solaris archives and put together a chronology of Solaris debuggability and observability tools. For fun, I divided it into eras to parallel Earth's history. And I swear I'm not out to make anyone feel like a dinosaur (or a prokaryote, for that matter).
I've only been around for one of these "dawn of a new era" arrivals, DTrace1. When one of these revolutionary tools arrive, it's amazing to see how quickly engineers avoid their own past. Try asking Bryan to debug a performance problem on Solaris 9, and you'll probably get some choice phrases politely explaining that while he appreciates the importance of your problem, he would rather throw himself down a slide of broken glass and into a vat of rubbing alcohol. Being the neophyte that I am, I've only ventured into the 'Paleozoic era' on one occasion. After an MDB session on a Solaris 8 crashdump (paraphrased slightly):
$ mdb 0
> ::print
mdb: invalid command '::print': unknown dcmd name
> ::help print
mdb: unknown command: print
> ::please print
mdb: invalid command'::please': unknown dcmd name
> ::ihateyou
$
I quickly ran away screaming, never to return. I think I ended up hiding in a corner of my office for two hours, cradling my DTrace answerbook and whispering "there's no place like home" over and over. I'm still a spoiled brat, but at least I have respect and admiration for those Solaris veterans who crawled through debugging hell so that I could live a comfortable life2. It's also made me feel sorry for the Linux (and Windows) developers out there. Not in the Nelson Muntz "Ha ha! You don't have DTrace!" sense. More like "Poor little guy. It's not his fault his species never evolved opposable thumbs." There are a lot of brilliant Linux developers out there, stuck in a movement that doesn't embrace debugging or observability as fundamental goals. But this post is supposed to be about history, not Linux. So without further ado, my brief history of Solaris (soon to be available in refrigerator magnet form):
| <1989 | HADEAN |
SunOS 4.X |
| |
adb, ptrace, crash |
| 1990 | ARCHAEAN |
SVr4 merge |
| |
/proc |
| |
truss(1) |
| 1991 | |
vtrace |
| |
vmstat(1M) |
| |
iostat(1M) |
| 1992 | |
SOLARIS 2.0 |
| 1993 | |
mpstat |
| |
SOLARIS 2.2 |
| 1994 | |
Kernel slab allocator |
| |
TNF |
| |
basic ptools |
| |
SOLARIS 2.4 |
| 1995 | |
|
| 1996 | |
SOLARIS 2.5.1 |
| PROTEROZOIC |
Next generation /proc |
| |
Userland watchpoints |
| 1997 | |
lockstat(1M) |
| |
pkill and pgrep |
| |
libproc |
| 1998 | |
savecore on by default |
| |
SOLARIS 7 |
| 1999 | |
libproc for corefiles |
| |
coreadm(1M) |
| |
prstat(1M) |
| |
lockstat kernel profiling |
| PALEOZOIC |
MDB(1) |
| |
::findleaks |
| 2000 | |
SOLARIS 8 |
| |
EOL of crash(1M) |
| 2001 | |
live process control for MDB |
| |
EOL of adb(1) |
| |
pargs and preap |
| MESOZOIC |
kernel CTF data |
| |
trapstat(1M) |
| 2002 | |
SOLARIS 9 |
| |
libumem(3LIB) and umem_debug(3MALLOC) |
| |
::typegraph for mdb(1) |
| 2003 | |
Userland CTF |
| |
coreadm(1M) content control |
| CENOZOIC |
DTrace(1M) |
| |
intrstat(1M) |
| 2004 | |
DTrace pid provider for x86 |
| |
pfiles with pathnames |
| |
DTrace sched, proc providers |
| |
CTF for core libraries |
| |
DTrace I/O provider |
| |
KMDB(1) |
| |
DTrace MIB, fpuinfo providers |
| |
Per-thread ptools |
These are my choices based on SCCS histories and putback logs. Obviously, I've failed to include some things. Leave a comment or email if you think something's not getting the recognition it deserves (keeping in mind this is a blog post, not a book).
1 I actually started exactly one day before DTrace integrated. But I had some experience (albeit limited) as an intern the previous year.
2 In all seriousness, it's not that I don't have to ever debug anything, or that the problems we have today are somehow orders of magnitude simpler than those in the past. What these tools provide is a dramatic reduction in time to root-cause. You still need the same inquisitive and logical mind to debug hard problems, its just that good tools let you form questions and get answers faster than you could before. Really good tools (like DTrace) let you ask the previously unanswerable questions. You may have been able to debug the problem before, but you would have ended up running around in circles trying to get data that's now immediately available thanks to DTrace.
I just saw this article the other day, and was amazed that someone could so clearly miss the point of OpenSolaris and Sun's business strategy. In the article, the (apparently anonymous) author tries to argue that Jonathan and Adam are on two opposing sides of an internal corporate struggle between the executivess and the engineers over business strategy. He begins with a gross misinterpretation of Jonathan's blogs:
As far as I can dissect Schwartz's argument, it's that IBM made a big mistake by building features on top of open source, because that removes any possible hardware lockin, and lets customers move to less expensive hardware that they can buy from any old vendor.
The point of Jonathan's argument (whether or not you believe it is up to you) is not about hardware at all. The point is that because IBM relies completely on Linux (and doesn't have a distribution of its own), they cannot offer the same integrated solution that others can. If you want to run Linux on IBM hardware, you'll most likely be running RedHat. But if RedHat provides an integrated application server, why would you turn around and buy WebSphere from IBM? Jonathan also discusses software lockin when moving between Linux distributions, but this also has nothing to do with hardware. Yet the author continues to focus on "hardware lockin," twisting Adam's blog entry into this lovely conclusion:
If Solaris is open-sourced, then Sun is about to undermine its own hardware lockin.
For historical reasons, Sun has a reputation of big iron and proprietary UNIX, and with that comes hardware lockin. But if you look at Solaris and the rest of our software stack, you'll quickly see this is no longer true. Whether or not you believe in Sun's hardware prices (think x86 and Opteron), you can't ignore the fact that Solaris runs on commodity x86 hardware. And even if you don't want Solaris, Java Desktop System runs on Linux, and Java Enterprise System runs on Windows and HP-UX. To top it off, they all run on open standards; you can move your apps to non-Sun systems (as long as they are standards compliant) without worry. So where's the hardware lockin? Sun's message is that we sell systems. Maybe we're not quite there yet, but eventually you'll be able to get an integrated software stack (JES or JDS) on whatever hardware platform and Operating System you choose. If you want to run Solaris, great. If you want to get hardware from us, great. If not, we'll make damn sure our software all works together as best as possible.
The author does, however, get one thing right: OpenSolaris will enable ports to new platforms. Ask a kernel engineer or ask a VP, they will both tell you this is a good thing. If our customers want to run on POWER, and OpenSolaris enables this port to happen, then we can bring the weight of our entire software portfolio onto the platform, and do it better than anyone else. Go buy your hardware from IBM, but we'll give you a complete software solution with Java Enterprise System for less than IBM. As Adam points out, OpenSolaris can only help Sun. We love choice, and choice is good.
I've been missing almost a week, mostly because of my involvement with the amd64 bringup effort. A while ago, I was recruited to get the ptools and mdb up and running in 64-bit mode. This certainly made me appreciate some of the old war stories - all the Solaris veterans have their favorite bug that they debugged using only hex dumps, a pocket knife, and a ball of string. Over time, you start taking the Solaris debugging tools for granted: try going back to Solaris 9 after spending a year with DTrace1. I was in for quite a shock when I learned that my spoiled lifestyle wasn't going to cut it in the jungles of amd64.
It's no secret that we've had the amd64 kernel up and running for a while now. Thankfully, I was not part of the initial bringup effort. Back when I joined, the kernel was already booting multiuser, and I never had to lay my finger on a simulator or diagnose a double fault. 64 bit applications would load and run (thanks in part to a certain linker alien2), but debugging them was basically impossible: no truss, no mdb, no pstack. So where do you begin?
Thankfully, we've had a 64-bit OS for years, and most of the infrastructure was already working. All our tools worked with 64-bit ELF files out of the box, for example. But a lot of things were still broken. I ended up along roughly the following path:
pstack on corefiles
So pstack segfaulted the first time I ran it. At this point I could run elfdump on the corefile, but not much else. The first task was getting pstack to run on corefiles, so I at least knew where to begin inserting my printf() statements. Walking a stack on amd64 can be a tricky thing - so I begin with a simple version that works 99% of the time.
mdb for corefiles
The next step was to get MDB chewing on these corefiles. A stacktrace is all well and good, but we need to be able to exmine registers and memory. This turned out to be quite a bit of work; mdb is quite a heavy consumer of libproc, and uses some little-used interfaces in libc (in particular, getcontext(2) and makecontext(3c) were annoying). But with a lot of printfs, a few fixes and a few hacks, we had post mortem debugging.
truss
Sadly, I can't take credit for this one. This turned out to be just a bug in fork(2), and once that was fixed, truss worked flawlessly.
mdb for live processes
This was not too difficult thanks to the magic of libproc, which allows us to manipulate live processes and corefiles through the same interface. A few minor tweaks were needed here and there, and some of the finer bugs have yet to be fixed, but it's basically working. Most of the ISA specific actions (such as setting breakpoints) are the same on ia32 and amd64.
agent LWP and pfiles
Finally, I had to get Psyscall (the libproc internal function that executes a system call in the context of a target process) working. This was particularly annoying, mostly because the code was poorly structured - rather than having separate ISA specific actions in different files, we had tons of #ifdefs scattered throughout the code. A large part of this was just ripping apart the code and restructuring it in a way that made porting easier. Someday when someone ports Solaris to run on Adam's laptop, they'll appreciate it.
In a testament to the portability of Solaris, there were no large infrastructure changes outside of Psyscall. Basically, I just fixed one small bug after another. So all the debugging tools are now up and running, and with Bryan and Matt helping, we have DTrace and KMDB as well. So now I can go back to a pampered life in my Hollywood Hills mansion; surrounded by DTrace, MDB, and a few of my closest ptools.
1 Solaris debugging can be roughly divided into three eras: pre-mdb (Paleozoic), pre-DTrace (Mesozoic), and modern day (Cenozoic). The arrival of CTF data could be seen end of the Triassic period and beginning of the Jurassic, while KMDB may begin the Pleistocene (a.k.a. modern) era. Sounds like an interesting science project...
2 There were many others involved in getting the kernel this far. But Mike's the only one with a blog, so he gets all the credit.
The other day on vacation, I ran across a Slashdot article on UNIX branding and GNU/Linux. Tne original article was mildy interesting, to the point where I actually bothered to read the comments. Now, long ago I learned that 99% of Slashdot comments are worthless. Very rarely to you find thoughtful and objective comments; even browsing at +5 can be hazardous to your health. Even so, I managed to find this comment, which contained in part some perspective relating to my previous post on Linux innovation:
I have been saying that for several years now. UNIX is all but dead. The only commercial UNIX likely to still be arround in ten years time as an ongoing product is OS/X. Solaris will have long since joined IRIX, Digital UNIX and VMS as O/S you can still buy and occasionaly see a minor upgrade for it.
There is a basic set of core functions that O/S do and this has not changed in principle for over a decade. Log based file systems, threads that work etc are now standard, but none of this was new ten years ago.
The interesting stuff all takes place either above or below the O/S layer. .NET, J2EE etc are where interesting stuff is happening.
Clearly, this person is not the sharpest tool in the shed when in comes to Operating Systems. But it begs the question: How widespread is this point of view? We love innovation, and it shows in Solaris 10. We have yet to demo Solaris 10 to a customer without them being completely blown away by at least one of the many features. DTrace, Zones, FMA, SMF, and ZFS are but a few reasons why Solaris won't have "joined IRIX, Digital UNIX, and VMS" in a few years.
Part of this is simply that people have yet to experience real OS innovation such as that found in Solaris 10. But more likely this is just a fundamental disconnection between OS developers and end users. If I managed to get my mom and dad running Java Desktop System on Solaris 10, they wouldn't never know what DTrace, Zones, or ZFS is, simply because it's not visible to the end user. But this doesn't mean that it isn't worthwhile innovation: as with all layered software, our innovations directly influence our immediate consumers. Solaris 10 is wildly popular among our customers, who are mostly admins and developers, with some "power users". Even though these people are a quantitatively small portion of our user base, they are arguably the most important. OS innovation directly influences the quality of life and efficiency of developers and admins, which has a cascading effect on the rest of the software stack.
This cascading influence tends to be ignores in arguments over the commoditization of the OS. If you stand at any given layer, you can make a reasonable argument that the software two layers beneath you has become a commodity. JVM developers will argue that hardware is a commodity, while J2EE developers will argue that the OS is a commodity. Even if you're out surfing the web and use a web service developed on J2EE, you're implicitly relying on innovation that has its roots at the OS. Of course, the further you go from the OS the less prominent the influence is, but its still there.
So think twice before declaring the OS irrelevant. Even if you don't use features directly provided by the OS, your quality of life has been improved by having them available to those that do use them.
More Solaris engineers are joining the blogging community every day. Recently we've been joined by Resource Management guru Andrei Dorofeev, Service Management Facility guru Liane Praza, and ZFS guru Matt Ahrens. Be sure to check out their blogs, as well as the other blogs listed at right; you're bound to find some cool stuff and interesting discussions.
Update: Dilpreet Bindra became the first member of team FMA (a.k.a. Predictive Self Healing) to join the blogging effort.
|
|