Reflections on OS integration Eric Schrock's Weblog
Musings about Fishworks, Operating Systems, and the software that runs on them.

Friday Sep 24, 2004

Normally, I'm hesitant to engage people on the internet in an argument. Typically, you get about one or two useful responses before it descends into a complete flame war. So I thought I'd take my one remaining useful response and comment on this blog which is a rebuttal to my previous post, and accuses me of being a "single Sun misinformed developer".

To state that the Linux kernel developers don't believe in those good, sound engineering values, is pretty disingenuous ... These [sic] is pretty much a baseless argument just wanting to happen, and as he doesn't point out anything specific, I'll ignore it for now.

Sigh. It's one thing to believe in sound engineering values, and quite another to develop them as an integral part of your OS . I'm not saying the Linux doesn't care about these things at all, just that they're just not a high priority. The original goal of my post was not "our technology is better than yours," only that we have different priorities. But if you want a technology comparison, here are some Solaris examples:

  • Reliability - Reliability is more than just "we're more stable than Windows." We need to be reliable in the face of hardware failure and service failure. If I get an uncorrectable error on a user process page, predictive self healing can re-start the service without rebooting the machine and without risking memory corruption. Fault Management Architecture can offline CPUs in reponse to hardware errors and retire pages based on the frequency of correctable errors. ZFS provides complete end-to-end checksums, capable of detecting phantom writes and firmware bugs, and automatically repair bad data without affecting the application. The service management facility can ensure that transient application failures do not result in a loss of availability.
  • Serviceability - When things go wrong (and trust me, they will go wrong), we need to be able to solve the problem in as little time as possible with the lowest cost to the customer and Sun. If the kernel crashes, we get a concise file that customers can send to support without having to reproduce the problem on an instrumented kernel or instruct support how to recreate my production environment. With the fault management architecture, an administrator can walk up to any Solaris machine, type a single command, and see a history of all faulty components in the system, when and how they were repaired, and the severity of the problems. All hardware failures are linked to an online knowledge base with recommended repair procedures and best practices. With ZFS, disks exhibiting questionable data integrity can automatically be removed from storage pools without interruption of normal service to prevent outright failure. Dynamic reconfiguration allows entire CPU boards can be removed from the system without rebooting.
  • Observability - DTrace allows real-world administrators (not kernel developers) to see exactly what is happening on their system, tracing arbitrary data from user applications and the kernel, aggregating it and coordinating with disjoint events. With kmdb, developers can examine the static state of the kernel, step through kernel functions, and modify kernel memory. Commands like trapstat provide hardware trap statistics, and CPU event counters can be used to gather hardware-assisted profiling data via libcpc.
  • Resource management - With Solaris resource management, users can control memory and CPU shares, IPC tunables, and a variety of other constraints on a per-process basis. Processes can be grouped into tasks to allow easy management of a class of applications. Zones allow a system to be partitioned and administrated from a central location, dividing the same physical resources amongst OS-like instances. With process rights management, users can be given individual privileges to manage privileged resources without having to have full root access.

That's just a few features of Solaris off the top of my head. There are Linux projects out there approaching some semblance of these features, but I'd contend that none of them is as polished and comprehensive as those found in Solaris, and most probably live as patches that have not made their way into few mainstream distributions (RedHat), despite years of development. This is simply because these ideas are not given the highest priority, which is a judgement call by the community and perfectly reasonable.

Crash dumps. The main reason this option has been rejected is the lack of a real, working implementation. But this is being fixed. Look at the kexec based crashdump patch that is now in the latest -mm kernel tree. That is the way to go with regards to crash dumps, and is showing real promise. Eventually that feature will make it into the main kernel tree.

I blamed crash dumps on Linus. You blame their failure on poor implementation. Whatever your explanation, the fact remains that this project was started back in 1997-99. Forget kernel development - this is our absolute #1 priority when it comes to serviceability. It has taken the Linux community seven years to get something that is "showing real promise" and still not in the main kernel tree. Not to mention the post-mortem tools are extremely basic (30 different commands, compared with the 700+ available in mdb).

Kernel debuggers. Ah, a fun one. I'll leave this one alone only to state that I have never needed to use one, in all the years of my kernel development. But I know other developers who swear by them. So, to each their own. For hardware bringup, they are essential. But for the rest of the kernel community, they would be extra baggage.

Yes, kernel debuggers are needed in a relatively small number of situations. But in these situations, they're absolutely essential. Also, just because you haven't used one yet doesn't mean it isn't necessary. All bugs can be solved simply by looking at the source code long enough. But that doesn't mean it's practical. The claim that it's "extra baggage" is bizarre. Are you worried about additional source code? Binary bloat? How can a kernel be extra baggage if I don't use it?

Tracing frameworks. Hm, then what do you call the kprobes code that is in the mainline kernel tree right now? :) This suffered the same issue that the crash dump code suffered, it wasn't in a good enough state to merge, so it took a long time to get there. But it's there now, so no one can argue it again.

Yes, the kprobes code that was just accepted into the mainline branch only a week and a half ago (that must be why I'm so misinformed). KProbes seems like a good first step, but it needs to be tied into a framework that administrators can actually use. LTT is a good beginning, but it's been under development for five years and still isn't integrated into the main tree. It's quite obvious that the linux kernel maintainers don't perceive tracing as anything more than a semi-useful debugging tool. There's more to a tracing framework than just KProbes - any of our avid DTrace customers (administrators) are living proof of this falsehood.

We (kernel developers) do not have to accept any feature that we deem is not properly implemented, just because some customer or manager tells us we have to have it. In order to get your stuff into the kernel, you must first tell the community why it is necessary, and so many people often forget this. Tell us why we really need to add this new feature to the kernel, and ensure us that you will stick around to maintain it over time.

First of all, every feature in Solaris 10 was conceived by Solaris kernel developers based on a decade of interactions with real customers solving real problems. We're not just a bunch of monkeys out to do management's bidding. Second of all, you don't implement a feature "just because some customer" wants it? What better reason could you possibly have? Perhaps you're thinking that because some customer really wants something, we just integrate whatever junk we can come up with in a months time. If this were true, don't you think you'd see Kprobes in Solaris instead of DTrace?

First off, this [binary compatibility] is an argument that no user cares about.

We have customers paying tens of millions of dollars precisely because we claim backwards compatibility. This is an example of where Linux is just competing in a different space than Solaris, hence the different priorities. If your customer is a 25 year old linux advocate managing 10 servers for the University, then you're probably right. But if your customer is a 200,000 employee company with tens of thousands of servers and hundreds of millions of dollars riding on their applications, then you're dead wrong..

The arguments he makes for binary incompatibility are all ones I've heard before. Yes, not having binary compatibility makes it easier to change a kernel interface. But it's not exactly rocket science to maintain an evolving interface with backwards compatibility. Yes, you can rewrite interfaces. But you can do so in a well defined and well documented manner. How hard is it to have a stable DDI for all of 2.4, without worrying that 2.4.21 is incompatible with 2.4.22-ac? As far as I'm concerned, changing compiler options that break binary compatiblity is a bug. Fix your interfaces so they don't depend on structures that change definition at the drop of a hat. Binary compatibility can be a pain, but it's very important to a lot of our customers.

And when Sun realizes the error of their ways, we'll still be here making Linux into one of the most stable, feature rich, and widely used operating systems in the world.

For some reason, all Linux advocates have an "us or them" philosophy. In the end, we have exactly what I said at the beginning of my first post. Solaris and Linux have different goals and different philosophies. Solaris is better at many things. Linux is better at many things. For our customers, for our business, Linux simply isn't moving in the right direction for us. There's only so many "we're getting better" comments about Linux that I can take: the proof is in the pudding. The Linux community at large just isn't motivated to accomplish the same goals we have in Solaris, which is perfectly fine. I like Linux; it has many admirable qualities (great hardware support, for example). But it just doesn't align with what we're trying to accomplish in Solaris.

So it seems my previous entry has finally started to stir up some controversy. I'll address some of the technical issues raised here shortly. But first I thought I'd clarify my view of the GPL, with the help of an analogy:

Let's say that I manufacture wooden two by fours, and that I want to make them freely available under an "open source" license. There are several options out there:

  1. You have the right to use and modify my 2x4s to your hearts content.

    This is the basis for open source software. It protects the rights of the consumer, but imparts few rights to the developer.

  2. You have the right to use my 2x4s however you please, but if you modify one, then you have to make that modification freely available to the public in the same fashion as the original.

    This gives the developer a few more guarantees about what can and cannot be done with his or her contributions. It protects the developer's rights without infringing on the rights of the consumer.

  3. You have the right to use my 2x4 as-is, but if you decide to build a house with it, then your house must be as freely available as my 2x4.

    This is the provision of the GPL that I don't agree with, and neither do customers that we've talked to. It protects my rights as a developer, but severely limits the rights of the consumer in what can and cannot be done with my public donation.

This analogy has some obvious flaws. Open source software is neither excludable nor rival, unlike the house I just built. There is also a tenuous line between derived works and fair use. In my example, I wouldn't have the right to the furniture put into your house. But I feel like its a reasonable simplification of my earlier point.

As an open source advocate, I would argue that #1 is the "most free". This is why, in many ways, the BSD license is the "most open" of all the main licenses. As a developer, I would argue that #2 is the best solution. My contribution is protected - no one can make changes without giving it back to me(and the community at large). But my code is essentially a service, and I feel everyone should have a right to that service, even if they go off and make money from it.

The problems arise when we get to #3, which is the essential controversy of the GPL. To me, this is a personal choice, which is why GPL advocacy often turns into pseudo-religious fanaticism. In many ways, arguing with a GPL zealot is like an atheist arguing with a religious fundamentalist. In the end, they agree on nothing. The atheist leaves understanding the fundamentalist's beliefs and respects his or her right to have them. The fundamentalist leaves beliving that the atheist will burn in hell for not accepting the one true religion.

This would be fine, except that GPL advocates often blur the line between #2 and #3, and make it seem like the protections of #2 can only be had if you fully embrace the GPL in all its glory. I support the rights provided by #2. You can scream and shout about the benefits of #3 and how it's an inalienable right of all people, but in the end I just don't agree. Don't equate the GPL with open source - if you do want to argue the GPL, make it very clear which points you are arguing for.

One final comment about GPL advocacy. Time and again I see people talk about easing migration, avoiding vendor lockin, and the necessity of consumer choice. But in the same breath they turn around and scream that you must accept the GPL, and any other license would be pure evil (at best, a slow and painful death). Why is it that we have the right to choose everything except our choice of license? I like Linux. I like the GPL. The GPL is not evil. There are a lot of great projects that benefit from the GPL. But it isn't everything to all people, and in my opinion it's not what's best for OpenSolaris.

[ UPDATE ]

As has been enumerated in the comments on this post, the original intent of the analogy is to show the the definition of derived works. As mentioned in the comments:

Say I post an example of a function foo() to my website. Oracle goes and uses that function in their software. They make no changes to it whatsover, and are willing to distribute that function in source code form with their product. If it was GPL, they would have to now release all of Oracle under the GPL, even though my code has not been altered. The consumer's rights are preserved - they still have the same rights to my code as before it was put into Oracle. I just don't see why they have a right to code that's not mine.

Though I didn't explain it well enough, the analogy was never intended to encompass right to use, ownership, distribution, or any of the other qualities of the GPL. It is a specific issue with one part of the GPL, and the analogy is intentionally simplistic in order to demonstrate this fact.