OS Technology

Tim Marsland's Weblog
Wednesday May 07, 2008

OpenSolaris 2008.05

I'm thrilled by the latest delivery of Sun's OpenSolaris distro, and I was happy to see the "party" atmosphere at Community One on Monday. A lot of people have worked very hard to put the thing together, and everyone wanted to mark this event - both as the culmination of a lot of work, and as the beginning of a new phase of building out the distro with more content from Sun, from our business partners, and from our communities. Thanks to everyone for their efforts.

If you haven't already given it a spin, try it out - it's a live CD image you can download from here.

Monday May 05, 2008

VirtualBox 1.6 at Community One

Hmm. Always wondered if I could write a blog entry "live" from a conference session. Since I'm writing this entry "live", I'll apologise ahead of time for muddling my tenses as I move more into the present tense!

I'm here at Community One listening to Joost Pronk and Achim Hasenmueller talking about the latest improvements and coolness in VirtualBox. Achim's team has just released version 1.6, which has an impressive list of features that can be found in the changelog right here.

  • Hey, Jonathan just dropped by to say hello before the talk begins.
  • Now Joost is explaining the context i.e. the different virtualization technologies that Sun offers and where VirtualBox fits in the picture.
  • Next, Achim is introducing the core technology, the history, the design, how it's being used.
  • Christoph Schuba is now doing a demo of Trusted Solaris multi-level security with Windows Vista running in VirtualBox, which is a neat combination of the Zones technology on which Trusted Solaris is based, and the VirtualBox type 2 hypervisor.
  • Oh and Christoph is also running OpenSolaris native on a MacBook - because, as Christoph says, OpenSolaris runs on lots of machines. Very, very, nice demo.
  • Now Achim is walking through configuring and running OpenSolaris on VirtualBox. Oh cool, now he's showing off the latest guest additions for OpenSolaris, which means things like mouse pointer integration and seamless mode now work too.
  • Now we're delving into the architecture a little, and Achim's talking about the new WSDL/SOAP interface to VirtualBox that's a part of 1.6 to allow other management tools to manipulate the hypervisor and its guests. The default graphical UI and the command line UI already use that interface; now the API is available to allow developers to do what they want with it.
  • Achim's also particularly happy about the virtual SATA controller that's new for 1.6
  • VirtualBox now runs on Solaris 10U4 and OpenSolaris too. Some nice integration with ZFS and zpools, with integrated iSCSI support. Oh and VirtualBox is Zones aware too, and it can be used simultaneously in multiple zones.
  • Mac support is getting better, but there's still more in the pipeline.

Now the Q&A begins

  • Q: Virtual Machine migration tools?
  • A: Better support for VMDK formats in the pipeline
  • Q: Older hardware?
  • A: Tend to run out of memory if you want to run modern guests! Otherwise, anything that's Pentium III or above works.
  • Q: Compare and contrast with Xen?
  • A: Xen is a type 1, VirtualBox is a type 2, Xen emphasis on paravirtualization, VirtualBox focus on usability.
  • Q: What kind of shared storage for live migration?
  • A: No specific restrictions at this point - files, devices, or iSCSI targets

Now some roadmap items that are being worked on

  • Memory ballooning
  • 64-bit guests
  • Live migration
  • 3D virtualization
  • More portable snapshots
  • VMDK support
  • VHD support
  • Nested paging for AMD-V and VT-x
  • Next generation seamless windowing with better desktop integration
  • Paravirtualization using VMI and Windows Enlightenments

You can download VirtualBox 1.6 by following the downloads link from virtualbox.org.

Thursday May 01, 2008

Solaris 9 containers available too

In case anyone didn't notice, we released Solaris 9 containers too. There's a datasheet on it right here, and Dan Price has a great entry on it here.

Wednesday Mar 26, 2008

One last thing from TechDays

I gave the OpenSolaris Virtualization Technologies talk at Sun TechDays in Sydney, and there were a couple of slides from that I really wanted to post here as this stuff continues to surprise people. First, let's go back a few years. When we first started exploring the technology that became Solaris Zones, we quickly realized it was a very powerful idea, and while we were creating the basic version for Solaris 10's initial release, Glenn and team went off to implement Trusted Solaris using Zones. A little later, Nils and team went off and implemented "Brand Z" - a general mechanism to allow a Zone to have a completely different system call interface, which in turn allows a zone to have a completely different OS personality, and for example, to allow us to run the userland components of a Linux distribution in a zone.

The most recent application though, is one that has some interesting properties which illustrate some of the properties of virtualization more generally. That's Project Etude, which Dan Price and his colleagues delivered into Solaris 10 a few months ago - the essential concept being a Solaris 8 Zone running on a Solaris 10 system. This slide from marketing shows where the Brand Z personality component for Solaris 8 lives in the usual zones architecture picture.

solaris-8-brand.jpg

Other than "because we can" :-) why did we do this? Well, the really compelling point associated with this technology is how we can move our Solaris 8 / SPARC customers forward from their older, underutilized, power-hungry hardware to the latest generation of power-efficient SPARC hardware, like the commodity SPARC microprocessors in the CoolThreads family of servers based on the Niagara processors. Looking from a 2008 vantage point, Solaris 8 seems like quite an old operating system now, and many of the machines it was originally installed on are from the same era. In the meantime Moore's law has advanced, and when combined with the CMT capabilities of Niagara, has enabled us to create very powerful, but extremely power-efficient systems that can be used to consolidate multiple machines-worth of workload. There are obvious analogs in the legacy Windows / legacy x86 world that are powering the Windows consolidation move across the industry. But the thing I like about the customer case study below is the actual numbers involved.

eco-consolidation.jpg

The 8800W and 275W numbers are power consumption comparisons, the BTU numbers are cooling costs. Smaller is better! These are real cost and power savings, and I think it illustrates that maybe there's some interesting comparative work to do here - comparing the efficiency of consolidation using different virtualization technologies e.g. zones vs. hypervisor virtualization technologies. A complex problem to characterize, yes. And interesting trade-offs between isolation, management complexity, business agility and performance. But our repeated experience is that this path is proving very effective for our customers.
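To put those two power figures in perspective, here is a quick back-of-the-envelope sketch. It uses only the 8800W and 275W numbers quoted above; the BTU values it prints are derived from the standard conversion of roughly 3.412 BTU/h per watt, on the assumption that all the electrical power ends up as heat to be removed. They are not the BTU figures from the slide itself.

```python
# Back-of-the-envelope comparison of the two power figures quoted in the post.
# Assumption: every watt drawn becomes heat, so cooling load in BTU/h is
# simply watts times the standard conversion factor (1 W is about 3.412 BTU/h).
WATTS_TO_BTU_PER_HOUR = 3.412

before_watts = 8800.0  # consolidated-from systems, as quoted
after_watts = 275.0    # consolidated-to system, as quoted

power_ratio = before_watts / after_watts            # how many times less power
power_saving = 1.0 - after_watts / before_watts     # fractional reduction

before_btu = before_watts * WATTS_TO_BTU_PER_HOUR   # derived, not from the slide
after_btu = after_watts * WATTS_TO_BTU_PER_HOUR     # derived, not from the slide

print(f"Power ratio:   {power_ratio:.0f}x")
print(f"Power saving:  {power_saving:.1%}")
print(f"Heat before:   {before_btu:.0f} BTU/h (derived)")
print(f"Heat after:    {after_btu:.0f} BTU/h (derived)")
```

The point of the arithmetic is simply that the cooling load scales with the power drawn, so the same ratio shows up twice on the bill.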

Final reflection - I remember wondering why I had to learn all about classical thermodynamics in my engineering courses. What did that have to do with computers and communications technology? And now the irony is that for many of our datacenter customers, their number one concern is the basics of classical thermodynamics and the brutal economics associated with it - getting ever more computational work done while saving power and reducing the need for cooling.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Wednesday Mar 19, 2008

Sydney to Stuttgart

So, here I am still on the road, starting to write this in Stuttgart airport, after spending a couple of days with the VirtualBox team. We've been introducing them to Sun, and doing a first pass on the roadmap ahead. The usual balance of lots of things we'd like to do, with software physics nipping at our heels - there's always more to do than we have people or time for!

Talking about VirtualBox in Australia was great fun; I introduced it to the audience of my keynote, which as I remarked in my earlier post was one of the largest audiences I've presented to. The full talk should eventually make it to the techdays site, but there are a couple of slides I wanted to review here now for people that can't get them and don't have time to navigate the site to find them. Basically, my talk was about Open Source Virtualization and Project Indiana, and how the two are connected. At first sight, one might imagine that they're not particularly connected at all. VirtualBox is an open source desktop hypervisor, a key component in our virtualization technology portfolio, but otherwise only related to OpenSolaris as a possible host and guest OS. VirtualBox is part of a more general application lifecycle picture involving xVM server and xVM Ops Center, as is nicely captured in this marketing graphic.

app-life-cycle-xvm.jpg

On the other hand, Project Indiana is, among other things, about bringing the technology of distribution to the OpenSolaris technology code base. This graphic from Ian Murdock's presentation at the Hyderabad TechDays conference captures this notion of the OpenSolaris core surrounded by additional, up-to-date, packages from various open source communities nicely.

best-of-both-worlds.jpg

And this one summarizes the Indiana project goals:

opensolaris-project-indiana.jpg

So what is the connection between them? The connection is in the technology called the distro builder. This is the part that takes the recipe for a distribution that performs a particular set of functions and capabilities, and translates it into the right set of interdependent packages that can perform that function. At first sight, this seems a bit of a niche interest too. After all, relatively few people want to build OS distros .. surely? Well, it's more than you might think. In fact, if you take a look at many IT organizations, they tend to build something very similar for their internal use as a way of reducing costs. They rarely if ever take a vanilla OS install from a vendor. There's always some customization; packages added, removed, administrative pre-configuration of various kinds. So one important target for our distro builder technology is those IT organizations. Another important target is the developers creating virtual appliances to demonstrate their technologies. You can find these virtual appliances on lots of companies' web sites these days - usually a file-based image that you can download and quickly instantiate on top of a hypervisor. The advantage being that you -don't- have to install and configure the software you're interested in exploring directly on your desktop operating system - you run it inside a preconfigured OS as a virtual machine. Then when you tire of the demo, you can discard it quickly, and completely, by discarding the virtual machine. So here's the graphic I kludged up to represent how the Indiana project and VirtualBox fit into that picture.

virt-plus-indiana.jpg

On the left side is the distro builder, on the right side is the developer to deployer flow using VirtualBox on the desktop, and xVM Server and xVM OpsCenter in the data center. If you want to find out more about the IPS packaging system that underpins this picture, then I'd recommend starting on Stephen Hahn's blog.
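The recipe-to-packages step can be pictured as computing a dependency closure. The sketch below is purely illustrative: it is not the distro builder or IPS, and the catalog, the package names and the `resolve` function are all made up for the example. It just shows how a short recipe expands into an ordered set of interdependent packages.

```python
from typing import Dict, List, Set

# Hypothetical package catalog: every name and dependency here is invented
# for illustration and does not correspond to real OpenSolaris packages.
CATALOG: Dict[str, List[str]] = {
    "kernel": [],
    "libc": ["kernel"],
    "network-stack": ["kernel"],
    "web-server": ["libc", "network-stack"],
    "database": ["libc"],
    "rails-app": ["web-server", "database"],
}


def resolve(recipe: List[str], catalog: Dict[str, List[str]]) -> List[str]:
    """Expand a recipe into an install order where each package
    appears after everything it depends on (depth-first traversal)."""
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(pkg: str) -> None:
        if pkg in done:
            return
        if pkg in in_progress:
            raise ValueError(f"dependency cycle involving {pkg}")
        if pkg not in catalog:
            raise KeyError(f"unknown package {pkg}")
        in_progress.add(pkg)
        for dep in catalog[pkg]:
            visit(dep)
        in_progress.discard(pkg)
        done.add(pkg)
        order.append(pkg)

    for pkg in recipe:
        visit(pkg)
    return order


# A one-line "recipe" for a demo appliance pulls in the whole stack beneath it.
appliance = resolve(["rails-app"], CATALOG)
print(appliance)
```

The same closure works whether the consumer is an IT organization building its standard image or a developer building a virtual appliance; only the recipe changes.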

An open question we've pondered for a while is when virtual appliances will move beyond internal-use and demo-ware to be a mainstream style of SW distribution. There are several vendors who would like this to be true sooner than later. But the question is open for several reasons: performance is a universal concern, and licensing of well-known proprietary operating systems in this context is another issue. Another important barrier is how to resolve the support issues for all the software involved that the creator of the application, or facility being demonstrated, didn't write. Let's think about a more concrete example. Assume I'm a developer at a small company, and I create a compelling application that uses various application infrastructure components e.g. an application server, a database, and, e.g. a chunk of ruby-on-rails to do its work. Then I install and configure it all on an OS image of my choosing, make it into a virtual appliance, then deliver it to my customers that way. That packaging and configuration work I did is a clear advantage to my customer in terms of getting started with the application and exploring the value of the offering - which is one of the reasons vendors are using it more and more for demoware. But having delivered a complete package that way, it seems like as well as being responsible for the defects in my application, I'm now -also- responsible for all the defects in the customized stack I created too. Do I now have to start worrying about all the security patches I might need to include in my virtual appliance? Eeek! That seems really hard. We probably need a better answer to that support question that involves the vendors of the components I used before delivering fully-customized yet fully supported software stacks as virtual appliances really takes off.

Technorati Tag: OpenSolaris
Technorati Tag: VirtualBox

Thursday Mar 06, 2008

It's Wednesday afternoon, so this *must* be Melbourne ...

small_IMG_0112.JPG

Hmm. My first blog entry on the road; writing this in terminal 3 in Sydney airport with my battery running out. Let's see how well this works.

There are quite a number of Sun software engineers in Australia this week. We've been holding one of our TechDays developer conferences here in Sydney. If you get a chance to attend a TechDay in your city or within reach, I do recommend it: lots of information, lots of cool demos and giveaways, and real engineers presenting interesting talks about what they're passionate about. As an example, I heard George Wilson's excellent talk about the work going on in the OpenSolaris storage community for the first time today - and was intrigued by how interested the audience was in the ZFS-aware installer demo he showed. It's very clear that engineers all over the world are excited by ZFS.

I had a small part in the proceedings too - I did the community keynote on Wednesday morning, and talked about virtualization, Project Indiana, and the connections between them. Probably the largest audience I've ever spoken to; rather daunting; I hope it made some kind of sense to people. And this afternoon I did an OpenSolaris virtualization technologies talk. In between those events, I've been flying around Australia, visiting universities in Sydney, Melbourne, Canberra and (tomorrow) Brisbane. As you can tell from the title of this post, I'm a little dizzy. But as you'd probably expect, I just can't resist talking about Sun xVM, and VirtualBox and doing demos wherever I go. A few people at UTS had heard of VirtualBox, and were simply pleased to hear about the acquisition, the evolving roadmap, and our investment in this technology. Most of the others hadn't heard of it, but were definitely interested. At the conference, I started seeing it on everyone's desktop I caught sight of.

Don Kretsch and Liang Chen are also visiting the universities, talking about HPC and HPC Tools, which has sparked a lot of interesting conversations. Josh Marinacci was also with us in Sydney, talking to the students about Dynamic Languages. He showed some fun JavaFX demos - you can find them on Josh's blog.

Though it's been tough to keep track of all the people I've met, it's been wonderful meeting students and faculty across the country. Really smart people who are interested in the technologies Sun is working on around Virtualization, HPC and Dynamic Languages. And we were there to listen to them talk about the technical and scientific problems they're working on - mostly around HPC, but also in other areas e.g. the challenges of scale and parallel programming presented by multicore architectures, complex real-time systems, and more.

As an engineer, it's always been important to me to build real things that other people find useful. That's the ultimate intellectual reward for me, and I think for most of the other engineers at Sun. It's not just about sharing and community in some remote, abstract sense, it's about making positive, real, engineering contributions to a community that can then use them. I think that's fundamentally what makes all engineers tick. So, this morning at ANU, I was surprised and pleased to hear of some work they've been doing, assessing the effectiveness of the MPO subsystem (memory placement optimizations, aka interfaces to describe NUMA machines) we built in OpenSolaris for Opteron-based systems. And in particular, they'd been using the lgroup abstraction, which I had a small hand in the architecture and initial design of a few years ago. So it was great to see the lgroup API being used, the implementation assessed against real needs, and found to be doing well; being used just how we hoped it would. And I'm looking forward to connecting these graduate students and their results with my colleagues Jonathan Chew and Bart Smaalders who put the hard work in designing and implementing the MPO code - I know they'll be interested in the results, and looking for ways to make MPO even better.

I think I'm completely smitten with Australia the country. It's my first visit here, and even though I haven't seen that much beyond CS department seminar rooms, the Darling Harbour Convention center, and the insides of various airports, hotels and taxis, I'm really drawn to the people I've met, and the comfortable feel of the culture. I need to come back again on a family vacation so we can sample the great outdoors too, and imbibe the history (and some of the wine). It also looks to be a very beautiful place outside of the cities - even though I've only seen tantalizing glimpses on this trip.

Technorati Tag: OpenSolaris
Technorati Tag: VirtualBox

Tuesday Feb 12, 2008

VirtualBox

It's official. We just announced our intent to acquire innotek - a small company in Germany with (a) some very smart people and (b) some very significant technology, called VirtualBox. What is VirtualBox? Well, if you're a hypervisor engineer, then it's best explained as a high performance type 2 hypervisor that uses a combination of virtualization techniques to run many different unmodified operating systems in x86 virtual machines. It's highly portable across multiple hosts and supports a wide range of guest operating systems.

But perhaps that's a bit dry. And you don't need to be a hypervisor engineer to find it extremely useful.

Think of it this way. If you download and install VirtualBox on your laptop - running Windows, MacOS X, Linux or OpenSolaris, you can then run most any other popular Operating System on the same machine. Or several at the same time, depending on what hardware resources are available. The download is around 25Mbytes on most platforms. And what's truly cool about it for developers is that the download is free for personal use, and the code for VirtualBox is GPLv2 open source. So as well as VirtualBox being a cool product and a powerful set of technologies, it's also a community, and a great fit with Sun's broader open source strategies.

We think this tool is incredibly useful for developers - because most developers want to target multiple operating systems to maximise their audience and return on the time they've invested in their applications, and tools like VirtualBox let them do that by running everything - test environments, debug environments, etc. - on a single laptop. How does VirtualBox stack up against the other laptop and desktop options? Well I think it's great, but you don't have to take my word for it - there's a couple of great reviews here and here.

OpenSolaris and VirtualBox

My first conversation with the innotek engineering team was over a year ago. They told me about the work they'd been doing, what VirtualBox was capable of back then, where they were going, and how they'd just made it an open source project. I was really impressed. And in many ways we've been working on a closer relationship ever since. Things really started to move quickly when we visited them last September. At that time, builds of OpenSolaris had already been working as guests, but after a marathon effort the night before we arrived at their offices, they managed to demonstrate OpenSolaris as a host for VirtualBox - a pretty significant capability for OpenSolaris. I took this screenshot during the first few hours of it working.

IMG_0050.JPG

That's IE running inside Windows XP in the foreground, displaying the opensolaris.org home page. The next day was even more exciting when they showed me seamless mode - applications running under XP sitting on the OpenSolaris desktop in the image below.

IMG_0052.JPG

Windows Media Player running on the OpenSolaris desktop - whatever next?

For people running OpenSolaris, there's a new beta version of VirtualBox that's just been posted on the virtualbox.org site. Alternatively you can build it from source by following the instructions that Joe Bonasera posted on his blog.

Have fun with it!

Technorati Tag: OpenSolaris
Technorati Tag: VirtualBox

Monday Feb 11, 2008

Five things you probably don't know about me

It's been a long time since I've written an entry, despite having been tagged by James just over a year ago. So let's start with that. Here's five things you probably don't know about me:

  1. I'm an AFOL - an Adult Fan Of Lego. I have far too much Lego at home, but it's impossible for me to resist the new sets every year. I just got a note from amazon.com telling me that a book my sister chose for me for Christmas just shipped - it has the coolest title - "Forbidden Lego: Build the Models Your Parents Warned You Against!" Though I don't recall my parents ever warning me ...
  2. I grew up in a tourist town on the coast of the north west of England. Pleasant enough, yes, but not really an intellectual hotspot. I still remember the first time I picked up and read Scientific American in W H Smiths - I was 13 and it was literally as if a whole new world opened up. First of all it contained a bunch of exotic-seeming ads for US technology companies and products. But most importantly, Martin Gardner's column, Mathematical Games, gave me a completely different perspective on what, up to that point, had been a difficult subject for me.
  3. A lot of my time at Cambridge as a graduate student was spent on EM fields - I guess I just got hooked on Maxwell's equations. So I was thrilled in 2002 to visit the Very Large Array in New Mexico with James, Rob Gingell, Jim Mitchell, Josh Simons and others. The VLA was designed in the late 1970s and used very long segments of circular waveguide - basically hollow pipes - to carry the signals from each antenna to the central facility. This particular waveguide was constructed from a single insulated strand of helical wire wound on the inside of an otherwise hollow tube. That construction allows a low-loss TE01 transmission mode to propagate down the guide, and prevents its conversion to other, lossier modes. Before the trip I'd only ever thought about this in theoretical form, so it was very cool to suddenly be face-to-face with a huge deployment of this idea. But I remember my colleagues thinking I was a little bit strange to be so fascinated by this stuff. Hmm, there on the web you can find a performance evaluation of the system.
  4. Back in 1987 I was a Fellow of Clare College, Cambridge, and rowed in the Fellows VIII in the May Races. Since I hadn't rowed as an undergraduate, it involved learning how, and I made many early morning outings on the River Cam, getting up at 6am and cycling across town in the cold spring air. At the time I was still living the nocturnal life of the hacker, so the truly surprising part was that I stuck with it through to the day of the competition! Of course I don't remember if we were bumped, or if we bumped the boat in front of us, but I guess it's the taking part that counts.
  5. When I first started using Sun's software at Cambridge University, one of the things that really impressed me back then was the quality of the documentation - in particular the SunOS 4.1 manpages. We used to use them as a kind of definitive reference work, they certainly were a lot better than the offerings of the other vendors at the time. Nine years after I joined Sun, in a typical twist of geekdom, I married one of the writers that created them.

Phew, that's finally done. Since it's so long since I was tagged, I'm not going to propagate it, and I'll quietly put this thread of the tag-fest to rest.

In the meantime, what else has been happening? Well, since the last entry, Bob Brewin and I have been working for Rich Green as the two CTOs responsible for Sun's Software technology portfolio. We also report to Greg Papadopoulos. I like to joke that Bob handles everything beginning with J, and I do the rest - though in reality we work pretty closely, managing the technology portfolio, reviewing what we're doing and how we're doing it, listening to customers, identifying gaps and opportunities. I've also been continuing a special interest in Virtualization and the technology underlying the Indiana program. It has been, and continues to be sometimes frustrating, sometimes fun but almost always interesting. Though the pace is becoming frenetic as more and more opportunity comes our way as Sun's software business expands - the truly cool MySQL acquisition being the most recent example of that.

Friday Jul 14, 2006

OpenSolaris on Xen update

We've just posted a significant update to the source and binaries of the OpenSolaris on Xen project.

New capabilities include OpenSolaris as domain 0 - currently the controlling domain, and I/O domain for Xen-based systems - as well as both 32-bit and 64-bit kernels, diskful and diskless guests, and up to 32-way virtual SMP.

It's been a lot of work by a small team of dedicated people. It's still a work-in-progress, there are bugs, things we haven't finished yet etc., but we thought it was a good time to share where we are so far with the OpenSolaris developer community.

Technorati Tag: OpenSolaris

Monday Feb 13, 2006

Opening Day for OpenSolaris on Xen

Today, we're making the first source code of our OpenSolaris on Xen project available to the OpenSolaris developer community.

There are many bugs still in waiting, many puzzles to be solved, many things left to do.

A true work in progress.

Because we don't believe the developer community only wants finished projects to test. We believe that some developers want to participate during the development process, and now this project can open its doors to that kind of participation.

We wanted to start the conversation with working code. So we have a snapshot of our development tree for OpenSolaris on Xen, synced up with Nevada build 31. That code snapshot should be able to boot and run on all the hardware that build 31 can today, plus it can boot as a diskless unprivileged domain on Xen 3.0. While we were in our final approach to this release, we got live migration to work too, which is one of the key features we've been working on.

Running on Xen, OpenSolaris is reasonably stable, but it's still very much "pre-alpha" compared with our usual finished code quality. Installing and configuring a client is do-able, but not for the faint of heart.

To find out more, see the OpenSolaris on Xen community

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Tuesday Aug 09, 2005

Solaris Deployment and Kernel Development with Diskless Clients

We did the Xen bringup effort using NFS diskless technology. Why? Are we just so infatuated with NFS that we can't resist using it? Nope, there are at least two good reasons: bringup and deployment.

The Kernel Bringup Cycle
One of the things that characterizes bringup projects (particularly ones based on printf-debugging(!)) is a repeated cycle of fix-test-debug. This is different from incremental development. You're not working on one particular subsystem or tricky bug in a 99% working system, you're continually moving from subsystem to subsystem as the bringup progresses.

If you do disk-based bringup, to install the next version of the system, you have to boot that system using a working kernel, install the bits, then reboot to try them out. This can be quite a pain on a completely new machine, where a working kernel may not exist at all, in which case you're continually recabling the disk. Even if you can boot it, you quickly get tired of listening to the disk clank and whirr, and the BIOS chug through its tests to announce what identical things it has (re-)detected over and over again.

Diskless bringup using NFS is a lot faster and (once you tweak the configuration correctly) simpler. Instead you just place the new kernel bits you want to test onto an NFS server, then just boot the client machine. And of course booting diskless domains under Xen is even simpler because there's no BIOS involved at all - Xen's domain builder is vastly simpler and faster.

Once I/O happens over the network, you can easily observe what the client kernel is actually doing via snoop(1M), watching the first RARP and ARP attempts, through to the fully fledged NFS traffic between client and server.

Finally, of course, it's easier to work on the disk driver this way too, with a fully functional diskless system around you. That also helps as you don't place your boot image at quite the same level of risk as when testing your prototype driver.

Deployment
One of the key things that Xen can do is transparent workload migration; that is the ability to move a running pure-virtual domain from machine to machine with almost imperceptible down-time. Diskless operation is a natural environment for exploring domain migration across a pool of machine resources in a data center, because of the various advantages of file-based protocols generally, and because that state is in storage across the network.

Diskless operation is also a means of managing the multiple OS images and patch levels for all the virtual machine environments that you might want to create on a pool of hardware resources. That is one of the biggest problems with large scale virtual machine deployments, and one of the problems that OS virtualization technologies like Solaris Zones neatly avoid.

While we're on this topic, I thought people might be interested in this little vignette. I attended a virtualization BOF at the Ottawa Linux Symposium a week or two ago; people from Red Hat, VMware and IBM spoke, but I was quite surprised when one of the IBM VM technologists stood up and said that Solaris Zones solved the problem of managing multiple OS images really well, and how their customers were asking for it, and how he wished the Linux community could extend projects like vservers to try to solve those problems in a similar way. Since IBM has been working on virtualization technologies for many, many years, and I have a lot of respect for their technology and experience in this area, I took that as quite a compliment to what we built in Solaris 10.

There is no "one-size fits all" virtualization technology; each has its own advantages and disadvantages. Eric Schrock wrote a great explanation of the relationship between the technologies i.e. where OS virtualization technology like Zones is useful, and where hardware virtualization technology like Xen can help. I strongly believe the two technologies are complementary, and will allow customers to strike a balance between utilization and isolation while keeping a lid on complexity for large-scale deployment.

I rather suspect the IBM VM guys think so too.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: Xen

Thursday Jul 28, 2005

"Hello World" from Solaris on Xen

Last Friday we went multiuser for the first time on our Solaris-on-Xen port. (For the uninitiated, Xen is an open source hypervisor from the University of Cambridge - see http://xen.sf.net)

The underlying hardware is a 2-way Opteron box. We're at a point in the port where we still emit loads of debugging noise, and the boot-up sequence itself isn't that interesting, but it all works pretty well, and I just ssh'ed into the virtual machine and posted this very blog entry from a Solaris domU running side by side with a dom0 kernel, and a domU version of Linux.

Here's some cut-and-paste of my ssh session:

hostname% uname -a
SunOS hostname 5.11 fpfix-2005-07-22 i86xen i386 i86xen
hostname% isainfo -x
i386: sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov cx8 tsc fpu
hostname% psrinfo -vp
The physical processor has 1 virtual processor (0)
  x86 (AuthenticAMD family 15 model 5 step 10 clock 2391 MHz)
        AMD Opteron(tm) Processor 250

Here's what this looks like from the dom0 side - the "control" kernel for the machine. (If this doesn't look right in your browser, it's my fault - the command really does generate nicely aligned columns ..)

# xm list
Name             Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0          0      955    0  r----   1115.2        
linuxhost        67      128    0  -b---      5.5    9667
hostname         66      511    1  -b---    539.7    9666

Thanks to everyone on the team - Todd, Joe, Stu, for all their hard work, and thanks to Ian Pratt of the Xen project for patiently answering our silly and occasionally not-so-silly questions.

S'very cool -- it boots pretty fast, despite the fact that we're still relying on non-batched hypervisor calls, and the kernel code is covered in ASSERTs. No, we don't have domain migration working yet - Joe and Stu are working on the infrastructure for that right now.

What's next for Solaris-on-Xen?
Well, we've been working off the (now rather long in the tooth) Xen 2.x source base; and we need to move on to the next major release of Xen: specifically the Xen 3.0 source base which the Xen team in Cambridge say is close to entering its testing phase. Xen 3.x provides a bunch of interesting capabilities that we're keen to explore: multiprocessor guests, 64-bit kernels, and we also want to make it possible to use Solaris in domain 0. Ian gave a great presentation on Xen, including more 3.0 details last week at OLS.

And we're happy to have other people join this project at this early stage to help us do that, or even just to experiment with the code in whatever other way they want to. To enable that, we're launching an OpenSolaris community discussion group about OpenSolaris on Xen where future postings like this will end up. Just to set expectations - we do have a wad of cleanup of the 2.0 work to do, and we have to sync up with the Solaris gate so that we catch up with OpenSolaris (we started before the OpenSolaris launch and we've been based off build 15 ever since). There's a few weeks' work to do there.

Obviously, we're early in the evaluation phase of this technology, and while the capability and code base will be an OpenSolaris project, it won't be integrated into the top-level OpenSolaris tree until the project is complete. So please don't expect these capabilities to show up in the official OpenSolaris builds for quite a while. No, I don't even have a schedule for when they will.

How to participate? Well we're working on various ideas to make source and various builds and other components available to let the OpenSolaris community try this technology out, give us feedback, and participate in the engineering.

You can keep up with what we're doing by joining the OpenSolaris on Xen community at opensolaris.org which should appear in a day or two.

Update
The OpenSolaris on Xen community is now up: register on the OpenSolaris web site to participate.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: Xen

Tuesday Jun 14, 2005

Opening Day

This is opening day, and I want to say "Welcome!" to everyone that's interested in taking a look under the hood of, and tinkering with, our favourite operating system. It's taken many of us a lot of hard work to get this far, and yet this is where the conversation starts, and the journey really begins.

I'm really looking forward to participating, and seeing what we, the OpenSolaris community, can build. Together.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Solaris 10 on x64 Processors: Part 4 - Userland

Userland

The amount of work involved in the kernel part of the amd64 project was fairly large; fortunately, the userland part was more straightforward because of our prior work on 64-bit Solaris on SPARC back in 1997. So, for this project, once the kernel work, which abstracts the hardware differences between processors, was done, many smaller tasks appeared that were mostly solved by tweaking Makefiles and finding occasional #ifdefs that needed something added or modified. Fortunately, it was also work that was done in parallel by many people from across the organizations that contribute to the Solaris product.

Of course there were other substantial pieces of work like the Sun C and C++ compilers, and the Java Virtual Machine; though the JVM was already working on 32-bit and 64-bit Solaris on SPARC as well as 32-bit on x86, and the Linux port of the JVM had already caused that team to explore many of the amd64 code generation issues.

One of the things we tried to do was to be compatible with the amd64 ABI on Linux. As we talked to industry partners, we discovered that there was a variety of interpretations of the term "ABI." Many of the people we talked to outside of Sun thought that "ABI" only referred to register usage, C calling conventions, data structure sizes and alignments: a specification for compiler and linker writers, with little or nothing beyond that about the system interfaces an application can actually invoke. But the System V ABI is a larger concept than that, and was at least intended to provide a sufficient set of binary specifications to allow complete application binaries to be constructed that could be built once, and run on any ABI-conformant implementation. Thus Sun engineers tend to think of "the ABI" as being the complete set of interfaces used by user applications, rather than just compiler conventions; and over the years we expanded this idea of maintaining a binary compatible interface to applications all the way to the Solaris application guarantee program.

Though we tried to be compatible at this level with Linux on amd64, we discovered a number of issues in the system call and library interfaces that made that difficult, and while we did eliminate gratuitous differences where we could, we eventually decided on a more pragmatic approach. We decided to be completely compatible with the basic "compiler" style view of the ABI, and simply try to make it simple to port applications from 32-bit Solaris to 64-bit Solaris, and from Solaris on sparcv9 to Solaris on x64, and leave the thornier problems of full 64-bit Linux application compatibility to the Linux Application Environment (LAE) project.

Threads and Selectors

In previous releases of Solaris, the 32-bit threads library used the %gs selector to allow each LWP in a process to refer to a private LDT entry to provide the per-thread state manipulated by the internals of the thread library. Each LWP gets a different %gs value that selects a different LDT entry; each LDT entry is initialized to point at per-thread state. On LWP context switch, the kernel loads the per-process LDT register to virtualize all this data to the process. Workable, yes, but the obvious inefficiency here was requiring every process to have at least one extra locked-down page to contain a minimal LDT. More serious was the implied upper bound of 8192 LWPs per process (derived from the hardware limit on LDT entries).
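To make the 8192 figure concrete: an x86 segment selector is 16 bits wide, with 2 bits of requested privilege level and 1 table-indicator bit, which leaves 13 bits of index into the descriptor table. Here is a small sketch of that arithmetic in C. The function names are made up for illustration and are not taken from the Solaris source; only the bit layout is architectural.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86 selector layout: | index (13) | TI (1) | RPL (2) | */
enum {
    SEL_RPL_BITS   = 2,
    SEL_TI_BITS    = 1,
    SEL_INDEX_BITS = 16 - SEL_RPL_BITS - SEL_TI_BITS
};

/* Number of descriptors one table can address: 2^13. */
unsigned sel_max_entries(void)
{
    return 1u << SEL_INDEX_BITS;
}

/* Hypothetical helper: build a selector for LDT entry `index` (TI = 1). */
uint16_t sel_make_ldt(unsigned index, unsigned rpl)
{
    assert(index < sel_max_entries() && rpl < 4);
    return (uint16_t)((index << 3) | (1u << 2) | rpl);
}

/* Pull the three fields back out of a selector. */
unsigned sel_index(uint16_t sel)  { return sel >> 3; }
int      sel_is_ldt(uint16_t sel) { return (sel >> 2) & 1; }
unsigned sel_rpl(uint16_t sel)    { return sel & 3; }
```

With one LDT entry per LWP, the index field runs out at 8192 entries, which is where the per-process limit came from.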

For the amd64 port, following the draft ABI document, we needed to use the %fs selector for the analogous purpose in 64-bit processes too. On the 64-bit kernel, we wanted to use the FSBASE and GSBASE MSRs to virtualize the addresses that a specific magic %fs and magic %gs select, and we obviously wanted to use a similar technique on 32-bit applications, and on the 32-bit kernel too. We did this by defining specific %fs and %gs values that point into the GDT, and arranged that context switches update the corresponding underlying base address from predefined lwp-private values - either explicitly by rewriting the relevant GDT entries on the 32-bit kernel, or implicitly via the FSBASE and GSBASE MSRs on the 64-bit kernel. The result of all this work is code that is simpler and scales cleanly, and the resulting upper bound on the number of LWPs is derived only from available memory (modulo resource controls, obviously).
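A rough sketch of the 64-bit kernel case might look like the following. It is a user-space simulation, not the Solaris code: the MSRs are stood in for by an array so the example can run anywhere, and the structure and function names are invented for illustration. The two MSR numbers are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_FSBASE 0xC0000100u  /* architectural MSR holding the %fs base */
#define MSR_GSBASE 0xC0000101u  /* architectural MSR holding the %gs base */

/* Stand-in for the hardware; a real kernel would execute wrmsr/rdmsr. */
uint64_t sim_msr[2];

void sim_wrmsr(uint32_t msr, uint64_t value)
{
    assert(msr == MSR_FSBASE || msr == MSR_GSBASE);
    sim_msr[msr - MSR_FSBASE] = value;
}

uint64_t sim_rdmsr(uint32_t msr)
{
    assert(msr == MSR_FSBASE || msr == MSR_GSBASE);
    return sim_msr[msr - MSR_FSBASE];
}

/* Hypothetical per-LWP record holding only the private base addresses. */
struct lwp_sketch {
    uint64_t fsbase;
    uint64_t gsbase;
};

/* On switch-in, retarget the fixed %fs and %gs selectors at this LWP. */
void lwp_switch_in(const struct lwp_sketch *lwp)
{
    sim_wrmsr(MSR_FSBASE, lwp->fsbase);
    sim_wrmsr(MSR_GSBASE, lwp->gsbase);
}
```

The point of the design is visible even in this toy: the selector values never change, so no per-LWP descriptor table entry is consumed, and the only per-LWP cost is the memory for the saved base addresses.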

Floating point

Most of the prework we had done to establish the SSE capabilities in the 32-bit kernel was readily reused for amd64; modulo some restructuring to allow the same code to be compiled appropriately for the two kernel builds. However, late in the development cycle, the guys in our floating point group pointed out that we didn't capture the results of floating point exceptions properly; the result of a subtle difference in the way that AMD and Intel processors presented information to the kernel after the floating point exception had been acknowledged. Fortunately they noticed this, and we rewrote the handler to be more robust and to behave the same way on both flavors of hardware.

Continuous Integration vs. One Giant Putback

To try to keep our merging and synchronization efforts under control, we did our best to integrate many of the changes we were making directly into the Solaris 10 gate so that the rest of the Solaris development organization could see it. This wasn't a willy-nilly integration of modified files; instead, each putback was a regression-tested subset of the amd64 project that could stand alone if necessary. Perhaps I should explain this a little further. The Solaris organization has, for many years, tried to adhere to the principle of integrating complete projects, that is, changes that can stand alone, even if the follow-on projects are cancelled, fail, or become too delayed to make the release under development. Some of the code reorganization we needed was done this way, as well as most of the items I described as "prework" in part 1. There were also a bunch of code removal projects we did that helped us avoid the work of porting obsolete subsystems and support for drivers. As an aside, it's interesting to muse on exactly who is responsible for getting rid of drivers for obsolete hardware; it's a very unglamorous task, but one that is highly necessary if you aren't to flounder under an ever more opaque and untestable collection of crufty old source code.

In the end though, we got to the point where creating and testing subsets of our change by hand to create partial projects in Solaris 10 became just too painful for the team to countenance. Instead, we focussed on creating a single delivery of all our change in one coherent whole. Our Michigan-based "army of one," Roger Faulkner, did all of this, as well as most of the rest of the heavy lifting in userland, i.e. creating the 64-bit libc and basic C run-time etc. as well as the threading primitives. Roger really did an amazing job on the project.

Projects of this giant size and scope are always difficult; and everyone gets even more worried when the changes are integrated towards the end of a release. However, we did bring unprecedented levels of testing to the amd64 project, from some incredible, hard working test people. Practically speaking I think we did a reasonable job of getting things right by the end of the release, despite a few last minute scares around our mishandling of process-private LDTs. Fortunately these were only really needed for various forms of Windows emulation, so we disabled them on the 64-bit kernel for the FCS product; this works now in the Solaris development gate, and a backported fix is working its way through the system.

Not to say that there aren't bugs of course ...

Distributed Development

I think it's worth sharing some of the experiences of how the core team worked on this project. First, when we started, Todd Clayton (the engineering lead, who also did the segmentation work, among other things) and I asked to build a mostly-local team. We asked for that because we believed that time-to-market was critical, and we thought that we could go the fastest with all the key contributors in close proximity. However, for a number of reasons, that was not possible, and we ended up instead with a collection of talented people spread over many sites as geographically distributed as New Zealand, Germany, Boston, Michigan, and Colorado as well as a small majority of the team back in California. To help unify the team and make rapid progress, we came up with the idea of periodically getting the team together physically in one place (either offsite in California or Colorado) and spending a focussed week together. We spent the first week occupying a contiguous block of adjacent offices in another building; problem was that we didn't really change the dynamics of the way people worked with each other. Our accidental discovery came during our first Colorado meeting where we ended up in one (large!) training room for our kick-off meeting. Rather than trudge back across campus where we had reserved office space, we decided to stay put and just start work where we were, and suddenly everything clicked. We stayed in the room for the rest of the week, working closely with each other, immersing ourselves in the project, the team, and what needed to be done. This was very effective, because as well as reinforcing the sense of team during the week away, everyone was able to go back to their home sites and work independently and effectively for many weeks before meeting up again - with only an occasional phone call or email between team-members to synchronize.

Looking Back

I've tried to do a reasonable tour of the amd64 project, driven mostly by what stuck in my memory, and biassed by the work I was involved in to some degree, but obviously much detail has been omitted or completely forgotten. To the people at Sun whose work or contribution I've either not mentioned, foolishly glossed over or forgotten completely, sorry, and thanks for your efforts. To the people at AMD that helped support us, another thank you. To our families and loved ones that put up with "one more make," yet more thanks. This was a lot of work, done faster than any of us thought possible, and 2004 was in truth, well, a bit of a blur.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Wednesday Jun 08, 2005

Solaris 10 on x64 Processors: Part 3 - Kernel

Virtual Memory

One of the most critical components of a 64-bit operating system is its ability to manage large amounts of memory using the additional addressing capabilities of the hardware. The key to those capabilities in Solaris is the HAT (Hardware Address Translation) "layer" of the otherwise generic VM system. Unfortunately, the 32-bit HAT layer for Solaris x86 was a bit long in the tooth and after years of neglect was extremely difficult to understand, let alone extend. So we decided on a ground-up rewrite pretty early on in the project; the eventual benefit of that was being able to use the same source code for both 32-bit and 64-bit mode, and to bring the benefits of the NX (no-execute) bit to both 32-bit and 64-bit kernels seamlessly. Joe Bonasera, who led this work, told me a few weeks ago that he'd expand on this in his own blog here, so I'm not going to describe it any further than that.

Interrupts, DMA, DDI, device drivers

The Solaris DDI (Device Driver Interface) was designed to support writing portable drivers between releases, and between instruction sets, to concentrate bus-dependent details and interfaces in specialized bus-dependent drivers (called nexus drivers), and to minimize the amount of low-level, bus-specific code in regular drivers (called leaf drivers). Most of the work we did on the 64-bit SPARC project back in 1997 was completely reused, and the majority of the work on the x86 DDI implementation was essentially making the code LP64 clean, and fixing some of the more hacky internals of some of the nexus drivers.

The most difficult part of the work was porting the low-level interrupt handlers, which were a monumental mass of confusing assembler. Though I had thought that it would be simplest to port the i386 assembler to amd64 conventions, this turned out to have been a poor decision. Sherry Moore tried to get this done quickly and accurately, but it was a very difficult challenge. We spent many days debugging problems with interrupts that were really rooted in the differences in register allocations between the two instruction set architectures and ABIs, as well as the highly contorted nature of the original code. We spent so much time on it that I eventually became consumed with guilt and rewrote most of it in C, which unsurprisingly turned out to be much easier to debug, and is now probably the best way to understand how the threads-as-interrupts implementation actually works.

The remaining work can be split into two parts. The first was ensuring that the drivers properly described their addressing capabilities, particularly those that hadn't been updated in a while. The second was the usual problem of handling ioctls from 32-bit and 64-bit applications where the two environments use different sizes and alignments for the data types passed across the interface. Again, Solaris already had a bunch of mechanisms for doing this which we simply reused on previously i386-specific drivers to make them usable on amd64 kernels too.
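For readers who haven't met the ioctl problem before, here is a minimal sketch of the general shape of the fix. It deliberately does not use the real DDI interfaces: the structure, the enum and the function name are all invented for illustration, and a real driver would first copy the argument in from user space.

```c
#include <assert.h>
#include <stdint.h>

/* Layout as a 32-bit caller sees it: addresses and lengths are 4 bytes. */
struct xfer_req32 {
    uint32_t buf;   /* user address, 32 bits wide */
    uint32_t len;
};

/* Layout used natively inside a 64-bit kernel. */
struct xfer_req {
    uint64_t buf;
    uint64_t len;
};

/* Hypothetical tag for the data model of the calling process. */
enum caller_model { CALLER_ILP32, CALLER_LP64 };

/* Normalize the caller's argument block into the native form. */
void xfer_req_import(enum caller_model model, const void *arg,
    struct xfer_req *out)
{
    assert(arg != 0 && out != 0);
    if (model == CALLER_ILP32) {
        const struct xfer_req32 *r32 = arg;
        out->buf = r32->buf;    /* zero-extends the 32-bit address */
        out->len = r32->len;
    } else {
        *out = *(const struct xfer_req *)arg;
    }
}
```

Once the argument has been widened at the boundary, the rest of the driver only ever deals with the native structure, which is what makes the approach cheap to retrofit.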

One slight thorn in our side was the difference in alignment constraints for the long long data type. On 32-bit SPARC and 64-bit SPARC, the alignment is 8 bytes for both; however, between i386 and amd64, the alignment changes from 4 bytes to 8 bytes. This seems mildly arcane, until you recall that the alignment of these data types controls the way that basic data structures are laid out between the two ABIs. Data structures containing long long types that were compatible between a 32-bit SPARC application and the 64-bit SPARC kernel now needed special handling for a 32-bit x86 application running on a 64-bit amd64 kernel. The same problem was discovered in a few network routing interfaces, cachefs, priocntl etc. Once we'd debugged a couple of these by hand, Ethan Solomita started a more systematic effort to locate the remaining problems; Mike Shapiro suggested that we build a CTF tool that would help us find the rest more automatically, or at least semi-automatically, which was an excellent idea and helped enormously.
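The effect is easy to see with a two-member structure. The sketch below computes the layout of a hypothetical struct { int a; long long b; } by hand under each rule, rather than relying on whatever compiler happens to build it; the helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Round `off` up to the next multiple of `align` (a power of two). */
size_t align_up(size_t off, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}

struct layout {
    size_t off_b;   /* offset of the long long member */
    size_t size;    /* total size of the structure */
};

/*
 * Lay out the hypothetical structure
 *     struct { int a; long long b; }
 * given the ABI's alignment for long long: 4 on i386, 8 on amd64.
 */
struct layout layout_int_llong(size_t llong_align)
{
    struct layout l;
    size_t struct_align = llong_align > 4 ? llong_align : 4;

    l.off_b = align_up(4, llong_align);     /* b follows the 4-byte a */
    l.size = align_up(l.off_b + 8, struct_align);
    return l;
}
```

Under the i386 rule the second member lands at offset 4 and the structure is 12 bytes; under the amd64 rule it lands at offset 8 and the structure is 16 bytes. On SPARC both data models use 8-byte alignment, so the same structure never exposed the problem there.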

MP bringup, EM64-T bringup

Back in 1990, one of the core design goals of the SunOS 5.0 project was to build a multithreaded operating system designed to run on multiprocessor machines. We weren't just doing a simple port of SVR4 to SPARC, we reworked the scheduler, and invested a large amount of effort throughout the kernel, adding fine-grain locking to extract the maximal concurrency from the hardware. Fast forward to 2005, and we're still working on it! The effort to extend scalability remains one of our core activities. However, we didn't have to do a lot of work to make multiprocessor Opteron machines run the 64-bit kernel; apart from porting the locking primitives, the only porting work was around creating a primitive environment around the non-boot processors to switch them into long mode. William Kucharski (of amd64 booter fame) did this work in a week or so, and impressed us all with how quickly and how well this worked from the beginning.

We also wanted to run our 64-bit kernel on Intel's EM64-T CPUs, since we really do want Solaris to run well on non-Sun x86 and x64 systems. As we were doing other work on the system, we had been anticipating what we needed to do from Intel's documentation, so as soon as the hardware was publicly available (unfortunately we weren't able to get them earlier from Intel) Russ Blaine started working on it and had the 64-bit kernel up and running multiuser in about a week. I'm not sure if that's because Intel's specifications are particularly well written, or because Russ's debugging skills were even more excellent that week, or if it's testament to the skills of the Intel engineers at making their processor be so compatible with the Opteron architecture, but we were pretty pleased with the result.

Debugging Infrastructure

Critical aspects of the debugging architecture of Solaris that needed to be ported include the CTF system for embedding dense type information in ELF files, and the corresponding library and toolchain infrastructure that manipulates it, libproc that encapsulates a bunch of /proc operations for the ptools, /proc itself, mdb, and the DTrace infrastructure. I worked on the easy part - /proc - the difficult work was done by Matt Simmons, Eric Schrock and for DTrace, Adam Leventhal and of course Bryan Cantrill.

At the same time as we were starting our bring-up efforts on Opteron, an unrelated project in the kernel group was busy creating a new debugging architecture based on mdb(1). The basic idea was that we wanted to be able to bring most of mdb's capabilities to debugging live kernel problems. The kmdb team observed that our existing kernel debugger, kadb, was always in a state of disrepair, and yet because of its co-residence with the kernel, needed constant tweaking for new platforms. So rather than continue this state of affairs, they came to the idea that it would be simpler if we could assume that the Solaris kernel would provide the basic infrastructure for the debugger.

This has considerable advantages for incremental development, and for the vast majority of kernel developers who aren't working on new platform bringup this is clearly a Good Thing. But it does make porting to a fresh platform or instruction set a little more difficult because kmdb is sophisticated, and doesn't really work until some of the more difficult kernel code has been debugged into existence. The amd64 project had that problem in a particularly extreme form, because the debugger design and interfaces were under development at the same time as we needed them. As a result, the early amd64 kernel bringup work was really done using a simulator (SIMICS), and then by doing printf-style debugging and post-mortem trap-tracing, rather than with kmdb. I still remember debugging init(1M) using the simulator on the last day of one of our offsites in San Francisco, figuring out the bug while riding BART back home.

At this point of course, kmdb works fine and is of great help when debugging more subtle problems. However, knowing what we know now, we should have built a simple bringup-debugger to get us through those early stages where almost nothing worked. Something that could catch and decode exceptions, do stack traces and dump memory would be enough. I'd certainly recommend that path to anyone thinking of porting Solaris to another instruction set architecture; as soon as you get to the point that the kernel starts taking interrupts and doing context switches, things get way too hard for printf-style debugging!
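As an illustration of how little such a bringup debugger needs, here is a sketch of the stack trace part: a walk down a chain of frame pointers. It assumes code compiled with frame pointers, so that each frame holds the caller's saved frame pointer followed by the return address. The names are hypothetical, and the example runs over a synthetic chain rather than a live stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a frame-pointer prologue leaves behind: saved fp, then return pc. */
struct frame {
    const struct frame *saved_fp;
    uintptr_t return_pc;
};

/*
 * Walk the chain starting at `fp`, recording up to `max` return
 * addresses in `pcs`. Stops at a NULL frame pointer. Returns the
 * number of addresses recorded.
 */
size_t stack_walk(const struct frame *fp, uintptr_t *pcs, size_t max)
{
    size_t n = 0;

    assert(pcs != NULL);
    while (fp != NULL && n < max) {
        pcs[n++] = fp->return_pc;
        fp = fp->saved_fp;
    }
    return n;
}
```

The `max` bound matters more than it looks: during early bringup the chain being walked is quite likely to be corrupt, and a debugger that loops forever on a bad stack is no help at all.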

System calls Revisited

For 64-bit applications we used the syscall instruction. We used the same register calling conventions as Linux; these are somewhat forced upon you by the combination of the behaviour of the instruction, and the C calling convention, and besides, there is no value in being deliberately different.

Interestingly, the 64-bit system call parameter passing convention is extremely similar to SPARC i.e. the first six system call arguments are passed in registers, with additional arguments passed on the stack. As a result, we based the 64-bit system call handler algorithm for amd64 on the 64-bit handler for sparcv9.
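The register-then-stack split can be sketched as a simple lookup. The register order below is the Linux amd64 syscall convention, which the text above says we followed; the table and function names are invented for this example and are not from the Solaris source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NREG_ARGS 6

/*
 * Linux amd64 syscall argument registers, in order. r10 takes the
 * place of rcx because the syscall instruction overwrites rcx with
 * the return address.
 */
const char *const syscall_arg_regs[NREG_ARGS] = {
    "rdi", "rsi", "rdx", "r10", "r8", "r9"
};

/* Register carrying argument `i` (0-based), or NULL if it is on the stack. */
const char *syscall_arg_reg(size_t i)
{
    return i < NREG_ARGS ? syscall_arg_regs[i] : NULL;
}

/* Argument position carried by `reg`, or -1 if it carries none. */
int syscall_arg_index(const char *reg)
{
    assert(reg != NULL);
    for (int i = 0; i < NREG_ARGS; i++) {
        if (strcmp(syscall_arg_regs[i], reg) == 0)
            return i;
    }
    return -1;
}

/* How many of `nargs` arguments spill onto the stack. */
size_t syscall_stack_args(size_t nargs)
{
    return nargs > NREG_ARGS ? nargs - NREG_ARGS : 0;
}
```

A handler built this way only has to touch user memory for the rare calls with more than six arguments, which is the same property the sparcv9 handler already relied on.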

The 32-bit system call handlers include the 32-bit variant of the syscall instruction which works sufficiently well when the processor is running the 64-bit kernel to be usable. We also made the sysenter instruction work for Intel CPUs, and of course, the lcall handler; though this is actually handled via a #np trap in C. Our latest version of this assigns a new int trap to 32-bit syscalls which will improve the performance of the various types of system call that don't work well with plain syscall or sysenter.

More Tool Chain Issues

In the earlier "preliminaries" blog, I mentioned our use of gcc; however the Solaris kernel contains its own linker, krtld, based on the same relocation engine used in the userland utility. Fortunately, we had Mike Walker to do the amd64 linker work early on; we had a working linker a week or two ahead of having a linkable kernel.

One more thing

In my first posting on this topic I neglected to mention that there's a really good reference work for people trying to navigate the Solaris kernel - the book by Jim Mauro and Richard McDougall called Solaris Internals: Core Kernel Components; ISBN 0130224960.

Next time, I'll describe more of the userland work that completed the port.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

