OS Technology

Tim Marsland's Weblog
Tuesday Jun 14, 2005

Opening Day

This is opening day, and I want to say "Welcome!" to everyone that's interested in taking a look under the hood of, and tinkering with, our favourite operating system. It's taken many of us a lot of hard work to get this far, and yet this is where the conversation starts, and the journey really begins.

I'm really looking forward to participating, and seeing what we, the OpenSolaris community, can build. Together.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Solaris 10 on x64 Processors: Part 4 - Userland

Userland

The amount of work involved in the kernel part of the amd64 project was fairly large; fortunately, the userland part was more straightforward because of our prior work on 64-bit Solaris on SPARC back in 1997. So, for this project, once the kernel work, which abstracts the hardware differences between processors, was done, many smaller tasks appeared that were mostly solved by tweaking Makefiles and finding occasional #ifdefs that needed something added or modified. Fortunately, it was also work that was done in parallel by many people from across the organizations that contribute to the Solaris product.

Of course there were other substantial pieces of work like the Sun C and C++ compilers, and the Java Virtual Machine; though the JVM was already working on 32-bit and 64-bit Solaris on SPARC as well as 32-bit on x86, and the Linux port of the JVM had already caused that team to explore many of the amd64 code generation issues.

One of the things we tried to do was to be compatible with the amd64 ABI on Linux. As we talked to industry partners, we discovered that there was a variety of interpretations of the term "ABI." Many of the people we talked to outside of Sun thought that "ABI" only referred to register usage, C calling conventions, data structure sizes and alignments: a specification for compiler and linker writers, but with little or nothing beyond that about the system interfaces an application can actually invoke. But the System V ABI is a larger concept than that, and was at least intended to provide a sufficient set of binary specifications to allow complete application binaries to be constructed that could be built once, and run on any ABI-conformant implementation. Thus Sun engineers tend to think of "the ABI" as being the complete set of interfaces used by user applications, rather than just compiler conventions; and over the years we expanded this idea of maintaining a binary compatible interface to applications all the way to the Solaris application guarantee program.

Though we tried to be compatible at this level with Linux on amd64, we discovered a number of issues in the system call and library interfaces that made that difficult, and while we did eliminate gratuitous differences where we could, we eventually decided on a more pragmatic approach. We decided to be completely compatible with the basic "compiler" style view of the ABI, and simply try to make it easy to port applications from 32-bit Solaris to 64-bit Solaris, and from Solaris on sparcv9 to Solaris on x64, and leave the thornier problems of full 64-bit Linux application compatibility to the Linux Application Environment (LAE) project.

Threads and Selectors

In previous releases of Solaris, the 32-bit threads library used the %gs selector to allow each LWP in a process to refer to a private LDT entry to provide the per-thread state manipulated by the internals of the thread library. Each LWP gets a different %gs value that selects a different LDT entry; each LDT entry is initialized to point at per-thread state. On LWP context switch, the kernel loads the per-process LDT register to virtualize all this data to the process. Workable, yes, but the obvious inefficiency here was requiring every process to have at least one extra locked-down page to contain a minimal LDT. More serious was the implied upper bound of 8192 LWPs per process (derived from the hardware limit on LDT entries).

For the amd64 port, following the draft ABI document, we needed to use the %fs selector for the analogous purpose in 64-bit processes too. On the 64-bit kernel, we wanted to use the FSBASE and GSBASE MSRs to virtualize the addresses that a specific magic %fs and magic %gs select, and we obviously wanted to use a similar technique on 32-bit applications, and on the 32-bit kernel too. We did this by defining specific %fs and %gs values that point into the GDT, and arranged that context switches update the corresponding underlying base address from predefined lwp-private values - either explicitly by rewriting the relevant GDT entries on the 32-bit kernel, or implicitly via the FSBASE and GSBASE MSRs on the 64-bit kernel. The result of all this work is code that is simpler and scales cleanly, and the resulting upper bound on the number of LWPs is derived only from available memory (modulo resource controls, obviously).
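To make the 8192 limit concrete, here is a minimal sketch of how an x86 segment selector such as %fs or %gs is laid out: a 13-bit table index, one bit choosing GDT or LDT, and a 2-bit requested privilege level. The function and macro names are my own for illustration and are not taken from the Solaris source.

```c
#include <stdint.h>

/* x86 segment selector layout (16 bits):
 *   bits 0-1   requested privilege level (RPL)
 *   bit  2     table indicator: 0 = GDT, 1 = LDT
 *   bits 3-15  index into the chosen descriptor table
 * Names below are illustrative only. */
#define SEL_INDEX_BITS   13
#define SEL_MAX_ENTRIES  (1u << SEL_INDEX_BITS)   /* 8192 descriptors per table */

/* Build a selector from its three fields. */
uint16_t sel_make(unsigned index, int use_ldt, unsigned rpl)
{
    return (uint16_t)(((index & (SEL_MAX_ENTRIES - 1)) << 3) |
                      ((use_ldt ? 1u : 0u) << 2) |
                      (rpl & 3u));
}

/* Extract the descriptor table index. */
unsigned sel_index(uint16_t sel)  { return sel >> 3; }

/* Non-zero if the selector refers to the LDT rather than the GDT. */
int sel_is_ldt(uint16_t sel)      { return (sel >> 2) & 1; }

/* Extract the requested privilege level. */
unsigned sel_rpl(uint16_t sel)    { return sel & 3u; }
```

With one LDT entry per LWP, the 13-bit index is what caps a process at 8192 LWPs. Using a fixed GDT selector and changing only the base address behind it at context switch removes the per-LWP descriptor, and with it the cap.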

Floating point

Most of the prework we had done to establish the SSE capabilities in the 32-bit kernel was readily reused for amd64, modulo some restructuring to allow the same code to be compiled appropriately for the two kernel builds. However, late in the development cycle, the guys in our floating point group pointed out that we didn't capture the results of floating point exceptions properly; this was the result of a subtle difference in the way that AMD and Intel processors presented information to the kernel after the floating point exception had been acknowledged. Fortunately they noticed this, and we rewrote the handler to be more robust and to behave the same way on both flavors of hardware.

Continuous Integration vs. One Giant Putback

To try to keep our merging and synchronization efforts under control, we did our best to integrate many of the changes we were making directly into the Solaris 10 gate so that the rest of the Solaris development organization could see them. This wasn't a willy-nilly integration of modified files; instead, each putback was a regression-tested subset of the amd64 project that could stand alone if necessary. Perhaps I should explain this a little further. The Solaris organization has, for many years, tried to adhere to the principle of integrating complete projects, that is, changes that can stand alone, even if the follow-on projects are cancelled, fail, or become too delayed to make the release under development. Some of the code reorganization we needed was done this way, as well as most of the items I described as "prework" in part 1. There were also a bunch of code removal projects we did that helped us avoid the work of porting obsolete subsystems and support for drivers. As an aside, it's interesting to muse on exactly who is responsible for getting rid of drivers for obsolete hardware; it's a very unglamorous task, but one that's highly necessary if you aren't to flounder under an ever more opaque and untestable collection of crufty old source code.

In the end though, we got to the point where the work of creating and testing subsets of our change by hand to create partial projects in Solaris 10 became just too painful for the team to countenance. Instead, we focussed on creating a single delivery of all our change in one coherent whole. Our Michigan-based "army of one," Roger Faulkner, did all of this, as well as most of the rest of the heavy lifting in userland, i.e. creating the 64-bit libc and basic C run-time etc. as well as the threading primitives. Roger really did an amazing job on the project.

Projects of this giant size and scope are always difficult; and everyone gets even more worried when the changes are integrated towards the end of a release. However, we did bring unprecedented levels of testing to the amd64 project, from some incredible, hard working test people. Practically speaking I think we did a reasonable job of getting things right by the end of the release, despite a few last minute scares around our mishandling of process-private LDTs. Fortunately these were only really needed for various forms of Windows emulation, so we disabled them on the 64-bit kernel for the FCS product; this works now in the Solaris development gate, and a backported fix is working its way through the system.

Not to say that there aren't bugs of course ...

Distributed Development

I think it's worth sharing some of the experiences of how the core team worked on this project. First, when we started, Todd Clayton (the engineering lead, who also did the segmentation work, among other things) and I asked to build a mostly-local team. We asked for that because we believed that time-to-market was critical, and we thought that we could go the fastest with all the key contributors in close proximity. However, for a number of reasons, that was not possible, and we ended up instead with a collection of talented people spread over many sites as geographically distributed as New Zealand, Germany, Boston, Michigan, and Colorado as well as a small majority of the team back in California. To help unify the team and make rapid progress, we came up with the idea of periodically getting the team together physically in one place (either offsite in California or Colorado) and spending a focussed week together. We spent the first week occupying a contiguous block of adjacent offices in another building; the problem was that we didn't really change the dynamics of the way people worked with each other. Our accidental discovery came during our first Colorado meeting where we ended up in one (large!) training room for our kick-off meeting. Rather than trudge back across campus where we had reserved office space, we decided to stay put and just start work where we were, and suddenly everything clicked. We stayed in the room for the rest of the week, working closely with each other, immersing ourselves in the project, the team, and what needed to be done. This was very effective, because as well as reinforcing the sense of team during the week away, everyone was able to go back to their home sites and work independently and effectively for many weeks before meeting up again - with only an occasional phone call or email between team-members to synchronize.

Looking Back

I've tried to do a reasonable tour of the amd64 project, driven mostly by what stuck in my memory, and biassed by the work I was involved in to some degree, but obviously much detail has been omitted or completely forgotten. To the people at Sun whose work or contribution I've either not mentioned, foolishly glossed over or forgotten completely, sorry, and thanks for your efforts. To the people at AMD that helped support us, another thank you. To our families and loved ones that put up with "one more make," yet more thanks. This was a lot of work, done faster than any of us thought possible, and 2004 was in truth, well, a bit of a blur.


Wednesday Jun 08, 2005

Solaris 10 on x64 Processors: Part 3 - Kernel

Virtual Memory

One of the most critical components of a 64-bit operating system is its ability to manage large amounts of memory using the additional addressing capabilities of the hardware. The key to those capabilities in Solaris is the HAT (Hardware Address Translation) "layer" of the otherwise generic VM system. Unfortunately, the 32-bit HAT layer for Solaris x86 was a bit long in the tooth and after years of neglect was extremely difficult to understand, let alone extend. So we decided on a ground-up rewrite pretty early on in the project; the eventual benefit of that was being able to use the same source code for both 32-bit and 64-bit mode, and to bring the benefits of the NX (no-execute) bit to both 32-bit and 64-bit kernels seamlessly. Joe Bonasera, who led this work, told me a few weeks ago that he'd expand on this in his own blog here, so I'm not going to describe it any further than that.
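For readers who haven't met the NX bit before, here is a rough sketch of where it lives. In the 64-bit page table entry format that x86 processors use in long mode, and also in 32-bit PAE mode, bit 63 marks a page as non-executable, which is why one body of code can offer it to both kernels. The helper names are invented for illustration and say nothing about how the rewritten HAT is actually structured.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bits of a 64-bit x86 page table entry (long mode and PAE).
 * Macro and function names are illustrative, not the Solaris HAT's. */
#define PTE_PRESENT    (1ULL << 0)
#define PTE_WRITABLE   (1ULL << 1)
#define PTE_USER       (1ULL << 2)
#define PTE_NX         (1ULL << 63)            /* no-execute */
#define PTE_ADDR_MASK  0x000FFFFFFFFFF000ULL   /* physical address bits 12..51 */

/* Build a present 4K page mapping with the requested permissions. */
uint64_t pte_make(uint64_t pa, bool writable, bool user, bool executable)
{
    uint64_t pte = (pa & PTE_ADDR_MASK) | PTE_PRESENT;

    if (writable)
        pte |= PTE_WRITABLE;
    if (user)
        pte |= PTE_USER;
    if (!executable)
        pte |= PTE_NX;      /* the bit is set to forbid execution */
    return pte;
}

/* True if instructions may be fetched through this mapping. */
bool pte_is_executable(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0 && (pte & PTE_NX) == 0;
}
```

The processor only honours the bit once no-execute support has been enabled in the EFER MSR; that step is outside the scope of this sketch.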

Interrupts, DMA, DDI, device drivers

The Solaris DDI (Device Driver Interface) was designed to support writing portable drivers between releases, and between instruction sets, to concentrate bus-dependent details and interfaces in specialized bus-dependent drivers (called nexus drivers), and to minimize the amount of low-level, bus-specific code in regular drivers (called leaf drivers). Most of the work we did on the 64-bit SPARC project back in 1997 was completely reused, and the majority of the work on the x86 DDI implementation was essentially making the code LP64 clean, and fixing some of the more hacky internals of some of the nexus drivers.

The most difficult part of the work was porting the low-level interrupt handlers, which were a monumental mass of confusing assembler. Though I had thought that it would be simplest to port the i386 assembler to amd64 conventions, this turned out to have been a poor decision. Sherry Moore tried to get this done quickly and accurately, but it was a very difficult challenge. We spent many days debugging problems with interrupts that were really rooted in the differences in register allocations between the two instruction set architectures and ABIs, as well as the highly contorted nature of the original code. We spent so much time on it that I eventually became consumed with guilt and rewrote most of it in C, which unsurprisingly turned out to be much easier to debug, and is now probably the best way to understand how the threads-as-interrupts implementation actually works.

The remaining work can be split into two parts. The first was ensuring that the drivers properly described their addressing capabilities, particularly those that hadn't been updated in a while. The second was the usual problem of handling ioctls from 32-bit and 64-bit applications where the two environments use different sizes and alignments for the data types passed across the interface. Again, Solaris already had a bunch of mechanisms for doing this which we simply reused on previously i386-specific drivers to make them usable on amd64 kernels too.
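The ioctl problem is easiest to see with a small example. The sketch below shows a made-up ioctl argument structure as an LP64 kernel sees it, the same structure as a 32-bit caller lays it out, and the widening step a driver has to perform. The structure and function names are hypothetical; a real Solaris driver would decide which form it has been handed using the DDI's data model support (for example ddi_model_convert_from(9F)), which this sketch does not use.

```c
#include <stddef.h>
#include <stdint.h>

/* Native (LP64) form of a hypothetical ioctl argument. */
struct xfer_args {
    void   *buf;     /* 8 bytes on LP64 */
    size_t  len;     /* 8 bytes on LP64 */
    long    flags;   /* 8 bytes on LP64 */
};

/* The same argument as a 32-bit (ILP32) caller lays it out:
 * pointers, size_t and long are all 32 bits wide. */
struct xfer_args32 {
    uint32_t buf;
    uint32_t len;
    int32_t  flags;
};

/* Widen a 32-bit caller's argument into the native form. */
void xfer_args_from_32(const struct xfer_args32 *in, struct xfer_args *out)
{
    out->buf   = (void *)(uintptr_t)in->buf;  /* zero-extend the user address */
    out->len   = in->len;                     /* zero-extend the length */
    out->flags = in->flags;                   /* sign-extend the signed field */
}
```

A driver that copies in the native structure from a 32-bit process without this step reads the wrong number of bytes and misinterprets every field after the first.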

One slight thorn in our side was the difference in alignment constraints for the long long data type. On 32-bit SPARC and 64-bit SPARC, the alignment is 8 bytes for both; however, between i386 and amd64, the alignment changes from 4 bytes to 8 bytes. This seems mildly arcane, until you recall that the alignment of these data types controls the way that basic data structures are laid out between the two ABIs. Data structures containing long long types that were compatible between a 32-bit SPARC application and the 64-bit SPARC kernel now needed special handling for a 32-bit x86 application running on a 64-bit amd64 kernel. The same problem was discovered in a few network routing interfaces, cachefs, priocntl etc. Once we'd debugged a couple of these by hand, Ethan Solomita started a more systematic effort to locate the remaining problems; Mike Shapiro suggested that we build a CTF tool that would help us find the rest more automatically, or at least semi-automatically, which was an excellent idea and helped enormously.
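A small sketch shows how the alignment change moves fields around. The first structure uses the compiler's natural layout; the second uses #pragma pack(4) to reproduce the layout an i386 program would use for the same declaration. The structure names are invented for illustration, and the pragma is simply one way to model the i386 view in a 64-bit build.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: on amd64 (and SPARC) a 64-bit integer is 8-byte
 * aligned, so 4 bytes of padding follow `id`. */
struct rec_native {
    int32_t id;
    int64_t stamp;
};

/* i386 view of the same declaration: a 64-bit integer needs only
 * 4-byte alignment, so there is no padding after `id`. */
#pragma pack(push, 4)
struct rec_i386 {
    int32_t id;
    int64_t stamp;
};
#pragma pack(pop)

/* Copy field by field; a raw memcpy between the two would misplace `stamp`. */
void rec_from_i386(const struct rec_i386 *in, struct rec_native *out)
{
    out->id    = in->id;
    out->stamp = in->stamp;
}
```

On an amd64 build, `stamp` sits at offset 8 in `rec_native` but at offset 4 in `rec_i386`, and the two structures differ in size (16 bytes against 12). On SPARC the 32-bit and 64-bit layouts of such a structure agree, which is why this class of bug only surfaced with the amd64 port.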

MP bringup, EM64-T bringup

Back in 1990, one of the core design goals of the SunOS 5.0 project was to build a multithreaded operating system designed to run on multiprocessor machines. We weren't just doing a simple port of SVR4 to SPARC, we reworked the scheduler, and invested a large amount of effort throughout the kernel, adding fine-grain locking to extract the maximal concurrency from the hardware. Fast forward to 2005, and we're still working on it! The effort to extend scalability remains one of our core activities. However, we didn't have to do a lot of work to make multiprocessor Opteron machines run the 64-bit kernel; apart from porting the locking primitives, the only porting work was around creating a primitive environment around the non-boot processors to switch them into long mode. William Kucharski (of amd64 booter fame) did this work in a week or so, and impressed us all with how quickly and how well this worked from the beginning.

We also wanted to run our 64-bit kernel on Intel's EM64-T CPUs, since we really do want Solaris to run well on non-Sun x86 and x64 systems. As we were doing other work on the system, we had been anticipating what we needed to do from Intel's documentation, so as soon as the hardware was publicly available (unfortunately we weren't able to get them earlier from Intel) Russ Blaine started working on it and had the 64-bit kernel up and running multiuser in about a week. I'm not sure if that's because Intel's specifications are particularly well written, or because Russ's debugging skills were even more excellent that week, or if it's testament to the skills of the Intel engineers at making their processor be so compatible with the Opteron architecture, but we were pretty pleased with the result.

Debugging Infrastructure

Critical aspects of the debugging architecture of Solaris that needed to be ported include the CTF system for embedding dense type information in ELF files, and the corresponding library and toolchain infrastructure that manipulates it, libproc that encapsulates a bunch of /proc operations for the ptools, /proc itself, mdb, and the DTrace infrastructure. I worked on the easy part - /proc - the difficult work was done by Matt Simmons, Eric Schrock and for DTrace, Adam Leventhal and of course Bryan Cantrill.

At the same time as we were starting our bring-up efforts on Opteron, an unrelated project in the kernel group was busy creating a new debugging architecture based on mdb(1). The basic idea was that we wanted to be able to bring most of mdb's capabilities to debugging live kernel problems. The kmdb team observed that our existing kernel debugger, kadb, was always in a state of disrepair, and yet because of its co-residence with the kernel, needed constant tweaking for new platforms. So rather than continue this state of affairs, they came to the idea that it would be simpler if we could assume that the Solaris kernel would provide the basic infrastructure for the debugger.

This has considerable advantages for incremental development, and for the vast majority of kernel developers who aren't working on new platform bringup this is clearly a Good Thing. But it does make porting to a fresh platform or instruction set a little more difficult because kmdb is sophisticated, and doesn't really work until some of the more difficult kernel code has been debugged into existence. The amd64 project had that problem in a particularly extreme form, because the debugger design and interfaces were under development at the same time as we needed them. As a result, the early amd64 kernel bringup work was really done using a simulator (SIMICS), and then by doing printf-style debugging and post-mortem trap-tracing, rather than with kmdb. I still remember debugging init(1M) using the simulator on the last day of one of our offsites in San Francisco, figuring out the bug while riding BART back home.

At this point of course, kmdb works fine and is of great help when debugging more subtle problems. However, knowing what we know now, we should have built a simple bringup-debugger to get us through those early stages where almost nothing worked. Something that could catch and decode exceptions, do stack traces and dump memory would be enough. I'd certainly recommend that path to anyone thinking of porting Solaris to another instruction set architecture; as soon as you get to the point that the kernel starts taking interrupts and doing context switches, things get way too hard for printf-style debugging!
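As a flavour of how small the pieces of such a bringup-debugger can be, here is a sketch of just the "dump memory" part: a routine that formats up to 16 bytes as one line of hex and ASCII. It is written against the standard C library so it can be tried in userland; the name is invented, and a real bringup debugger would format into whatever console output routine is available that early in boot.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format up to 16 bytes starting at `p` as "<address>: xx xx ... |ascii|".
 * `addr` is the address to print for the first byte. Returns what
 * snprintf() returns for the final copy into `out`.
 * Hypothetical helper, for illustration only. */
int bringup_dump_line(char *out, size_t outlen, uintptr_t addr,
                      const unsigned char *p, size_t n)
{
    char buf[96];            /* worst case is 84 characters plus NUL */
    size_t pos, i;

    if (n > 16)
        n = 16;

    pos = (size_t)snprintf(buf, sizeof buf, "%016llx:",
                           (unsigned long long)addr);

    /* Hex column: one " xx" group per byte. */
    for (i = 0; i < n; i++)
        pos += (size_t)snprintf(buf + pos, sizeof buf - pos, " %02x", p[i]);

    /* ASCII column: printable bytes as-is, everything else as '.'. */
    buf[pos++] = ' ';
    buf[pos++] = '|';
    for (i = 0; i < n; i++)
        buf[pos++] = isprint(p[i]) ? (char)p[i] : '.';
    buf[pos++] = '|';
    buf[pos] = '\0';

    return snprintf(out, outlen, "%s", buf);
}
```

Calling it in a loop over a region, 16 bytes at a time, gives the familiar dump display; decoding exceptions and walking stack frames are separate, more architecture-specific pieces.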

System calls Revisited

For 64-bit applications we used the syscall instruction. We used the same register calling conventions as Linux; these are somewhat forced upon you by the combination of the behaviour of the instruction, and the C calling convention, and besides, there is no value in being deliberately different.

Interestingly, the 64-bit system call parameter passing convention is extremely similar to SPARC i.e. the first six system call arguments are passed in registers, with additional arguments passed on the stack. As a result, we based the 64-bit system call handler algorithm for amd64 on the 64-bit handler for sparcv9.
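Here is a sketch of what that convention looks like from the application side, using Linux on x86-64 because that is where it is easiest to try; system call numbers differ between operating systems, so the example takes its number from the Linux headers. The system call number goes in %rax and the six arguments in %rdi, %rsi, %rdx, %r10, %r8 and %r9. The fourth argument uses %r10 rather than the C convention's %rcx because the syscall instruction itself overwrites %rcx (with the return address) and %r11 (with the saved flags). The function names are mine, and on other platforms the sketch just falls back to the C library.

```c
#include <unistd.h>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>

/* Issue a system call with up to six register arguments using the
 * x86-64 `syscall` instruction (Linux register convention). */
long raw_syscall6(long num, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
    long ret;
    register long r10 __asm__("r10") = a4;   /* 4th arg: r10, not rcx */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(num), "D"(a1), "S"(a2), "d"(a3),
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");  /* clobbered by syscall */
    return ret;
}
#endif

/* Fetch the process id without going through the libc wrapper
 * where the raw path is available. */
long raw_getpid(void)
{
#if defined(__linux__) && defined(__x86_64__)
    return raw_syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0);
#else
    return (long)getpid();   /* fallback: no raw path on this platform */
#endif
}
```

The clobbering of %rcx and %r11 is the sense in which the convention is "somewhat forced upon you" by the instruction: any register assignment has to steer around those two.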

The 32-bit system call handlers include the 32-bit variant of the syscall instruction which works sufficiently well when the processor is running the 64-bit kernel to be usable. We also made the sysenter instruction work for Intel CPUs, and of course, the lcall handler; though this is actually handled via a #np trap in C. Our latest version of this assigns a new int trap to 32-bit syscalls which will improve the performance of the various types of system call that don't work well with plain syscall or sysenter.

More Tool Chain Issues

In the earlier "preliminaries" blog, I mentioned our use of gcc; however the Solaris kernel contains its own linker, krtld, based on the same relocation engine used in the userland utility. Fortunately, we had Mike Walker to do the amd64 linker work early on; we had a working linker a week or two ahead of having a linkable kernel.

One more thing

In my first posting on this topic I neglected to mention that there's a really good reference work for people trying to navigate the Solaris kernel - the book by Jim Mauro and Richard McDougall called Solaris Internals: Core Kernel Components; ISBN 0130224960.

Next time, I'll describe more of the userland work that completed the port.


