For the past year, I've been busy on the team that is porting OpenSolaris to run as a fully para-virtualized domain under the Xen hypervisor. The areas I've been concentrating on are changes to virtual and physical memory management and the mechanisms by which OpenSolaris gets loaded and started, aka boot.
Memory Management under Xen
The changes to physical memory management translate what OpenSolaris calls Page Frame Numbers (PFNs, of type pfn_t) into Machine Frame Numbers (MFNs) under Xen before they are used in page tables, in descriptor tables or when programming DMA. Under Xen, addresses derived from PFNs are referred to as pseudo-physical addresses and are used in the kernel with the existing type paddr_t. Note that not all MFN values the kernel sees can be translated into PFNs, so a way to distinguish them was needed. Several routines were added to the kernel to deal with these translation issues.
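As a rough illustration of the translation described above, here is a minimal sketch in C. It is not the OpenSolaris code: the table contents, the `demo_` function names, the `mfn_t` typedef and the `PFN_INVALID` / `MFN_INVALID` sentinels are all assumptions made up for this example. It only shows the idea that every PFN has an MFN, while some MFNs have no PFN and must be reported as such.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in types for this sketch; the real kernel types may differ. */
typedef uint64_t pfn_t;   /* pseudo-physical page frame number */
typedef uint64_t mfn_t;   /* machine frame number (name assumed here) */

/* Hypothetical sentinels meaning "no translation exists". */
#define PFN_INVALID ((pfn_t)-1)
#define MFN_INVALID ((mfn_t)-1)

#define DEMO_NPAGES 4

/* Hypothetical PFN -> MFN table for a tiny four page domain. */
static const mfn_t demo_table[DEMO_NPAGES] = {
    0x1a40, 0x0093, 0x7f02, 0x0455
};

/* Forward translation: index the table, rejecting out-of-range PFNs. */
mfn_t demo_pfn_to_mfn(pfn_t pfn)
{
    return (pfn < DEMO_NPAGES) ? demo_table[pfn] : MFN_INVALID;
}

/*
 * Reverse translation: an MFN that does not belong to this domain has
 * no PFN, so the caller gets PFN_INVALID and can tell the cases apart.
 * A linear search keeps the sketch short; it is not meant to be fast.
 */
pfn_t demo_mfn_to_pfn(mfn_t mfn)
{
    for (pfn_t pfn = 0; pfn < DEMO_NPAGES; pfn++) {
        if (demo_table[pfn] == mfn)
            return pfn;
    }
    return PFN_INVALID;
}

/* Every PFN in the domain must survive a round trip through its MFN. */
void demo_check_round_trip(void)
{
    for (pfn_t pfn = 0; pfn < DEMO_NPAGES; pfn++)
        assert(demo_mfn_to_pfn(demo_pfn_to_mfn(pfn)) == pfn);
}
```

The point of the sentinel is the asymmetry mentioned above: the forward lookup fails only on a bad argument, while the reverse lookup can legitimately fail for a valid MFN that the domain does not own.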
The changes to virtual memory management fall mainly into three areas:
- The HAT must translate PFNs into MFNs when creating page table entries and do the reverse translation, MFN to PFN, when examining page tables.
- Xen requires that page tables that are in active use be mapped read-only. The code to access page tables in the HAT is now aware of when it should be using read-only mappings.
- The algorithm used for TLB shootdowns was changed. Xen provides a single interface to simultaneously change a page table entry and invalidate TLB entries. To reduce the differences between Xen and non-Xen code, the HAT code was restructured.
Some kmdb dcmds have been modified, and new ones introduced, to help manage the difference between PFNs and MFNs during kernel development or crash analysis.
Booting the Kernel
The changes to the way OpenSolaris boots were extensive and complicated. The goal was to make the boot time code used on plain hardware and the code used under Xen as similar as possible. As part of that approach we decided to eliminate the separate boot loader found in /boot/multiboot altogether.
Review of Pre-Xen Boot
As a refresher, the pre-Xen version of OpenSolaris gets into memory in the following way on x64 hardware:
- Grub is used to load the /boot/multiboot program and the boot_archive into memory.
- multiboot then determines which version of unix in the boot_archive to boot based on what sort of hardware (32 or 64 bit) is present and any command line information passed to it in the menu.lst file.
- multiboot builds an initial set of 32 bit page tables to enable it to load the unix executable at the appropriate place in virtual memory as described in the unix ELF file. When booting the 64 bit kernel, an optional 2nd layer is used to automatically double map the 32 bit virtual memory into the top of 64 bit virtual memory.
- The "unix" executable is rather incomplete (i.e. it won't run by itself) but has embedded in it a PT_INTERP section that points to the krtld (kernel runtime loader) module. multiboot combines krtld from the boot_archive with unix as it loads both into memory.
- Execution actually starts in krtld. Additional modules needed by the kernel are loaded by krtld from the boot_archive. Once the kernel is complete enough to run, execution in the kernel finally begins.
- multiboot continues to be used, via the BOP_X*() interfaces, to manage virtual memory and console I/O until the kernel has initialized itself enough to take over.
New Approach to Boot
This seemed like a lot of code to port to Xen, especially since multiboot is effectively just a memory allocator and ELF file decoder. An additional problem was that multiboot was very much a 32 bit program, but on amd64 platforms the Xen domain is always entered in 64 bit mode. A lot of tedious clean up work would be required to make multiboot even compile, let alone work, as a 64 bit program. We decided to make the following changes to the way in which we build unix:
- Link krtld (as well as enough other code) into the unix ELF file at build time. Hence, there is no more PT_INTERP section in unix.
- We rely on grub to load the unix file directly. For amd64 kernels this relies on grub's a.out hack code to load the 64 bit ELF based on an embedded multiboot header.
- The unix ELF file's text and data segments now have explicitly specified physical load addresses which are at 4 Meg and 8 Meg.
- A third loadable segment was added to the unix ELF file. The code in this segment is compiled to load and run at address 12 Meg. The code is always 32 bit executable on hardware, but is native when under Xen. It contains the ELF (or multiboot header) specified entry point. We call this code "dboot", short for Direct Boot.
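For readers unfamiliar with the embedded multiboot header mentioned above, the following C sketch shows the header layout and checksum rule from the Multiboot specification. The magic value, the bit 16 "address fields are valid" flag (the a.out kludge) and the rule that magic, flags and checksum sum to zero come from that specification. Everything else is an assumption for illustration: the `mb_` and `demo_` names are invented here, only the entry address is filled in (using the 12 Meg figure from the text), and the other address fields are left at zero rather than guessed. This is not the header that the unix build actually emits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MB_HEADER_MAGIC      0x1BADB002u  /* from the Multiboot specification */
#define MB_FLAG_AOUT_KLUDGE  0x00010000u  /* bit 16: address fields are valid */

/* Layout of the header fields relevant to the a.out kludge. */
struct mb_header {
    uint32_t magic;
    uint32_t flags;
    uint32_t checksum;
    uint32_t header_addr;
    uint32_t load_addr;
    uint32_t load_end_addr;
    uint32_t bss_end_addr;
    uint32_t entry_addr;
};

/* magic + flags + checksum must equal zero in 32 bit arithmetic. */
uint32_t mb_checksum(uint32_t flags)
{
    return (uint32_t)(0u - (MB_HEADER_MAGIC + flags));
}

/* The check a boot loader would apply before trusting the header. */
int mb_header_valid(const struct mb_header *h)
{
    return h->magic == MB_HEADER_MAGIC &&
        (uint32_t)(h->magic + h->flags + h->checksum) == 0;
}

/*
 * Build a demo header whose entry point is at entry_addr. The remaining
 * address fields stay zero in this sketch; a real header must fill them.
 */
struct mb_header demo_mb_header(uint32_t entry_addr)
{
    struct mb_header h;

    memset(&h, 0, sizeof (h));
    h.magic = MB_HEADER_MAGIC;
    h.flags = MB_FLAG_AOUT_KLUDGE;
    h.checksum = mb_checksum(h.flags);
    h.entry_addr = entry_addr;
    return h;
}
```

Calling `demo_mb_header(12u * 1024 * 1024)` yields a header that passes `mb_header_valid()` and names 12 Meg as the entry point. Because the header carries its own addresses, the loader does not need to interpret the 64 bit ELF headers at all.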
Using this new version of the unix file, the following happens at boot:
- Grub loads the unix file, either as 32 bit ELF or as 64 bit using the a.out hack, and transfers control to the dboot code.
- The dboot code builds page tables that exactly match what the booted kernel (64 bit, 32 bit PAE or 32 bit non-PAE) will use. The page table entries include mappings for the kernel text and data at the correct high virtual memory addresses.
- For non-Xen, dboot activates paging mode.
- The dboot code finally jumps into unix kernel text.
- The entry point in unix, _start, is provided by i86pc/os/fake_bop.c. As the name implies, this is kernel code which emulates the old BOP_*() interfaces that the rest of kernel startup relies on.
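To make the page table step above more concrete, here is a small C sketch of how a high kernel virtual address selects an entry at each level of the page tables. It covers only the 64 bit case with 4 KB pages, where each of the four levels is indexed by 9 bits of the address; the 32 bit PAE and non-PAE layouts differ. The function name `pt_index` and the address `DEMO_KERNEL_TEXT` are assumptions chosen for this example, not values taken from dboot.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12   /* 4 KB pages */
#define BITS_PER_LEVEL  9    /* 512 entries per table in 64 bit mode */
#define TOP_LEVEL       3    /* levels are numbered 0 (lowest) to 3 */

/* Hypothetical high kernel text address, used only for illustration. */
#define DEMO_KERNEL_TEXT 0xFFFFFFFFFB800000ull

/*
 * Return the index into the page table at the given level that is used
 * when translating va. Level 0 is the table holding the final 4 KB
 * mappings and level 3 is the top level table.
 */
unsigned pt_index(uint64_t va, int level)
{
    assert(level >= 0 && level <= TOP_LEVEL);
    return (unsigned)((va >> (PAGE_SHIFT + BITS_PER_LEVEL * level)) & 0x1FF);
}
```

For `DEMO_KERNEL_TEXT` the indices from the top level down are 511, 511, 476 and 0. Boot code that wants the kernel mapped there has to make sure a table exists at each of those slots before it can install the lowest level entries.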
This new boot approach is much smaller and simpler. It also removes many artificial restrictions that startup.c had to deal with, like a 32 bit allocator in the 64 bit kernel. You can read more about these in Nils' blog.
As an additional clean up, the code to manage console I/O and to deal with boot time page table and memory management was made "common" source between the dboot code and what the kernel needed in early startup.
The big benefit for the Xen port was that the dboot code was easy to port to Xen. Since much of the code is now common between dboot and the rest of the kernel, it was designed to work from the beginning in a 64 bit environment.
menu.lst changes
The new way of booting requires you to specify the kernel you want to boot explicitly in your grub menu.lst file. You can see more of what is going on by adding prom_debug=true,kbm_debug=true to your menu.lst file. This is done with the -B option on the kernel line, as in the following example entries:
title 32 bit OpenSolaris with boot time debug output
kernel /platform/i86pc/kernel/unix -B prom_debug=true,kbm_debug=true
module /platform/i86pc/boot_archive
title 64 bit OpenSolaris no debug output, but console I/O to serial port
kernel /platform/i86pc/kernel/amd64/unix -B console=ttya
module /platform/i86pc/boot_archive
Under Xen you include these settings in your domain builder configuration file in the "extra" property.