Maintaining the compatibility of hardware virtualization solutions can be tricky. Below I'll
talk about two bugs that needed fixes in the Xen hypervisor. Both of them have
unfortunate implications for compatibility, but thankfully, the scope was limited.
Shortly after the release of 3.1.1, we discovered that all 64-bit processes in a Solaris domain
would segfault immediately. After much debugging and head-scratching, I eventually found the problem.
On AMD64, 64-bit processes trap into the kernel via the syscall instruction. Under Xen,
this will obviously trap to the hypervisor. Xen then 'bounces' this back to the relevant OS kernel.
On real hardware, syscall leaves the return address in %rcx and the caller's %rflags in %r11. Prior to 3.1.1, Xen
happened to maintain these values correctly, although the layout of the stack is very different
from real hardware. This was broken in the 3.1.1 release: as a result, the %rflags of each
process was corrupted, and the process segfaulted almost immediately. We fixed the bug in Solaris, so that we
would still work with 3.1.1. This was also fixed (restoring the original semantics) in Xen itself in time
for the 3.1.2 release. So there's a small window (early Solaris xVM releases and community versions of Xen 3.1.1)
where we're broken, but thankfully, we caught this pretty early. The lesson to be drawn? Clear documentation of
the hypervisor ABI would have helped, I think.
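To make the register convention concrete, here is a toy model in C of what the syscall and sysret instructions do to %rcx and %r11 on real hardware. The struct and function names are made up for illustration, and the model ignores the reserved-bit masking that real sysret applies to %rflags. It is only a sketch of the hardware behaviour, not of Xen's bounce path or of the Solaris fix.

```c
#include <stdint.h>

/* Hypothetical, much-simplified view of the registers that matter here. */
struct cpu_regs {
    uint64_t rip;
    uint64_t rflags;
    uint64_t rcx;
    uint64_t r11;
};

/*
 * Model of syscall: the return address goes to %rcx, the caller's flags go
 * to %r11, and the flags are then masked (sfmask stands in for the flag-mask
 * MSR) before execution continues at the kernel entry point.
 */
static void model_syscall(struct cpu_regs *r, uint64_t next_rip,
                          uint64_t kernel_entry, uint64_t sfmask)
{
    r->rcx = next_rip;
    r->r11 = r->rflags;
    r->rflags &= ~sfmask;
    r->rip = kernel_entry;
}

/*
 * Model of sysret: the kernel hands back whatever is in %rcx and %r11.
 * If either was clobbered on the way in, userspace resumes with garbage.
 */
static void model_sysret(struct cpu_regs *r)
{
    r->rip = r->rcx;
    r->rflags = r->r11;
}
```

A kernel that trusts %r11 on entry only gets the right flags back if whatever delivered the syscall preserved that register, which is exactly the property that went missing.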
Around the same time, I noticed during code inspection that we were still setting PT_USER in PTE
entries on 64-bit. This had some nasty implications, but first, some background.
On 32-bit x86, Xen protects itself via segmentation: it carves out the top 64MB, and refuses to let any
of the domains load a segment selector that allows read or write access to that part of the address space.
Each domain kernel runs in ring 1, so it can't get around this.
On 64-bit, this hack doesn't work, as AMD64 does not provide full support for segmentation (given what
a legacy technique it is). Instead, and somewhat unfortunately, we have to use page-based permissions
via the VM system. Since page table entries only have a single bit
("user/supervisor") instead of being able to say "ring 1 can read, but ring 3 cannot",
the OS kernel is forced into ring 3. Normally, ring 3 is used for userspace code. So every time we switch
between the OS kernel and userspace, we have to switch page tables entirely - otherwise, the process could
use the kernel page tables to write to kernel address-space.
Unfortunately, this means that we have to flush the TLB every time, which has a nasty performance cost.
To help mitigate this problem, in Xen 3.0.3, an incompatible change was made. Previously, so that the kernel
(running in ring 3, remember) could access its address space, it had to set PT_USER in its kernel
page table entries (PTEs). With 3.0.3, this was changed: now, the hypervisor would automatically do that.
Furthermore, if Xen did see a PTE with PT_USER set, then it assumed this was a userspace
mapping. Thus, it also set PT_GLOBAL, a hardware feature - if this bit is set in a PTE, then the corresponding
TLB entry is not flushed when the page tables are switched.
This meant that switching between userspace and the OS kernel was much faster, as the TLB entries for userspace
were no longer flushed.
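The rule described above can be sketched in a few lines of C. The bit values are the architectural x86 PTE bits (present, user/supervisor, global); the function name adjust_guest_pte is invented for this sketch and the logic is only my reading of the behaviour described here, not a copy of the Xen source.

```c
#include <stdint.h>

#define PT_VALID  0x001ULL  /* bit 0: present */
#define PT_USER   0x004ULL  /* bit 2: user/supervisor */
#define PT_GLOBAL 0x100ULL  /* bit 8: TLB entry survives a page table switch */

/*
 * Hypothetical sketch of the post-3.0.3 rule on 64-bit:
 *  - a valid PTE without PT_USER is taken to be a kernel mapping, and the
 *    hypervisor adds PT_USER so the ring 3 kernel can reach it;
 *  - a valid PTE with PT_USER is taken to be a userspace mapping, and the
 *    hypervisor adds PT_GLOBAL so it is not flushed on the switch.
 */
static uint64_t adjust_guest_pte(uint64_t pte)
{
    if (!(pte & PT_VALID))
        return pte;                 /* not present: nothing to adjust */
    if (pte & PT_USER)
        return pte | PT_GLOBAL;     /* assumed to be a userspace mapping */
    return pte | PT_USER;           /* assumed to be a kernel mapping */
}
```

Note what happens if a kernel sets PT_USER on one of its own mappings: the sketch hands back a global entry for it, just as it would for a genuine userspace page.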
Unfortunately, in our kernel, we missed this change in some crucial places, and until we fixed the bug above,
we were setting PT_USER even on kernel mappings. This was fairly obviously A Bad Thing: if you caught
things just right, a kernel mapping would still be present in the TLB when a user-space program was running,
allowing userspace to read from the kernel! And indeed, some simple testing showed this:
dtrace -qn 'fbt:genunix::entry /arg0 > `kernelbase/ { printf("%p ", arg0); }' | \
xargs -n 1 ~johnlev/bin/i386/readkern | while read ln; do echo $ln::whatis | mdb -k ; done
With the above use of DTrace, MDB, and a little program that attempts to read addresses,
we can see output such as:
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01c8c98438 is ffffff01c8c983e8+50, bufctl ffffff01c8ebf8d0 allocated from as_cache
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
Thankfully, the fix was simple: just stop adding PT_USER to our kernel PTE entries. Or so I thought.
When I did that, I noticed during testing that the userspace mappings weren't getting PT_GLOBAL
set after all (big thanks to MDB's ::vatopfn, which made this easy to see).
Yet more investigation revealed the problem to be in the hypervisor. Unlike certain other popular
OSes used with Xen, we set PTE entries in page tables using atomic compare and swap operations.
Remember that under Xen, page tables are read-only to ensure safety. When an OS kernel tries
to write a PTE, a page fault happens in Xen. Xen recognises the write as an attempt to update
a PTE and emulates it. However, since it hadn't been tested, this emulation path was broken:
it wasn't doing the correct mangling of the PTE entry to set PT_GLOBAL. Once again,
the actual fix was simple.
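For the curious, the shape of the problem looks roughly like this in C11. The guest side installs a PTE with a compare and swap; the hypervisor side emulates that write and has to apply the same flag adjustment it applies on every other update path. All of the names here (pte_cas, emulate_pte_cmpxchg, adjust_guest_pte) are hypothetical, and this is a sketch of the idea rather than the real Solaris or Xen code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PT_VALID  0x001ULL  /* bit 0: present */
#define PT_USER   0x004ULL  /* bit 2: user/supervisor */
#define PT_GLOBAL 0x100ULL  /* bit 8: global */

/* Hypothetical flag adjustment: user mappings become global,
 * kernel mappings get PT_USER added on the guest's behalf. */
static uint64_t adjust_guest_pte(uint64_t pte)
{
    if (!(pte & PT_VALID))
        return pte;
    return (pte & PT_USER) ? (pte | PT_GLOBAL) : (pte | PT_USER);
}

/* Guest side: install a PTE only if the slot still holds the old value. */
static bool pte_cas(_Atomic uint64_t *slot, uint64_t expected, uint64_t wanted)
{
    return atomic_compare_exchange_strong(slot, &expected, wanted);
}

/*
 * Hypervisor side: emulate the guest's compare and swap on a read-only
 * page table page. The call to adjust_guest_pte() is the step that an
 * untested emulation path can easily leave out.
 */
static bool emulate_pte_cmpxchg(_Atomic uint64_t *slot, uint64_t expected,
                                uint64_t wanted)
{
    return atomic_compare_exchange_strong(slot, &expected,
                                          adjust_guest_pte(wanted));
}
```

With the adjustment in place, a userspace PTE written through the emulated path ends up with PT_GLOBAL set, the same as one written through any other interface.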
By the way, that same putback also had the implementation of:
I'd been doing an awful lot of paging through ::threadlist
output recently, and always having to jump through all the (usually
irrelevant) taskq threads was driving me insane. So now you can
just specify ::threadlist -t and get a much, much, shorter list.