OpenSolaris 2009.06 guest domain on a Linux dom0
Just a quick note: you can follow the instructions I provided for the 2008.11 release, with one change. On a 64-bit machine, replace any instances of /boot/x86.microroot with /boot/amd64/x86.microroot. As of 2009.06, the boot archive is split into 32-bit and 64-bit variants. If you get a message like this:
krtld: failed to open '/platform/i86xpv/kernel/amd64/unix'
Then you've probably given the wrong combination of unix and microroot.
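One way to avoid mixing the two up is to keep the pair together in one place. A minimal sketch, using only the paths given above; boot_pair is a made-up helper name, not part of pygrub or virt-install:

```shell
# boot_pair ARCH: print matching pygrub arguments for the 2009.06 ISO.
# Hypothetical helper; the paths are the ones quoted in this entry.
boot_pair() {
    case $1 in
        32|i386|i686)
            echo "--kernel=/platform/i86xpv/kernel/unix --ramdisk=/boot/x86.microroot" ;;
        64|amd64|x86_64)
            echo "--kernel=/platform/i86xpv/kernel/amd64/unix --ramdisk=/boot/amd64/x86.microroot" ;;
        *)
            echo "unknown architecture: $1" >&2
            return 1 ;;
    esac
}
```

Something like boot_pair "$(uname -m)" on the dom0 then gives you a consistent string to drop into the bootloader_args element.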
By the way, in my previous entry, I mentioned we were working on upstreaming our virt-install changes. During the Xen 3.3 work (more on which soon), I updated to the latest versions and got the needed parts into the upstream version. We've still some ZFS changes to push, but if you're running a recent enough version of Xen on Linux, you may well be able to use virt-install and skip all this horrible hacking!
Begone, trailing spaces!
I read my work email with mutt on a Solaris 9 box. For a while it's been irritating me that when you attempt to cut
and paste, it will include trailing spaces on each line instead of stopping at the last "real" character. Some Googling suggested this was because of the lack of the BCE attribute in my xterm-color terminfo definition. Rather than
learn how to compile terminfo entries (I've done it before, but I don't want to learn again!), I took the lazier approach:
copy /usr/share/terminfo/s/screen-256color-bce from a Fedora 8 box into /home/johnlev/.terminfo/s/, and start mutt with TERM and TERMINFO set appropriately. Now I can cut and paste sanely again.
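As a rough sketch of what "set appropriately" means here, assuming the copied entry lives under ~/.terminfo (with_bce is a made-up name):

```shell
# Run any command with the borrowed terminfo entry in effect.
# with_bce is a hypothetical helper; change TERMINFO if you copied
# the entry somewhere other than ~/.terminfo.
with_bce() {
    TERM=screen-256color-bce TERMINFO="$HOME/.terminfo" "$@"
}
```

Then with_bce mutt starts mutt with both variables set, without changing them for the rest of the shell session.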
Tags: Solaris
mutt
OpenSolaris 2008.11 as a dom0
UPDATE: the canonical location for this information is now here - please check there, as it will be updated as necessary, unlike this blog entry.
As a final part to my entries on OpenSolaris and Xen, let's go through the steps needed to turn
OpenSolaris into a dom0. Thanks to Trevor O for documenting this for 2008.05. And as before, expect this process to get much, much, easier soon!
I'm going to do the work in a separate BE, so if we mess up, we shouldn't have broken anything. So, first we create
our BE:
$ pfexec beadm create -a -d xvm xvm
Next, let's install the packages. If you've updated to the development version, a simple pkg install xvm-gui will work, but let's assume
you haven't:
$ pfexec beadm mount xvm /tmp/xvm-be
$ pfexec pkg -R /tmp/xvm-be install SUNWvirt-manager SUNWxvm SUNWvdisk SUNWvncviewer
$ pfexec beadm umount xvm
Now we need to actually reboot into Xen. Unfortunately beadm is not yet aware of how to do this, so we'll have
to hack it up. We're going to run some awk over the menu.lst file which controls grub:
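$ awk '
# Track whether we are inside the "title xvm" entry.
/^title/ { xvm=0; }
/^title.xvm$/ { xvm=1; }
# Drop the graphical splash settings for the xvm entry.
/^(splashimage|foreground|background)/ {
    if (xvm == 1) next
}
# Boot xen.gz as the kernel and turn the Solaris kernel line into a module.
/^kernel\$/ {
    if (xvm == 1) {
        print("kernel\$ /boot/\$ISADIR/xen.gz")
        sub("^kernel\\$", "module$")
        gsub("console=graphics", "console=text")
        gsub("i86pc", "i86xpv")
        # Duplicate the unix path (field 2) on the module$ line.
        $2=$2 " " $2
    }
}
{ print }' /rpool/boot/grub/menu.lst >/var/tmp/menu.lst.xvm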
Let's check that the awk script (my apologies) worked properly:
$ tail /var/tmp/menu.lst.xvm
...
#============ End of LIBBE entry =============
title xvm
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/xvm
kernel$ /boot/$ISADIR/xen.gz
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=text
module$ /platform/i86pc/$ISADIR/boot_archive
#============ End of LIBBE entry =============
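If you'd rather not eyeball the output, the same check can be scripted. A minimal sketch, assuming the entry is titled exactly xvm as above; check_xvm_entry is a made-up helper, not a beadm feature:

```shell
# check_xvm_entry FILE: succeed only if the "title xvm" entry in FILE
# boots xen.gz and loads an i86xpv unix as a module.
# Hypothetical helper; it only inspects the text of the file.
check_xvm_entry() {
    awk '
        /^title/      { in_xvm = 0 }
        /^title xvm$/ { in_xvm = 1 }
        in_xvm && /^kernel\$ .*xen\.gz/      { xen = 1 }
        in_xvm && /^module\$ .*i86xpv.*unix/ { mod = 1 }
        END { exit !(xen && mod) }
    ' "$1"
}
```

For example, check_xvm_entry /var/tmp/menu.lst.xvm && echo OK.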
Looks good. We'll move it into place, and reboot:
$ pfexec cp /rpool/boot/grub/menu.lst /rpool/boot/grub/menu.lst.saved
$ pfexec mv /var/tmp/menu.lst.xvm /rpool/boot/grub/menu.lst
$ pfexec reboot
This should boot you into xVM. If everything worked OK, let's enable the services:
$ svcadm enable -r xvm/virtd ; svcadm enable -r xvm/domains
At this point, you should be able to merrily go ahead and install domains!
Update: Todd Clayton pointed out the issue I've filed here: SUNWxvm needs to depend on SUNWvdisk. I've updated the instructions above with
the workaround.
Update update: Rich Burridge has fixed it. Nice!
Tags: OpenSolaris
Xen
xVM
OpenSolaris 2008.11 guest domain on a Linux dom0
My previous blog post described
how to install OpenSolaris 2008.11 on a Solaris dom0 under Xen. This also works with a Linux dom0. However,
since upstream is missing some of our dom0 fixes, it's unfortunately more complicated. In particular,
we can't use virt-install, as it doesn't know about Solaris ISOs, and later on, we can't use
pygrub to boot from ZFS, since it doesn't know how to read such a filesystem. Bear with me,
this gets a little awkward.
This example is using a 32-bit Fedora 8 installation. Your mileage is likely to vary if you're
using a different version, or another Linux distribution. First some of the configuration parameters
you might want to change:
export name="domu-224"
export iso="/isos/osol-2008.11.iso"
export dompath="/export/guests/2008.11"
export rootdisk="$dompath/root.img"
export unixfile="/platform/i86xpv/kernel/unix"
If you're on 64-bit Linux, set unixfile="/platform/i86xpv/kernel/amd64/unix" instead.
We need to create ourselves a 10GB root disk:
mkdir -p $dompath
dd if=/dev/zero count=1 bs=$((1024 * 1024)) seek=10230 of=$rootdisk
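The seek value controls the final size: dd skips that many 1MiB blocks and then writes one, so the command above yields a sparse file of 10231MiB, a shade under 10GiB. If you want an exact size, a small helper can work out the seek value. This is only a sketch, and sparse_seek is a made-up name:

```shell
# sparse_seek GIB: seek value (in 1MiB blocks) so that writing one more
# 1MiB block produces a file of exactly GIB gibibytes.
# Hypothetical helper for use with bs=$((1024 * 1024)) count=1.
sparse_seek() {
    echo $(( $1 * 1024 - 1 ))
}
```

You'd then use seek=$(sparse_seek 10) in the dd command.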
Now let's use the configuration we need to install OpenSolaris:
cat >/tmp/domain-$name.xml <<EOF
<domain type='xen'>
<name>$name</name>
<bootloader>/usr/bin/pygrub</bootloader>
<bootloader_args>--kernel=$unixfile --ramdisk=/boot/x86.microroot</bootloader_args>
<memory>1048576</memory>
<on_reboot>destroy</on_reboot>
<devices>
<interface type='bridge'>
<source bridge='eth0' />
<!--
If you have a static DHCP setup, add the domain's MAC address here
<mac address='00:16:3e:1b:e8:18' />
-->
</interface>
<disk type='file' device='cdrom'>
<driver name='file' />
<source file='$iso' />
<target dev='xvdc:cdrom' />
</disk>
<disk type='file' device='disk'>
<driver name='file' />
<source file='$rootdisk' />
<target dev='xvda' />
</disk>
</devices>
</domain>
EOF
And start up the domain:
virsh create /tmp/domain-$name.xml
virsh console $name
Now you're dropped into the domain's console, and you can use the VNC trick I described to do the install. Answer the questions, wait for
the domain to DHCP, then:
domid=`virsh domid $name`
ip=`/usr/bin/xenstore-read /local/domain/$domid/ipaddr/0`
port=`/usr/bin/xenstore-read /local/domain/$domid/guest/vnc/port`
/usr/bin/xenstore-read /local/domain/$domid/guest/vnc/passwd
vncviewer $ip:$port
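The "wait for the domain to DHCP" step can be scripted by polling until the key shows up. A minimal sketch, assuming the read command exits non-zero (or prints nothing) while the key is missing; wait_for_key and WAIT_DELAY are made-up names:

```shell
# wait_for_key CMD KEY [TRIES]: run "CMD KEY" until it prints a value,
# retrying up to TRIES times (default 30) with WAIT_DELAY seconds
# (default 2) between attempts. Hypothetical helper.
wait_for_key() {
    cmd=$1
    key=$2
    tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        if val=$($cmd "$key" 2>/dev/null) && [ -n "$val" ]; then
            echo "$val"
            return 0
        fi
        tries=$((tries - 1))
        sleep "${WAIT_DELAY:-2}"
    done
    return 1
}
```

For example: ip=$(wait_for_key /usr/bin/xenstore-read /local/domain/$domid/ipaddr/0).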
At this point, you can proceed with the installation as normal. Before you reboot though, we need to do some
tricks, due to the lack of ZFS support mentioned above. Whilst still in the live CD environment, bring up
a terminal. We need to copy the new kernel and ramdisk to the Linux dom0. We can automate this via a handy script:
#!/bin/bash
dom0=$1
dompath=$2
unixfile=/platform/i86xpv/kernel/$3/unix
root=`pfexec beadm list -H | grep ';N*R;' | cut -d \; -f 1`
mkdir /tmp/root
pfexec beadm mount $root /tmp/root 2>/dev/null
mount=`pfexec beadm list -H $root | cut -d \; -f 4`
pfexec bootadm update-archive -R $mount
scp $mount/$unixfile root@$dom0:$dompath/kernel.$root
scp $mount/platform/i86pc/$3/boot_archive root@$dom0:$dompath/ramdisk.$root
pfexec beadm umount $root 2>/dev/null
echo "Kernel and ramdisk for $root copied to $dom0:$dompath"
echo "Kernel cmdline should be:"
echo "$unixfile -B zfs-bootfs=rpool/ROOT/$root,bootpath=/xpvd/xdf@51712:a"
For example, we might do:
/tmp/update_dom0 linux-dom0 /export/guests/2008.11
or on 64-bit:
/tmp/update_dom0 linux-dom0 /export/guests/2008.11 amd64
Now, you can finish the installation by clicking the reboot button. This will shut down the domain, ready to run.
But first we need the configuration file for running the domain:
cat >$dompath/$name.xml <<EOF
<domain type='xen'>
<name>$name</name>
<os>
<kernel>$dompath/kernel.opensolaris</kernel>
<initrd>$dompath/ramdisk.opensolaris</initrd>
<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris,bootpath=/xpvd/xdf@51712:a</cmdline>
</os>
<memory>1048576</memory>
<devices>
<interface type='bridge'>
<source bridge='eth0'/>
</interface>
<disk type='file' device='disk'>
<driver name='file' />
<source file='$rootdisk' />
<target dev='xvda' />
</disk>
</devices>
</domain>
EOF
virsh define $dompath/$name.xml
virsh start $name
virsh console $name
It should be booting, and you're (finally) done!
Updating the guest
Unfortunately we're not quite out of the woods yet. What we have works fine, but if we update the guest via
pkg image-update, we'll need to make changes in dom0 to boot the new boot environment. The update_dom0
script above will do a fine job of copying out the new kernel and ramdisk for the BE that's active on reboot,
but you also need to edit the config file. For example, if I wanted to boot into the new BE called opensolaris-1, I'd replace these lines:
<kernel>$dompath/kernel.opensolaris</kernel>
<initrd>$dompath/ramdisk.opensolaris</initrd>
<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris,bootpath=/xpvd/xdf@51712:a</cmdline>
with these:
<kernel>$dompath/kernel.opensolaris-1</kernel>
<initrd>$dompath/ramdisk.opensolaris-1</initrd>
<cmdline>$unixfile -B zfs-bootfs=rpool/ROOT/opensolaris-1,bootpath=/xpvd/xdf@51712:a</cmdline>
then re-configure the domain (whilst it's shut down) via virsh undefine $name ; virsh define $dompath/$name.xml.
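The edit itself is mechanical, so it can be done with sed. A minimal sketch, assuming the config file looks like the one above and the BE names contain no characters special to sed; retarget_be is a made-up helper:

```shell
# retarget_be OLD NEW: filter a domain config on stdin, pointing the
# kernel, ramdisk and zfs-bootfs at boot environment NEW instead of OLD.
# Hypothetical helper; it only rewrites the three lines shown above.
retarget_be() {
    sed -e "s|kernel\.$1<|kernel.$2<|" \
        -e "s|ramdisk\.$1<|ramdisk.$2<|" \
        -e "s|rpool/ROOT/$1,|rpool/ROOT/$2,|"
}
```

For example: retarget_be opensolaris opensolaris-1 <$dompath/$name.xml >$dompath/$name.xml.new, then check the result before moving it into place.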
Yes, we're aware this is rather over-complicated. We're trying to find the time to send our changes
to virt-install upstream, as well as ZFS support.
Eventually this will make it much easier to use a Linux dom0.
Tags: OpenSolaris
Xen
Linux
OpenSolaris 2008.11 as a para-virtual Xen guest
UPDATE: the canonical location for this information is now here - please check there, as it will be updated as necessary, unlike this blog entry.
As well as obviously working with VirtualBox, OpenSolaris can also run
as a guest domain under Xen. The installation CD ships with the paravirtual extensions so you can
run it as a fully para-virtualized guest. This provides a significant advantage over fully-virtualized guests,
or even guests with para-virtual drivers like Solaris 10 Update 6. Of course, if you choose to, you can
still run OpenSolaris fully-virtualized (a.k.a. HVM mode), but there's little advantage to doing so.
One slight wrinkle is that Solaris guests don't yet implement the virtual framebuffer that the Xen infrastructure supports. Since OpenSolaris
doesn't yet have a text-mode install, this means that to install such a PV guest, we need a way to bring up
a graphical console.
With 2008.11, this is considerably easier. Presuming we're running a Solaris dom0 (either Nevada or OpenSolaris, of course), let's start an install of 2008.11:
# zfs create rpool/zvol
# zfs create -V 10G rpool/zvol/domu-220-root
# virt-install --nographics --paravirt --ram 1024 --name domu-220 -f /dev/zvol/dsk/rpool/zvol/domu-220-root -l /isos/osol-2008.11.iso
This will drop you into the console for the guest to ask you the two initial questions. Since they're not really important in this circumstance, you can just choose the defaults.
This example presumes that you have a DHCP server set up to give out dynamic addresses. If you only hand out addresses statically based on MAC address, you can also specify the --mac option. As OpenSolaris more-or-less assumes DHCP, it's recommended to set one up.
Now we need a graphical console in order to interact with the OpenSolaris installer. If the guest domain successfully finished booting the live CD, a VNC server should be running. It has recorded the details of this server in XenStore.
This is essentially a name/value config database used for communicating between guest domains and the control domain (dom0). We can start a VNC session as follows:
# domid=`virsh domid domu-220`
# ip=`/usr/lib/xen/bin/xenstore-read /local/domain/$domid/ipaddr/0`
# port=`/usr/lib/xen/bin/xenstore-read /local/domain/$domid/guest/vnc/port`
# /usr/lib/xen/bin/xenstore-read /local/domain/$domid/guest/vnc/passwd
DJP9tYDZ
# vncviewer $ip:$port
At the VNC password prompt, enter the given password, and this should bring up a VNC session, and you can merrily install away.
Implementation
The live CD runs a transient SMF service system/xvm/vnc-config. If it finds itself running on a live CD,
it will generate a random VNC password, configure application/x11/x11-server to start Xvnc, and
write the values above to XenStore. When application/graphical-login/gdm starts, it will read these service
properties and start up the VNC server. The service system/xvm/ipagent tracks the IPv4 address given to the first running interface and writes it to XenStore.
By default, the VNC server is configured not to run post-installation due to security concerns. This can be changed though, as follows:
# svccfg -s x11-server
setprop options/xvm_vnc = "true"
Please remember that VNC is not secure. Since you need elevated privileges to read the VNC password from XenStore,
that's sufficiently protected, as long as you always run the VNC viewer locally on the dom0, or via SSH tunnelling or
some other secure method.
Note that this works even with a Linux dom0, although you can't yet use virt-install, as the upstream version
doesn't yet "know about" OpenSolaris (more on this later).
Tags: OpenSolaris
Xen
xVM
Building OpenSolaris ISOs
I've recently been figuring out how to build OpenSolaris ISOs (from SVR4 packages). It's surprisingly easy,
but at least the IPS part is not well documented, so I thought I'd write up how I do it.
There are three main things you're most likely to want to do: build IPS itself, populate an IPS repository,
and build an install ISO based on that repository. First, you'll want a copy of the IPS gate:
hg clone ssh://anon@hg.opensolaris.org/hg/pkg/gate pkg-gate
For some of my testing, I wanted to test some changed packages. So I mounted a Nevada DVD on /mnt/,
then, using mount -F lofs, replaced some of the package directories with ones I'd built previously
with my fixes. This effectively gave me a full Nevada DVD with my fixes in, avoiding the horrors of making one.
I then cd pkg-gate, and run something like this:
$ cat build-ips
export WS=$1
export REPO=http://localhost:$2
unset http_proxy || true
set -e
echo "START `date`"
cd $WS/src
make install packages
cd $WS/src/util/distro-import
export NONWOS_PKGS="/net/paradise/export/integrate_dock/nv/nv_osol0811/all \
/net/paradise/export/integrate_dock/nv/nv_osol0811/i386"
export WOS_PKGS="/mnt/Solaris_11/Product/"
export PYTHONPATH=$WS/proto/root_i386/usr/lib/python2.4/vendor-packages/
export PATH=$WS/proto/root_i386/usr/bin/:$WS/proto/root_i386/usr/lib:$PATH
nohup pkg.depotd -p $2 -d /var/tmp/$USER/repo &
sleep 5
make -e 99/slim_import
echo "END `date`"
$ ./build-ips `pwd` 10023
In fact, since I was running on an older version of Nevada (89, precisely), I had to stop after the make install
and change src/pyOpenSSL-0.7/setup.py to pick up OpenSSL from /usr/sfw:
IncludeDirs = [ '/usr/sfw/include' ]
LibraryDirs = [ '/usr/sfw/lib' ]
(If /usr/bin/openssl exists, you don't need this.) So, after this step, which builds the IPS tools (and an SVR4 package for them), the script moves into the "distro-import" directory. This is really a completely different thing from IPS itself, but for convenience it lives in the IPS gate. Its job is to take a set of SVR4 packages (that is, the old Solaris package format) and upload them to a given IPS network repository: in this case, http://localhost:10023.
So, making sure we use the IPS tools we just built, we point a couple of environment variables to the package locations. "WOS" stands for, charmingly, "Wad Of Stuff", and in this context means "packages delivered to Solaris Nevada". There are also some extra packages used for OpenSolaris, listed here as NONWOS_PKGS. I'm not sure where external people can get them from, though.
The core of distro-import is the solaris.py script, which does the job of transliterating from SVR4-speak into pkgsend(1)-speak. As well
as a straight translation, though, a small number of customisations to the existing packages are also made to account
for OpenSolaris differences. These are done by dropping the original file contents and picking them up from an ad-hoc SUNWfixes SVR4 package built in the same directory.
Of course, each build has its differences, so they're separated out into sub-directories. As you can see above, to run the import, we make a 99/slim_import target. This basically runs solaris.py for every package listed in the file 99/slim_cluster. This list is more or less what makes up the contents of the live CD. Also of interest is the redist_import target, which builds every package available (see http://pkg.opensolaris.org). By the way, watch out for distro-import/README: it's not quite up to date.
Another super useful environment variable is JUST_THESE_PKGS: this will only build and import the packages listed. Very useful if you're tweaking a package and don't want to re-import the whole cluster!
At the end of this build, we now have a populated IPS repository living at http://localhost:10023. If we already have an installed OpenSolaris, we could easily use this to install individual new packages, or do an image update (where ipshost is the remote name of your build machine):
# pkg set-authority -P -O http://ipshost:10023 myipsrepo
# pkg install SUNWmynewpackage # or...
# pkg image-update
If we want to test installer or live CD changes, though, we'll need to build an ISO. I did this for the first time today, and it's fall-over easy. First you need an OpenSolaris build machine, and type:
# pkg install SUNWdistro-const
Modify slim_cd.xml to point to your repository, as described here. It's not immediately obvious, but you can specify your URL as http://ipshost:10023 if you're not using the standard port, like me. Then:
# distro_const build ./slim_cd.xml
And that's it: you'll have a fully-working OpenSolaris ISO in /export/dc_output/ (I understand it's a different location after build 99, though). I never knew building an install ISO could be so simple!
Tags: OpenSolaris
Direct mounting of files
As part of my work on Least
Privilege for xVM, I worked on implementing direct file mounts. The idea is that we'd modify the Solaris support in virt-install
to use these direct mounts, instead of the more laborious older method required.
A long-standing peeve of Solaris users is that in order to mount a file system image (in particular a DVD ISO image),
it's a two-step process. This was less than ideal, as many other UNIX OS's made it simple to do: you'd just pass
the file to the mount command, along with a special option or two, and it mounts it directly.
With my putback of 6384817 Need persistent lofi based mounts and direct mount(1m) support for lofi, this is now possible (in fact, a little easier) in Solaris.
Instead of doing this:
# device=`lofiadm -a /export/solarisdvd.iso`
# mount -F hsfs $device /mnt/iso
...
# umount /mnt/iso
# lofiadm -d /export/solarisdvd.iso
it's just:
# mount -F hsfs /export/solarisdvd.iso /mnt/iso
...
# umount /export/solarisdvd.iso
Under the hood, this still uses the lofi driver, it's just automatically used at mount and unmount time. There's no need for an -o loop option as on Linux.
This is supported for most of the file systems you might need in Solaris, namely ufs, hsfs, udfs, and pcfs. This
doesn't work for ZFS, as this has its own method for mounting file system images.
I was asked a couple of times why I implemented this in the kernel at all (which meant requiring file system support via vfs_get_lofi()). This was primarily to allow non-root users to access file mounts; in fact this was
the primary motivation for implementing this feature from the point of view of the xVM work. In particular, if you have PRIV_SYS_MOUNT, you can do direct file mounts as well as normal mounts. This is important for virt-install, which we want to avoid running as root, but which needs to be able to mount DVDs to grab the boot information when installing a guest.
As always, there's more work that could be done. mount is not smart about relative paths, and should
notice (and correct) early if you try to pass a relative path as the first argument. Solaris has always (rather annoyingly) required an -F option to identify what kind of file system you're mounting, which is particularly pedantic of it. Equally, the lofi driver doesn't comprehend fdisk or VTOC layouts.
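Until mount handles relative paths itself, a wrapper can make the path absolute first. A minimal sketch; abs_path is a made-up helper, and it doesn't resolve symlinks or ".." components:

```shell
# abs_path FILE: print FILE unchanged if it is already absolute,
# otherwise prefix it with the current directory.
# Hypothetical helper for use in front of mount.
abs_path() {
    case $1 in
        /*) echo "$1" ;;
        *)  echo "$PWD/$1" ;;
    esac
}
```

So from /export you could run mount -F hsfs "$(abs_path solarisdvd.iso)" /mnt/iso.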
Tags: Xen
OpenSolaris
xVM
blogs.sun.com RSS feed
For reasons beyond my ken, blogs.sun.com doesn't actually list an RSS
feed anywhere I can find, but it's at http://blogs.sun.com/main/feed/entries/rss.
Update: it's now grown an RSS icon. Thanks!
xVM Under The Hood: seg_mf
An occasional series wherein I'll describe a part of the xVM implementation. Today,
I'll be talking about seg_mf. You may want to read through
my explanation of
live migration and MMU virtualization first.
The control domain (dom0) often needs access to memory pages that belong to a running
guest domain. The most obvious example of this is in constructing the domain during
boot, but it's also needed for mapping the shared virtual guest console page, generating
guest domain core dumps, etc.
This is implemented via the privcmd driver. Each process that needs to map some
area of a guest domain's memory maps a range of anonymous virtual memory. The process then sends a
request to the driver to map in a given range or set of machine frames into the given virtual address
range. The two requests (IOCTL_PRIVCMD_MMAP and IOCTL_PRIVCMD_MMAP_BATCH) are more or less the same,
although the latter allows the user to track MFNs that couldn't be mapped (see below).
Both ioctl()s hook into the seg_mf code. This is a normal Solaris segment driver
(see Solaris Internals) with a hook that's used to store the arrays of MFN
values that each VA range is to be backed by. This segment driver is a little unusual though: it
does not support demand faulting. That is, every page in the segment is faulted in (and locked in)
at the time of the ioctl(). This is needed to support the error-reporting interface
described below, but it also helps simplify the driver significantly.
To fault the range, we go through each page-size chunk in the mapping. We need to establish a
mapping from the virtual address of the chunk to the actual machine frame holding the page owned
by the guest domain. This happens in
segmf_faultpage(). The HAT isn't used to our strange request, so we load a temporary
mapping at the given VA, and replace that with a mapping to the real underlying MFN via
HYPERVISOR_update_va_mapping_otherdomain().
Normally, the MFNs given via the ioctl() should be mappable. One exception is
HVM live migration. This was implemented, somewhat confusingly, to use the same interfaces
but pass GMFNs not MFNs. In particular, for HVM guests, a guest MFN (what a guest thinks
is a real machine frame number) is actually a pseudo-physical frame number. As a result,
due to ballooning, or PV drivers, etc., this GMFN may not have a real MFN backing it, so the
attempt to map it will fail. We mark the MFN as failed in the outgoing array of IOCTL_PRIVCMD_MMAP_BATCH
and let the client deal with it. This is generally OK, since the iterative nature of live migration
means we can still get to all the pages we need.
One nice enhancement would be to extend pmap to recognise such mappings. In particular
qemu-dm has a bunch of such mappings. It'd be relatively easy to mark such mappings as
coming from seg_mf. Extra marks for listing the MFN ranges too, though that's a little
harder :)
Tags: Xen
OpenSolaris
xVM
DTrace on xenstored
DTrace support for xenstored
has just been merged in the upstream community version of Xen. Why is it useful?
The daemon xenstored runs in dom0 userspace, and implements a simple 'store' of configuration information.
This store is used for storing parameters used by running guest domains, and interacts with dom0,
guest domains, qemu, xend, and others. These interactions can easily get pretty complicated as a result,
and visualizing how requests and responses are connected can be non-obvious.
The existing community solution was a 'trace' option to xenstored: you could restart the daemon and it would
record every operation performed. This worked reasonably well, but was very awkward: restarting xenstored
means a reboot of dom0 at this point in time. By the time you've set up tracing, you might not be able to reproduce
whatever you're looking at any more. Besides, it's extremely inconvenient.
It was obvious that we needed to make this dynamic, and DTrace USDT (Userspace Statically Defined Tracing) was the
obvious choice. The patch adds a couple of simple probes for tracking requests and responses; as usual, they're activated
dynamically, so have (next to) zero impact when they're not used. On top of these probes I wrote a simple
script called xenstore-snoop. Here's a couple of extracts of the output I get when I start a guest domain:
# /usr/lib/xen/bin/xenstore-snoop
DOM PID TX OP
0 100313 0 XS_GET_DOMAIN_PATH: 6 -> /local/domain/6
0 100313 0 XS_TRANSACTION_START: -> 930
0 100313 930 XS_RM: /local/domain/6 -> OK
0 100313 930 XS_MKDIR: /local/domain/6 -> OK
...
6 0 0 XS_READ: /local/domain/0/backend/vbd/6/0/state -> 4
6 0 0 XS_READ: device/vbd/0/state -> 3
0 0 - XS_WATCH_EVENT: /local/domain/6/device/vbd/0/state FFFFFF0177B8F048
6 0 - XS_WATCH_EVENT: device/vbd/0/state FFFFFF00C8A3A550
6 0 0 XS_WRITE: device/vbd/0/state 4 -> OK
0 0 0 XS_READ: /local/domain/6/device/vbd/0/state -> 4
6 0 0 XS_READ: /local/domain/0/backend/vbd/6/0/feature-barrier -> 1
6 0 0 XS_READ: /local/domain/0/backend/vbd/6/0/sectors -> 16777216
6 0 0 XS_READ: /local/domain/0/backend/vbd/6/0/info -> 0
6 0 0 XS_READ: device/vbd/0/device-type -> disk
6 0 0 XS_WATCH: cpu FFFFFFFFFBC2BE80 -> OK
6 0 - XS_WATCH_EVENT: cpu FFFFFFFFFBC2BE80
6 0 0 XS_READ: device/vif/0/state -> 1
6 0 0 [ERROR] XS_READ: device/vif/0/type -> ENOENT
...
This makes the interactions immediately obvious. We can observe the Xen domain that's doing the request, the PID
of the process (this only applies to dom0 control tools), the transaction ID, and the actual operations performed.
This has already proven of use in several investigations.
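Since the output is plain columns, it's easy to post-process. A minimal sketch that summarises which operations a given domain issued, assuming the column layout shown above (with failed requests carrying an extra [ERROR] field); snoop_ops is a made-up helper and not part of the patch:

```shell
# snoop_ops DOM: read xenstore-snoop output on stdin and print a count of
# each operation type issued by domain DOM, most frequent first.
# Hypothetical helper; it relies only on the column layout shown above.
snoop_ops() {
    awk -v dom="$1" '
        $1 == dom {
            # Failed requests have "[ERROR]" before the operation name.
            op = ($4 == "[ERROR]") ? $5 : $4
            sub(/:$/, "", op)
            count[op]++
        }
        END { for (op in count) print count[op], op }
    ' | sort -rn
}
```

For example, you could save some snoop output to a file and run snoop_ops 6 <snoop.out.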
Of course this being DTrace, this is only part of the story. We can use these probes to correlate system behaviour:
for example, xenstored transactions are currently rather heavyweight, as they involve copying a large file;
these probes can help demonstrate this. Using Python's DTrace support, we can look at which stack traces in xend correspond to which requests to the store; and so on.
This feature, whilst relatively minor, is part of an ongoing plan to improve the observability and RAS of Xen and the solutions Sun are building on top of it. It's very important to us to bring Solaris's excellent observability features to the virtualization space: you've seen the work with zones in this area, and you can expect a lot more improvements
for the Xen case too.
IRC
I meant to say: after my previous post, I resurrected #opensolaris-dev: if you'd like to talk about OpenSolaris development in a non-hostile environment,
please join!
Tags: Xen
OpenSolaris
xVM
DTrace
#opensolaris
When OpenSolaris got started, #solaris was a channel filled with pointless rants about GNU-this
and Linux-that. Besides complete wrong-headedness, it was a total waste of time and extremely
hostile to new people. #opensolaris, in contrast, was actually pretty nice (for IRC!) - sure,
the usual pointless discussions but it certainly wasn't hateful.
Recently I'm sad to say #opensolaris has become a really hostile, unpleasant place.
I've seen new people arrive and be bullied by a
small number of poisonous
people until they went away (nice own goal, people!). So if anyone's
looking for me for xVM stuff or whatever, I'll be in #onnv-scm or #solaris-xen as usual. And if you
do so, please try to keep a civil tongue in your head - it's not hard.
Xen compatibility with Solaris
Maintaining the compatibility of hardware virtualization solutions can be tricky. Below I'll
talk about two bugs that needed fixes in the Xen hypervisor. Both of them have
unfortunate implications for compatibility, but thankfully, the scope was limited.
Shortly after the release of 3.1.1, we discovered that all 64-bit processes in a Solaris domain
would segfault immediately. After much debugging and head-scratching, I eventually found the problem.
On AMD64, 64-bit processes trap into the kernel via the syscall instruction. Under Xen,
this will obviously trap to the hypervisor. Xen then 'bounces' this back to the relevant OS kernel.
On real hardware, %rcx and %r11 have specific meanings. Prior to 3.1.1, Xen
happened to maintain these values correctly, although the layout of the stack is very different
from real hardware. This was broken in the 3.1.1 release: as a result, the %rflags of each
process was corrupted, and the process segfaulted almost immediately. We fixed the bug in Solaris, so we would still work with 3.1.1. This was also fixed (restoring the
original semantics) in Xen itself in time for the 3.1.2 release. So there's a small window (early Solaris xVM releases and community versions of Xen 3.1.1)
where we're broken, but thankfully, we caught this pretty early. The lesson to be drawn? Clear documentation of
the hypervisor ABI would have helped, I think.
Around the same time, I noticed during code inspection that we were still setting PT_USER in PTE
entries on 64-bit. This had some nasty implications, but first, some background.
On 32-bit x86, Xen protects itself via segmentation: it carves out the top 64MB, and refuses to let any
of the domains load a segment selector that allows read or write access to that part of the address space.
Each domain kernel runs in ring 1 so can't get around this.
On 64-bit, this hack doesn't work, as AMD64 does not provide full support for segmentation (given what
a legacy technique it is). Instead, and somewhat unfortunately, we have to use page-based permissions
via the VM system. Since page table entries only have a single bit
("user/supervisor") instead of being able to say "ring 1 can read, but ring 3 cannot",
the OS kernel is forced into ring 3. Normally, ring 3 is used for userspace code. So every time we switch
between the OS kernel and userspace, we have to switch page tables entirely - otherwise, the process could
use the kernel page tables to write to kernel address-space.
Unfortunately, this means that we have to flush the TLB every time, which has a nasty performance cost.
To help mitigate this problem, in Xen 3.0.3, an incompatible change was made. Previously, so that the kernel
(running in ring 3, remember) could access its address space, it had to set PT_USER in its kernel
page table entries (PTEs). With 3.0.3, this was changed: now, the hypervisor would automatically do that.
Furthermore, if Xen did see a PTE with PT_USER set, then it assumed this was a userspace
mapping. Thus, it also set PT_GLOBAL, a hardware feature - if such a bit is set, then a corresponding
TLB entry is not flushed.
This meant that switching between userspace and the OS kernel was much faster, as the TLB entries for userspace
were no longer flushed.
Unfortunately, in our kernel, we missed this change in some crucial places, and until we fixed the bug above,
we were setting PT_USER even on kernel mappings. This was fairly obviously A Bad Thing: if you caught
things just right, a kernel mapping would still be present in the TLB when a user-space program was running,
allowing userspace to read from the kernel! And indeed, some simple testing showed this:
dtrace -qn 'fbt:genunix::entry /arg0 > `kernelbase/ { printf("%p ", arg0); }' | \
xargs -n 1 ~johnlev/bin/i386/readkern | while read ln; do echo $ln::whatis | mdb -k ; done
With the above use of DTrace, MDB, and a little program that attempts to read addresses,
we can see output such as:
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01c8c98438 is ffffff01c8c983e8+50, bufctl ffffff01c8ebf8d0 allocated from as_cache
ffffff01d6f09c00 is ffffff01d6f09c00+0, allocated as a thread structure
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
ffffff01d44d7e80 is ffffff01d44d7e80+0, bufctl ffffff01d3a2b388 allocated from kmem_alloc_40
Thankfully, the fix was simple: just stop adding PT_USER to our kernel PTE entries. Or so I thought.
When I did that, I noticed during testing that the userspace mappings weren't getting PT_GLOBAL
set after all (big thanks to MDB's ::vatopfn, which made this easy to see).
Yet more investigation revealed the problem to be in the hypervisor. Unlike certain other popular
OSes used with Xen, we set PTE entries in page tables using atomic compare and swap operations.
Remember that under Xen, page tables are read-only to ensure safety. When an OS kernel tries
to write a PTE, a page fault happens in Xen. Xen recognises the write as an attempt to update
a PTE and emulates it. However, since it hadn't been tested, this emulation path was broken:
it wasn't doing the correct mangling of the PTE entry to set PT_GLOBAL. Once again,
the actual fix was simple.
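As a rough illustration of why the emulation path mattered, the toy model below gives a page table two update paths, a plain write and a compare-and-swap, and routes both through the same flag adjustment. The class and method names are invented for this sketch, and comparing against the stored value is a simplification. The real emulation lives inside Xen and is more involved.

```python
PT_VALID = 0x001
PT_USER = 0x004
PT_GLOBAL = 0x100


def adjust(pte):
    """Mark valid userspace mappings global (simplified; hypothetical)."""
    if pte & PT_VALID and pte & PT_USER:
        return pte | PT_GLOBAL
    return pte


class ToyPageTable:
    """A read-only page table whose writes are emulated on the guest's behalf."""

    def __init__(self, nentries=512):
        self.entries = [0] * nentries

    def emulate_write(self, index, new):
        # Plain store path.
        self.entries[index] = adjust(new)

    def emulate_cmpxchg(self, index, expected, new):
        # Compare-and-swap path: it must apply the same adjustment as the
        # plain store, otherwise the two paths leave different PTEs behind.
        current = self.entries[index]
        if current == expected:
            self.entries[index] = adjust(new)
        return current  # like cmpxchg, hand back the value that was found
```

The design point is simply that both paths share adjust(). Per the description above, the bug was equivalent to the compare-and-swap path skipping that step.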
By the way, that same putback also had the implementation of:
I'd been doing an awful lot of paging through ::threadlist
output recently, and always having to jump through all the (usually
irrelevant) taskq threads was driving me insane. So now you can
just specify ::threadlist -t and get a much, much, shorter list.
Tags: Xen
OpenSolaris
xVM
OpenSolaris xVM now available in SX:CE
Build 75 of Solaris Express Community Edition is now out,
and it includes our bits. So go
ahead, install build 75, select the xVM entry in grub and play around! We're still working on updating the documentation
on our community page; in the meantime, you have manpages - start at xVM(5) (and note that the forthcoming build 76 has much improved versions of those docs).
You might be wondering if your machine is capable of running Windows or other operating systems under HVM. Joe Bonasera has a simple program
you can run that will tell you. Alternatively, if you're already running with our bits, running 'virt-install' will
tell you - if it asks you about creating a fully-virtualized domain, then it should work, and you can end up with a
desktop like Russell Blaine's.
Nils, meanwhile, describes how we've improved the RAS of the hypervisor by integrating it with Solaris crash dumps
here. This feature has saved our lives numerous times during development, as those of us who've done the "hex dump" debugging thing know very well.
Of course, we're not done yet - we have bugs to fix and rough edges to smooth out, and we have significant features to implement. One of the major items we're working on in the near future is the upgrade to Xen 3.1.1 (or possibly 3.1.2, depending on timelines!). This will give us the ability to do live migration of HVM domains, along with a host of other features and improvements.
Tags: Xen
OpenSolaris
xVM
Automatic start/stop of Xen domains
After answering a query, I
said I'd write a blog entry describing what changes we've made to support clean shutdown and start of Xen domains.
Bernd refers to an older method of auto-starting Xen domains used on Linux. In fact, this method has been replaced
with the configuration parameters on_xend_start and on_xend_stop. Setting these can ensure
that a Xen domain is cleanly shut down when the host (dom0) is shut down, and started automatically as needed. For
somewhat obvious reasons, we'd like to have the same semantics as used with zones, if not quite the same implementation
(yet, at least).
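For reference, a domain configuration fragment using these two parameters might look like the following. Classic xm domain configuration files use Python syntax. The value names shown are the ones I understand the stock example configurations to document, so treat them as an assumption and check them against your Xen version.

```python
# Fragment of an xm-style domain configuration file (Python syntax).
# Assumed values: on_xend_start takes "ignore" or "start";
# on_xend_stop takes "ignore", "shutdown" or "suspend".
on_xend_start = "start"     # start this domain when xend starts
on_xend_stop = "shutdown"   # ask the guest to shut down cleanly when xend stops
```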
When I started looking at this, I realised that the community solution had some problems:
Clean shutdown wasn't the default
It seems obvious that by default I'd like my operating systems to shut down cleanly. Only in unusual circumstances would
I be happy with an OS being unceremoniously destroyed. We modified our Xen gate to default to on_xend_stop=shutdown.
Suspend on shutdown was dangerous
It is possible to specify on_xend_stop=suspend; this will save the running state to an image file and then destroy the domain (like xm save). However, there is no corresponding on_xend_start setting, nor
any logic to ensure that the values match. This is both apparently useless and even dangerous, since starting a new
domain but with old file-system state from a suspended domain could be problematic. We've disabled this functionality.
Actions are tied into xend
This was the biggest problem for us: as modelled, if somebody stops xend, then all the domains would be shut down. Similarly, if xend restarts for whatever reason (say, a hardware error), it would start domains again.
We've modified this on Solaris. Instead of xend operating on these values, we introduce a new SMF service,
system/xctl/domains, that auto-starts/stops domains as necessary.
This service is pretty similar to system/zones. We've set up the dependencies such that a restart
of the Xen daemons won't cause any running domains to be restarted. For this to work properly within the SMF
framework, we also had to modify xend to wait for all domains to finish their state transitions.
You can find our changes here. And yes,
we still need to take system/xctl/domains to PSARC.
Clean shutdown implementation
You might be wondering how the dom0 even asks the guest domains to shut down cleanly. This is done via a xenstore
entry, control/shutdown. The control tools write a string into this entry, which is being "watched" by
the domain. The kernel then reads the value and responds appropriately (xen_shutdown()),
triggering a user-space script via the sysevent framework. If nothing happens for a while, it's possible that the script couldn't run
for whatever reason. In that case, we time-out and force a "dirty" shutdown from within the kernel.
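The request-then-timeout flow can be sketched in a few lines of Python. This is only a model: a plain dict stands in for xenstore, the function names are invented, and the real logic lives in the guest kernel and the sysevent framework. The "poweroff" string is one of the values the Xen tools write to control/shutdown, as far as I know.

```python
import threading

SHUTDOWN_KEY = "control/shutdown"


def request_shutdown(store, reason="poweroff"):
    """Control-tools side: write the request into the (toy) store."""
    store[SHUTDOWN_KEY] = reason


def handle_shutdown_request(store, run_script, timeout_s):
    """Guest side: react to the watched entry (hypothetical helper).

    Returns "idle" if nothing was requested, "clean" if the userspace
    script finished in time, and "forced" if the timeout expired first.
    """
    reason = store.get(SHUTDOWN_KEY, "")
    if not reason:
        return "idle"

    finished = threading.Event()

    def worker():
        run_script(reason)   # stands in for the userspace shutdown script
        finished.set()

    threading.Thread(target=worker, daemon=True).start()

    if finished.wait(timeout_s):
        return "clean"
    return "forced"          # script never completed: "dirty" shutdown
```

A caller would invoke request_shutdown(store) and then let the guest-side handler run with whatever timeout is considered reasonable. A script that hangs or fails never sets the event, so the handler falls through to the forced path.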
Tags: Xen OpenSolaris
Solaris Xen update
After an undesirably long time, I'm happy to say that another drop of Solaris on Xen is available here.
Sources and other sundry parts are here.
Documentation can be found at our community site, and Chris Beal describes how to
get started with the new bits.
As you might expect, there's been a massive amount of change
since the last OpenSolaris release.
This time round, we are based on Xen 3.0.4 and build 66 of Nevada. As always, we'd love to hear about
your experiences if you try it out, either on the mailing list or the IRC channel.
In many ways, the most significant change is the huge effort we've put in to stabilize our codebase; a
significant number of potential hangs, crashes, and core dumps have been resolved, and we hope we're
converging on a good-quality release. We've started looking seriously at performance issues, and filling
in the implementation gaps. Since the last drop, notable improvements include:
- PAE support
- By default, we now use PAE mode on 32-bit, aiding compatibility with other domain 0 implementations; we can also
boot under either PAE or non-PAE, if the Xen version has 'bi-modal' support. This has probably been the
most-requested change missing from our last release.
- HVM support
- If you have the right CPU, you can now run fully-virtualized domains such as Windows using a Solaris dom0! Whilst
more work is needed here, this does seem to work pretty well already. Mark Johnson has some useful tips on using HVM domains.
- New management tools
- We have integrated the virt- suite of management tools. virt-manager provides
a simple GUI for controlling guest domains on a single host. virt-install and virsh are simple CLIs
for installing and managing guest domains respectively. Note that parts of these tools are pre-alpha, and we still
have a significant amount of work to do on them. Nonetheless, we appreciate any comments...
- PV framebuffer
- Solaris dom0 now supports the SDL-based paravirt framebuffer backend, which can be used with domUs that have PV framebuffer support.
- Virtual NIC support
- The Ethernet bridge used in the previous release has been replaced with virtual NICs from the
Crossbow project. This enables future work
around smart NICs, resource controls, and more.
- Simplified Solaris guest domain install
- It's now easy to install a new Solaris guest domain using the DVD ISO. The temporary tool in the last release,
vbdcfg, has disappeared now as a result. William Kucharski has a walk-through.
- Better SMF usage
- Several of the xend configuration properties are now controlled using the SMF framework.
- Managed domain support
- We now support xend-managed domain configurations instead of using .py configuration files. Certain
parts of this don't work too well yet (unfortunately all versions of Xen have similar problems), but we are
plugging the gaps here one by one.
- Memory ballooning support
- Otherwise known as support for dynamic xm mem-set, this allows much greater flexibility in partitioning
the physical memory on a host amongst the guest domains. Ryan Scott has more details.
- Vastly improved debugging support
- Crash dump analysis and debugging tools have always been a critical feature for Solaris developers. With this release,
we can use Solaris tools to debug both hypervisor crashes and problems with guest domains. I talk a little bit about
the latter feature below.
- xvbdb has been renamed
- To simply be xdb. This was a very exciting change for certain members of our team.
We're still working hard on finishing things up for our phase 2 putback into Nevada (where "phase 1"
was the separate dboot putback). As well as
finishing this work, we're starting to look at further enhancements, in particular some features that are available
in other vendors' implementations, such as a hypervisor-copy based networking device, blktap support,
para-virtualized drivers for HVM domains (a huge performance fix), and more.
Debugging guest domains
Here I'll talk a little about one of the more minor new features that has nonetheless proven very useful.
The xm dump-core command generates an image file of a running domain. This file is a dump of all
memory owned by the running domain, so it's somewhat similar to the standard Solaris crash dump files.
However, dump-core does not require any interaction with the domain itself, so we can grab
such dumps even if the domain is unable to create a crash dump via the normal method (typically, it hangs
and can't be interacted with), or something else prevents use of the standard Solaris kernel debugging facilities
such as kmdb (an in-kernel debugger isn't very useful if the console is broken).
However, this also means that we have no control over the format used by the image file. With Xen 3.0.4,
it's rather basic and difficult to work with. This is much improved in Xen 3.1, but I haven't yet written
the support for the new format.
To add support for debugging such image files of a Solaris domain, I modified mdb(1) to understand the format
of the image file (the alternative, providing a conversion step, seemed unnecessarily awkward, and would have had to
throw away information!). As you can see if you look around usr/src/cmd/mdb in the source drop,
mdb(1) loads a module called mdb_kb when debugging such image files. This provides simple methods for
reading data from the image file. For example, to read a particular virtual address, we need to use the contents of
the domain's page tables in the image file to resolve it to a physical page, then look up the location of that page
in the file. This differs considerably from how libkvm works with Solaris crash dumps: there, we have a
big array of address translations, which is used directly, instead of the page table contents.
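The lookup described above (virtual address, then page tables, then physical page, then file offset) can be sketched in Python with toy data structures. Everything here is an assumption made for illustration: the dict-based tables, the frame_to_offset map and the function names are invented, large pages are ignored, and mdb_kb itself is C code working against the real image format.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
LEVELS = 4                      # 64-bit x86: four levels of page table
INDEX_BITS = 9                  # 512 entries per table
PT_VALID = 0x001
FRAME_MASK = 0x000FFFFFFFFFF000


def table_indices(va):
    """Split a virtual address into four 9-bit indices, top level first."""
    mask = (1 << INDEX_BITS) - 1
    return [(va >> (PAGE_SHIFT + INDEX_BITS * level)) & mask
            for level in reversed(range(LEVELS))]


def va_to_file_offset(va, top_frame, tables, frame_to_offset):
    """Resolve va to an offset in the image file.

    tables maps a frame number to a dict of {index: pte} for that page
    table page; frame_to_offset maps a frame number to the position of
    that page's contents in the file. Both are toy stand-ins.
    """
    frame = top_frame
    for index in table_indices(va):
        pte = tables[frame].get(index, 0)
        if not pte & PT_VALID:
            raise LookupError("no mapping for %#x" % va)
        frame = (pte & FRAME_MASK) >> PAGE_SHIFT
    # frame is now the data page; add the offset within the page.
    return frame_to_offset[frame] + (va & (PAGE_SIZE - 1))
```

The walk has to consult the page table contents at every level, which is the contrast with the libkvm approach: a single precomputed translation array would replace the loop entirely.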
In most other respects, debugging a kernel domain image is much the same as a crash dump:
# xm dump-core solaris-domu core.domu
# mdb core.domu
mdb: warning: dump is from SunOS 5.11 onnv-johnlev; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace xpv_psm scsi_vhci ufs ... sppp ptm crypto md fcip logindmux nfs ]
> ::status
debugging domain crash dump core.domu (64-bit) from sxc16
operating system: 5.11 onnv-johnlev (i86pc)
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc4b7f0 1b 40 9 169 yes yes t-1408926 ffffff00010bfc80 sched
> ::evtchns
Type Evtchn IRQ IPL CPU ISR(s)
evtchn 1 257 1 0 xenbus_intr
evtchn 2 260 9 0 xenconsintr
virq:debug 3 256 15 0 xen_debug_handler
virq:timer 4 258 14 0 cbe_fire
evtchn 5 259 5 0 xdf_intr
evtchn 6 261 6 0 xnf_intr
evtchn 7 262 6 0 xnf_intr
> ::cpustack -c 0
cbe_fire+0x5c()
av_dispatch_autovect+0x8c(102)
dispatch_hilevel+0x1f(102, 0)
switch_sp_and_call+0x13()
do_interrupt+0x11d(ffffff00010bfaf0, fffffffffbc86f98)
xen_callback_handler+0x42b(ffffff00010bfaf0, fffffffffbc86f98)
xen_callback+0x194()
av_dispatch_softvect+0x79(a)
dispatch_softint+0x38(9, 0)
switch_sp_and_call+0x13()
dosoftint+0x59(ffffff0001593520)
do_interrupt+0x140(ffffff0001593520, fffffffffbc86048)
xen_callback_handler+0x42b(ffffff0001593520, fffffffffbc86048)
xen_callback+0x194()
sti+0x86()
_sys_rtt_ints_disabled+8()
intr_restore+0xf1()
disp_lock_exit+0x78(fffffffffbd1b358)
turnstile_wakeup+0x16e(fffffffec33a64d8, 0, 1, 0)
mutex_vector_exit+0x6a(fffffffec13b7ad0)
xenconswput+0x64(fffffffec42cb658, fffffffecd6935a0)
putnext+0x2f1(fffffffec42cb3b0, fffffffecd6935a0)
ldtermrmsg+0x235(fffffffec42cb2b8, fffffffec3480300)
ldtermrput+0x43c(fffffffec42cb2b8, fffffffec3480300)
putnext+0x2f1(fffffffec42cb560, fffffffec3480300)
xenconsrsrv+0x32(fffffffec42cb560)
runservice+0x59(fffffffec42cb560)
queue_service+0x57(fffffffec42cb560)
stream_service+0xdc(fffffffec42d87b0)
taskq_d_thread+0xc6(fffffffec46ac8d0)
thread_start+8()
Note that both ::cpustack and ::cpuregs are capable of using the actual register set at
the time of the dump (since the hypervisor needs to store this for scheduling purposes). You can also
see the ::evtchns dcmd in action here; this is invaluable for debugging interrupt problems (and
we've fixed a lot of those over the past year or so!).
Currently, mdb_kb only has support for image files of para-virtualized Solaris domains. However,
that's not the only interesting target: in particular, we could support mdb in live
crash dump mode against a running Solaris domain, which opens up all sorts of interesting debugging
possibilities. With a small tweak to Solaris, we can support debugging of fully-virtualized Solaris instances.
It's not even impossible to imagine adding Linux kernel support to mdb(1), though it's hard to imagine there
would be a large audience for such a feature...
Tags: Xen OpenSolaris
|