Today we released our current source tree for our Solaris Xen port; for more details
and the downloads see the Xen community on OpenSolaris.
One of the most useful features of Xen is its ability to package up a running
OS instance (in Xen terminology, a "domainU", where "U" stands for
"unprivileged"), plus all of its state, and take it offline, to be resumed at a
later time. Recently we performed the first successful live migration of a
running Solaris instance between two machines. In this blog I'll cover the
various ways you can do this.
Para-virtualisation of the MMU
Typical "full virtualisation" uses a method known as "shadow page tables",
whereby two sets of page tables are maintained: the guest domain's set,
which isn't visible to the hardware via cr3, and page tables visible to the
hardware which are maintained by the hypervisor. As only the hypervisor can
control the page tables the hardware uses to resolve TLB misses, it can
maintain the virtualisation of the address space by copying and validating any
changes the guest domain makes to its copies into the "real" page tables.
All these duplicate pages come at a cost, of course. A para-virtualisation
approach (that is, one where the guest domain is aware of the virtualisation
and complicit in operating within the hypervisor) can take a different tack. In
Xen, the guest domain is made aware of a two-level address system. The domain
is presented with a linear set of "pseudo-physical" addresses comprising the
physical memory allocated to the domain, as well as the "machine" addresses for
each corresponding page. The machine address for a page is what's used in the
page tables (that is, it's the real hardware address). Two tables are used to
map between pseudo-physical and machine addresses. Allowing the guest domain
to see the real machine address for a page provides a number of benefits, but
slightly complicates things, as we'll see.
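As a rough illustration of the two-level scheme, here is a toy model of the two lookup tables. The class and method names are made up for this sketch and are not part of Xen or Solaris; it also assumes 4 KiB pages.

```python
# Toy model only: AddressMaps and its methods are hypothetical names,
# not an interface provided by Xen or Solaris.
PAGE_SHIFT = 12                       # assumes 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class AddressMaps:
    def __init__(self, machine_frames):
        # p2m[pfn] -> mfn: the domain sees frames 0..n-1, backed by
        # arbitrary machine frames.
        self.p2m = list(machine_frames)
        # m2p[mfn] -> pfn: the reverse lookup.
        self.m2p = {mfn: pfn for pfn, mfn in enumerate(self.p2m)}

    def pseudo_to_machine(self, paddr):
        pfn = paddr >> PAGE_SHIFT
        return (self.p2m[pfn] << PAGE_SHIFT) | (paddr & OFFSET_MASK)

    def machine_to_pseudo(self, maddr):
        mfn = maddr >> PAGE_SHIFT
        return (self.m2p[mfn] << PAGE_SHIFT) | (maddr & OFFSET_MASK)


# Three pseudo-physical pages backed by scattered machine frames.
maps = AddressMaps([0x9A3, 0x17, 0x4C2])
machine = maps.pseudo_to_machine(0x1ABC)   # page 1, offset 0xabc
```

The pseudo-physical side is dense and starts at zero, whilst the machine frames can be anywhere; only the frame number changes in the translation, never the offset within the page.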
Save/Restore
The simplest form of "packaging" a domain is suspending it to a file in the
controlling domain (a privileged OS instance known as "domain 0"). A running
domain can be taken offline via an xm save command, then restored at a later
time with xm restore, without having to go through a reboot cycle - the domain
state is fully restored.
xm save xen-7 /tmp/domain.img
An xm save asks the domain to suspend itself. The request arrives via the
xenbus watch system on the node control/shutdown, and is handled by
xen_suspend_domain(). The handler is actually remarkably simple. First we leverage
Solaris's existing suspend/resume subsystem, CPR, to iterate through the
devices attached to the domain's device nexus. This calls each of the virtual
drivers we use (the network, console, and block device frontends) with a
DDI_SUSPEND argument. The virtual console, for example, simply removes its
interrupt handler in xenconsdetach(). In a guest domain, this tears down the
Xen event channel used to communicate with the console backend. The rest of the
suspend code deals with tearing down some of the things we use to communicate
with the hypervisor and domain 0, such as the grant table mappings.
Additionally, we convert a couple of stored MFN values (the frame numbers of
machine addresses) into pseudo-physical PFNs. This is because the MFNs are free to
change when we restore the guest domain; as the PFNs aren't "real", they will
stay the same. Finally we call HYPERVISOR_suspend() to call into the hypervisor
and tell it we're ready to be suspended.
Now the domain 0 management tools are ready to checkpoint the domain to the
file we specified in the xm save command. Despite the name, this is done via
xc_linux_save(). Its main task is to convert any MFN values that the domain
still has into PFN values, then write all its pages to the disk. These MFN
values are stored in two main places: the PFN->MFN mapping table managed by the
domain, and the actual pages of the page tables.
During boot, we identified which pages store the PFN->MFN table (see
xen_relocate_start_info()), and pointed to that structure in the "shared info"
structure, which is shared between the domain and the hypervisor. This is
used to map the table in xc_linux_save().
The hypervisor keeps track of which pages are being used as page tables. Thus,
after domain 0 has mapped the guest domain's pages, we write out the page
contents, but modify any pages that are identified as page tables. This is
handled by canonicalize_pagetable(); this routine replaces the MFN in each PTE
with the corresponding PFN value.
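The idea behind the page table rewrite can be sketched in a few lines. This is not the real canonicalize_pagetable() code: it assumes a simplified PTE layout (frame number in the upper bits, flags in the low 12 bits, bit 0 meaning "present"), and the function name and m2p dictionary are invented for the example.

```python
# Simplified sketch; the PTE layout and names here are assumptions,
# not the actual Xen tools code.
PAGE_SHIFT = 12
FLAGS_MASK = (1 << PAGE_SHIFT) - 1
PTE_PRESENT = 0x1


def canonicalize(ptes, m2p):
    """Return a copy of a page table with each MFN replaced by its PFN."""
    out = []
    for pte in ptes:
        if pte & PTE_PRESENT:
            mfn = pte >> PAGE_SHIFT
            # Keep the flag bits, swap only the frame number.
            pte = (m2p[mfn] << PAGE_SHIFT) | (pte & FLAGS_MASK)
        out.append(pte)
    return out


m2p = {0x9A3: 0, 0x17: 1}               # machine frame -> pseudo-physical frame
page_table = [0x17067, 0x0, 0x9A3001]   # two present entries, one empty
saved = canonicalize(page_table, m2p)
```

Entries that aren't present are written out untouched, since whatever is in their upper bits isn't a frame number that needs translating.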
There are a couple of other things that need to be fixed too, such as the GDT.
xm restore /tmp/domain.img
Restoring a domain is essentially the reverse operation: the data for each page
is written into one of the machine addresses reserved for the "new" domain; if
we're writing a saved page table, we replace each PTE's PFN value with the new
MFN value used by the new instance of the domain.
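The restore-side fix-up is the mirror image, and can be sketched the same way. Again this assumes a simplified PTE layout (flags in the low 12 bits, bit 0 meaning "present"); uncanonicalize and new_p2m are names made up for the example rather than anything from the Xen tools.

```python
# Simplified sketch; the PTE layout and names here are assumptions.
PAGE_SHIFT = 12
FLAGS_MASK = (1 << PAGE_SHIFT) - 1
PTE_PRESENT = 0x1


def uncanonicalize(saved_ptes, new_p2m):
    """Replace each saved PFN with the MFN backing it in the new domain."""
    out = []
    for pte in saved_ptes:
        if pte & PTE_PRESENT:
            pfn = pte >> PAGE_SHIFT
            pte = (new_p2m[pfn] << PAGE_SHIFT) | (pte & FLAGS_MASK)
        out.append(pte)
    return out


saved_table = [0x1067, 0x0, 0x001]   # PTEs holding PFNs 1 and 0
new_p2m = [0x200, 0x305]             # machine frames given to the restored domain
restored = uncanonicalize(saved_table, new_p2m)
```

The restored domain ends up with the same pseudo-physical layout as before, just backed by whichever machine frames happened to be free.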
Eventually the restored domain is given back control, coming out from the
HYPERVISOR_suspend() call. Here we need to rebuild the event channel setup, and
anything else we tore down before suspending. Finally, we return from the
suspend handler and continue on our merry way.
Migration
xm migrate xen-7 remotehost
A normal save/restore cycle happens on the same machine, but migrating a domain
to a separate machine is a simple extension of the process. Since our save
operation has replaced any machine-specific frame number value with the
pseudo-physical frames, we can easily do the restore on a remote machine,
even though the actual hardware pages given to the domainU will be different. The
remote machine must have the Xen daemon listening on the HTTP port, which is a
simple change in its config file. Instead of writing each page's contents to a
file, we can transmit it across HTTP to the Xen daemon running on a remote
machine. The restore is done on that machine in the same manner as described
above.
Live Migration
xm migrate --live xen-7 remotehost
The real magic happens with live migration, which keeps the time the domain
is not running to a bare minimum (on the order of milliseconds). Live
migration relies on the empirical observation that an OS instance is
unlikely to modify a large percentage of its pages within a certain time frame;
thus, by iteratively copying over modified domain pages, we'll eventually reach
a point where the remaining data to be copied is small enough that the actual
downtime for a domainU is minimal.
In operation, the domain is switched to use a modified form of the shadow page
tables described above, known as "log dirty" mode. In essence, a shadow page
table is used to notify the hypervisor if a page has been written to, by
keeping the PTE for the page read-only: an attempt to write to the page
causes a page fault. This page fault is used to mark the domain page as "dirty"
in a bitmap maintained by the hypervisor, which then fixes up the domain's page
fault and allows it to continue.
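To make the mechanism concrete, here is a small simulation of log-dirty tracking. It models the hypervisor's side very loosely: the class, its fields, and the "read and clear" operation are all assumptions for illustration, not the Xen shadow code.

```python
# Toy simulation; LogDirtyDomain is a hypothetical name, and real shadow
# page tables are considerably more involved than a list of booleans.
class LogDirtyDomain:
    def __init__(self, num_pages):
        self.memory = [0] * num_pages
        self.writable = [False] * num_pages   # every mapping starts read-only
        self.dirty = [False] * num_pages      # the dirty bitmap
        self.faults = 0

    def write(self, pfn, value):
        if not self.writable[pfn]:
            self._write_fault(pfn)            # trap on the first write
        self.memory[pfn] = value

    def _write_fault(self, pfn):
        self.faults += 1
        self.dirty[pfn] = True                # record the page as dirty
        self.writable[pfn] = True             # fix up so the write can proceed

    def read_and_clear_dirty(self):
        pages = [pfn for pfn, d in enumerate(self.dirty) if d]
        # Start a new tracking round: everything is read-only again.
        self.dirty = [False] * len(self.dirty)
        self.writable = [False] * len(self.writable)
        return pages


dom = LogDirtyDomain(4)
dom.write(2, 7)
dom.write(2, 8)          # second write to the same page does not fault
first_round = dom.read_and_clear_dirty()
dom.write(2, 9)          # faults again after the bitmap was cleared
```

Only the first write to a page in each round pays for a fault, which is why the overhead stays tolerable while the domain keeps running.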
Meanwhile, the domain management tools iteratively transfer the domain's pages
to the remote machine. They read the dirty page bitmap and re-transmit any page
that has been modified since it was last sent, until they reach a point where
they can finally tell the domain to suspend, and switch over to running it on the
remote machine. This process is described in more detail in Live
Migration of Virtual Machines.
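The iterative copy loop can be sketched as follows. This is a simulation rather than the tools' real logic: the guest's writes are supplied up front as one dictionary per round, and the threshold and max_rounds stopping rules are assumptions picked for the example.

```python
# Toy pre-copy loop; the function name, parameters and stopping rule are
# assumptions for illustration only.
def live_migrate(source, writes_per_round, threshold=2, max_rounds=10):
    pages = list(source)                  # the running domain's memory
    dest = [None] * len(pages)            # memory on the remote machine
    to_send = set(range(len(pages)))      # first round: send everything
    writes = iter(writes_per_round)
    rounds = 0
    while len(to_send) > threshold and rounds < max_rounds:
        for pfn in to_send:
            dest[pfn] = pages[pfn]
        rounds += 1
        # The domain kept running during the transfer and dirtied some pages.
        dirtied = next(writes, {})
        for pfn, value in dirtied.items():
            pages[pfn] = value
        to_send = set(dirtied)            # only these need re-sending
    # Suspend the domain, then copy the small remainder while it is stopped.
    final_pages = len(to_send)
    for pfn in to_send:
        dest[pfn] = pages[pfn]
    return dest, rounds, final_pages


dest, rounds, final_pages = live_migrate(
    source=[0, 1, 2, 3, 4, 5, 6, 7],
    writes_per_round=[{0: 100, 1: 101, 2: 102}, {1: 201}],
)
```

In this run all eight pages go over in the first round, three in the second, and just one has to be copied while the domain is actually suspended.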
Whilst transmitting all the pages takes a while, the actual time between
suspension and resume is typically very small. Live migration is pretty fun to
watch happen; you can be logged into the domain over ssh and not even notice
that the domain has migrated to a different machine.
Further Work
Whilst live migration is currently working for our Solaris changes, there are
still a number of improvements and fixes that need to be made.
On x86, we usually use the TSC register as the basis for a high-resolution
timer (heavily used by the microstate accounting subsystem). We don't directly
use any virtualisation of the TSC value, so when we restore a domain, we can
see a large jump in the value, or even see it go backwards. We handle this OK
(once we fixed bug 6228819 in our gate!), but don't yet properly handle the
fact that the relationship between TSC ticks and clock frequency can change
between a suspend and resume. This screws up our notion of timing.
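To show why the tick-to-frequency relationship matters, here is one possible way a TSC-based clock could be rebased across a suspend. This is purely a sketch of the problem, not what our code does today; the class and its resume() hook are hypothetical.

```python
# Hypothetical sketch; not the Solaris timer code.
NANOSEC = 1_000_000_000


class TscClock:
    def __init__(self, tsc_now, freq_hz):
        self.base_tsc = tsc_now
        self.base_ns = 0
        self.freq_hz = freq_hz

    def now_ns(self, tsc):
        # Elapsed ticks since the last rebase, scaled by the current frequency.
        return self.base_ns + (tsc - self.base_tsc) * NANOSEC // self.freq_hz

    def resume(self, tsc_at_suspend, tsc_after_resume, new_freq_hz):
        # Freeze the time reached at suspend, then restart counting from
        # whatever TSC value and frequency the new host has.
        self.base_ns = self.now_ns(tsc_at_suspend)
        self.base_tsc = tsc_after_resume
        self.freq_hz = new_freq_hz


clock = TscClock(tsc_now=1000, freq_hz=2_000_000_000)
before = clock.now_ns(1000 + 2_000_000_000)       # one second at 2 GHz
# The new host's TSC is lower than before and ticks at 1 GHz.
clock.resume(1000 + 2_000_000_000, 500, 1_000_000_000)
after = clock.now_ns(500 + 1_000_000_000)         # one more second at 1 GHz
```

Without the frequency update in resume(), the same billion ticks on the new host would be counted as half a second, which is exactly the kind of skew described above.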
We don't make any effort to release physical pages that we're not currently
using. This makes suspend/resume take longer than it should, and it's probably
worth investigating what can be done here.
Currently many hardware-specific instructions and features are enabled at boot
by patching in instructions if we discover the CPU supports them. For example, we
discovered a domain that died badly when it was migrated to a host that didn't
support the sfence instruction. If such a kernel is migrated to a machine with
different CPUs, the domain will naturally fail badly. We need to investigate
preventing incompatible migrations (the standard Xen tools currently do no
verification), and also look at whether we can adapt to some of these changes
when we resume a domain.
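A compatibility check of the sort we'd want could be as simple as a set comparison. The sketch below is hypothetical: the standard tools do no such check, and the feature names are just labels chosen for the example.

```python
# Hypothetical pre-migration check; nothing like this exists in the
# standard Xen tools, and the feature labels are illustrative only.
def missing_features(patched_in, destination):
    """Features the running kernel relies on that the target host lacks."""
    return sorted(set(patched_in) - set(destination))


def can_migrate(patched_in, destination):
    return not missing_features(patched_in, destination)


kernel_uses = {"sfence", "sysenter"}        # patched in at boot on the old host
old_style_host = {"sysenter"}
newer_host = {"sfence", "sysenter", "sse2"}

problems = missing_features(kernel_uses, old_style_host)
```

The check is deliberately one-sided: extra features on the destination are harmless, since the kernel simply won't have been patched to use them.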
Tags: OpenSolaris Xen
Posted by Peter on April 20, 2006 at 07:24 AM BST #