In my previous post, I went over some of the goals we are trying to
attain through a data-center architecture based on virtualization with
xVM Xen: essentially, looking for ways to work smarter and faster and
to be more flexible. In this entry, I will attempt to go into the
details of the infrastructure we built.
Network Layout
A good place to start is the general network layout. The first
diagram shows what we have done. Nothing too fancy. At the heart of
it are two independent physical networks. We do this to increase
availability. Two load balancers sit at the top of the stack,
cross-connected to switches on each physical network. They speak
something like VRRP to each other to decide which one is the master
and which one is on standby.
There are two logical networks that run throughout the infrastructure,
the Public network and the Backend network. If the names aren't clear
enough, the Public network is meant to serve traffic between us and
our customers (via the internet). The Backend network is for various
flavors of server-to-server communication within the data-center.
Each physical host, therefore, has two network connections, one to
each logical network. The notable exception is the Sun Storage Unified
Storage 7410 Cluster, which is connected only to the Backend network,
and has connections to both physical networks. The cluster is
configured in an active-passive mode, which means that only one of
the two 7410 head units is doing the file serving at any one time. To
ensure that it can still serve the Even physical segment when the Odd
physical segment is down, and vice versa, we need to give it a
presence on each physical segment. I'll discuss more about what
specifically we're doing with the Unified Storage Cluster later on.
Install Servers
Nothing really out of the ordinary here. We've set up a pair of PXE
Boot/Install servers, mainly to handle installing additional
Hypervisors. The Install servers themselves are running Solaris
Nevada 105; I don't recall what that translates to in terms of SXCE
releases. There is nothing special about that version, other than
that it happened to be the most recent drop we could get when we
started assembling everything.
These are more or less independent of each other. Most of the
installs work against the primary install server. The only real
reason to switch to the other server is if the primary is down or if
its network segment is down. But as you'll see, we're really not
doing many installs at all.
DHCP Servers
The DHCP servers actually run on the same physical hosts as the
install servers. They are using the DHCP daemons that come as part of
Solaris/Nevada. They play a key part in the whole PXE Boot process,
so they have had additional customizations made to allow for that -
there are a few good articles out there describing the additional
macros needed to accomplish PXE Boot.
The one other interesting thing to note is that the pair of DHCP
servers is configured as a cluster. The DHCP daemon doesn't really
have a notion of being in a cluster; however, it can be configured to
use a shared datastore, which each daemon can read and update. In our
case, this is the first use of the Unified Storage Cluster. The DHCP
servers are configured to use an NFS share on the cluster. The caveat
here is that you must configure the daemons to use the SUNWfiles
datastore. Neither SUNWbinfiles nor SUNWnisplus will work.
A couple of other things to be aware of when clustering DHCP: make
sure you set the OWNER_IP parameter in each daemon's local
dhcpsvc.conf file to a comma-separated list of the IP addresses of
every interface that will serve DHCP requests, across all servers.
Also make sure you set RESCAN_INTERVAL to a value that is reasonable
for you; in our case we just set it to 1 minute. Both of these values
can be updated with the dhcpconfig -P command.
dhcpconfig -P OWNER_IP=192.168.78.14,192.168.78.16,192.168.76.25,192.168.76.25
dhcpconfig -P RESCAN_INTERVAL=1
The Hypervisors
First things first. The Hypervisors are all well-loaded Sun Fire X4150
servers. They have been installed with Solaris/Nevada 105, again just
because that was current at the time. We chose Nevada instead of
OpenSolaris primarily because it offers a better unattended/headless
install environment. They are configured with ZFS root on local hard
drives, and the ZFS mirroring has been set up to increase uptime.
Not much in the way of customizations to Dom-0 or the Hypervisor.
Obviously in Dom-0 we disable as many unnecessary services as
possible. Since these are servers, we shut down all of the
desktop-related services like the X server.
We also limit Dom-0 memory to only 2G using the dom0_mem
parameter to the hypervisor. This might be a little aggressive since
ZFS is memory hungry, but we want to try to keep as much memory
available for the guest domains as possible, and we haven't seen a
problem with this yet.
We also set the hypervisor console to com1, in case we need to break
into the console for any sort of debugging (knock on wood, we won't
have to do that).
Both of these parameters are set from the GRUB boot commands:
kernel$ /boot/$ISADIR/xen.gz com1=9600,8n1 console=com1 dom0_mem=2G
We also use the vanity device naming capabilities of Solaris - via
dladm - to give the Public and Backend interfaces the same names on
every host. So,
although today we're using Sun Fire X4150 servers which all have
Intel Pro/1000 network interface controllers in them, in the future,
we can move to different platforms and still maintain a consistent
device naming convention. This is actually pretty crucial for Xen
Live Migration to work. Xen needs the same network interfaces to be
available on the source and destination of the Live Migration.
Without this, the migration will not succeed.
dladm rename-link e1000g0 fe0
dladm rename-link e1000g1 be0
Finally, while we're talking about Live Migration, it's something we
need to enable in xend. A couple of simple SMF changes handle that.
svccfg -s xend setprop config/xend-relocation-address = 192.168.78.21
svccfg -s xend setprop config/xend-relocation-hosts-allow = astring: \
\"^localhost$ ^192\.168\.78\.[0-9]*$\"
For what it's worth, we currently have 22 Hypervisors in the cloud,
so clearly not a huge deployment yet.
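Going back to the relocation settings for a moment: before changing
SMF, it can help to try the hosts-allow patterns in a plain shell to
see which peers they would admit. This is only a sketch. It assumes
xend reads the property as a space-separated list of regular
expressions and accepts a peer if any one of them matches, and the
relocation_allowed helper is a hypothetical name, not part of xVM.

```shell
# Hypothetical helper: succeed if the given host matches any pattern.
# Assumes hosts-allow is a space-separated list of regular expressions.
relocation_allowed() {
    host=$1
    for pattern in '^localhost$' '^192\.168\.78\.[0-9]*$'; do
        if printf '%s\n' "$host" | grep -Eq "$pattern"; then
            return 0
        fi
    done
    return 1
}

# Try one peer on each network plus localhost.
for host in localhost 192.168.78.21 192.168.76.25; do
    if relocation_allowed "$host"; then
        echo "$host: allowed"
    else
        echo "$host: refused"
    fi
done
```

With these patterns only localhost and addresses in the 192.168.78.x
range pass, so a peer on the 192.168.76.x network is refused.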
Unified Storage Cluster
These units sort of fell in our lap at just the right time. Although
we could have approached the level of availability they offer with
some solution built on top of Solaris & SunCluster, the ease of
installation, configuration and maintenance that the Unified Storage
Cluster affords should help keep the infrastructure simple and
straightforward to maintain.
As I mentioned earlier, these units are configured in an
active-passive
cluster mode, and have a presence on each physical network segment.
Configuring them as a cluster is amazingly simple, as any appliance
should be. Once the CLUSTRON interface is connected between them, the
initial boot to configure the first head node automatically detects
the presence of the second head node, and prompts you to configure
them as a cluster.
As far as how we are using them, well, they are used in a few
capacities. First, as I mentioned earlier, they are used to house the
DHCP servers' shared datastore. We also use them to house various
administrative tools and bits.
But that's not the main use. Primarily we are using them to hold the
virtual disks on which all the guest operating systems (virtual
servers) are installed. This is done by creating LUNs in the Unified
Storage and exposing them as iSCSI targets, which are then attached
to the Dom-0s and made available to the Hypervisors.
This is another critical piece that makes some of the cool features
of xVM Xen, like Live Migration, possible. Live Migration is the
process of moving a running virtual server from one physical host to
another. Did you read that? A running virtual server!
For this to happen, the virtual disk must be available on the target
physical host. Using iSCSI makes this a snap since all you need to do
is attach the LUN to the target physical host and you are done. If
you think of the alternative with local storage, you would have to
somehow transfer the bits from the local storage on the source
physical host to the target, while the virtual server is still
running, which among other things is nearly impossible to do in a
real-time manner.
Ok, so one other use of the Unified
Storage is for shared storage for applications running inside the
virtual servers. Building horizontally scalable, redundant
applications sometimes requires the use of storage that can be
accessed by all nodes in the application cluster. We provide this on
the Unified Storage cluster with NFS.
Putting it All Together
That's a summary of all the pieces. Now how does it all fit
together? Here's a diagram that shows all the interactions.
Let's look at the organization on the Unified Storage Cluster first.
We use different projects within the Unified Storage to keep things
manageable. The first thing we've done is build a set of master
images of various guest operating systems like OpenSolaris - 08.11 &
Dev, Nevada, and so forth. Those images are kept in a Masters
project. These images are pre-installed instances of guest operating
systems that reside on (iSCSI) LUNs in the Unified Storage. A
snapshot of the image is taken after any revision we make to the
image (note that snapshots should only be taken when the guest has
been shut down).
When we're ready to spin up a new virtual machine, we select the
current snapshot of the O/S image we want, and clone it as a new LUN
in a separate Unified Storage project. The great thing here is that
the clones initially take up zero additional space, and only start to
use their own space when the operating system changes anything on the
virtual disk, an application is installed, and the like. Quite a
savings in disk space. We have some scripts that interact with the
Unified Storage to do this.
These commands will create a new project called appl-1, and then
clone a master image to the project as vm-1:
domu-project appl-1
domu-clone masters/osol-0811@version-01 appl-1/vm-1
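The first argument to domu-clone has a project/LUN@snapshot shape,
much like a ZFS snapshot name. As a rough illustration of how a
wrapper script might pick that apart before talking to the Unified
Storage, here is a minimal sketch using POSIX parameter expansion.
The split_snapshot_spec function is hypothetical; it is an
illustration only, not the code inside domu-clone.

```shell
# Hypothetical helper: split "project/lun@snapshot" into its three parts.
split_snapshot_spec() {
    spec=$1
    project=${spec%%/*}    # everything before the first "/"
    rest=${spec#*/}        # everything after the first "/"
    lun=${rest%@*}         # part before the "@"
    snapshot=${rest#*@}    # part after the "@"
    printf '%s %s %s\n' "$project" "$lun" "$snapshot"
}

split_snapshot_spec masters/osol-0811@version-01
# prints: masters osol-0811 version-01
```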
After the image is cloned, we instruct the Unified Storage to export
the cloned LUN as an iSCSI target. The target is then attached to a
Dom-0 and Hypervisor. Currently we've been putting about 4-6 guests
on each Hypervisor before moving on to a new one for additional
virtual servers.
From here, the usual Xen commands are used to define a domain, that
is, the xm create or virsh create commands. As part of the domain
configuration, we specify the attached iSCSI LUN as the virtual disk
for the domain. Again, this is simplified through scripting by using
a pre-defined XML template for domain creation.
This example shows the creation of a paravirtualized guest - pv, which has both a Public and Backend interface - fe-be, on the Odd network segment - odd, with 1G of memory - 1024, and 1 CPU - 1.
domu-init pv fe-be odd 1024 1 appl-1/vm-1
I should point out that we're not really pre-attaching the iSCSI
LUN to
Dom-0, but instead using one of the enhancements of xVM Xen which
will do this for us. We simply specify the iSCSI Qualified Name (IQN)
and the IP address of the target as the virtual disk in the domain
configuration, and let xVM Xen deal with attaching it when the domain
starts up.
Now, we're almost there, but we need to assign IP addresses to the
virtual
server. We use DHCP for this. More specifically we bind specific IP
addresses to specific MAC addresses to ensure that a virtual server
always uses its assigned address(es). Xen creates a sort of synthetic
MAC address for each interface it configures and persists the
address. We can grab these addresses before the domain starts to
update DHCP.
This is a good time to point out that all the master images of guest
operating systems have been configured as DHCP clients. This really
simplifies the entire process, since there is no need to do any
post-configuration of the cloned images to give them the correct
network resources. With DHCP, it just happens. You can also see why
it is critical to have a highly available DHCP cluster, since
so much relies on it. Once again, this is scripted.
This example shows the assignment of a Public and a Backend interface to the new guest:
domu-assign-ip appl-1:vm-1 fe0 vm-host1 192.168.76.71
domu-assign-ip appl-1:vm-1 be0 vm-host1-be 192.168.78.71
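Solaris DHCP identifies an Ethernet client by a client ID made up of
the hardware type 01 followed by the MAC address as twelve uppercase
hex digits. A small helper along these lines can turn a guest's MAC
address into that form before it is handed to the DHCP tooling. It is
a sketch only: the mac_to_client_id name and the sample MAC address
are made up for illustration, and this is not the code inside
domu-assign-ip.

```shell
# Hypothetical helper: convert a MAC address to a DHCP client ID.
# Accepts octets with or without leading zeros (0:16:3e:1:2:3).
mac_to_client_id() {
    old_ifs=$IFS
    IFS=:
    set -- $1              # split the MAC into octets on ":"
    IFS=$old_ifs
    id=01                  # hardware type 01 = Ethernet
    for octet in "$@"; do
        # Zero-pad each octet to two uppercase hex digits.
        id="$id$(printf '%02X' "0x$octet")"
    done
    printf '%s\n' "$id"
}

mac_to_client_id 0:16:3e:1:2:3
# prints: 0100163E010203
```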
And that's it. At this point, we're ready to fire up the guest.
virsh start appl-1:vm-1
virsh console appl-1:vm-1
Wrap Up
That's about it for the basic details of what we've done. It's all
been working incredibly well, especially considering we're running
about ¾ development code everywhere.
Before I forget, I would really like to thank the xVM Xen team for
their support while we've been setting things up and tinkering
around. They have all been very helpful and responsive to my
questions on xen-discuss@opensolaris.org as well as in private
threads. Mark Johnson deserves a special mention since I glommed onto
him the most.
Up next, a summary of how well we're doing on the previously outlined
goals.