The HyperTrap
Alexandre Chartre's Weblog
20081224 Wednesday December 24, 2008
LDoms 1.1
It has been a while since I last added an entry to this blog, but there was a good reason: we have been working very hard on a new release of the LDoms product. This new release is now available as LDoms 1.1:

LDoms 1.1 Features:

LDoms 1.1 Usage Examples

With this release, LDoms are now more flexible and dynamic, and have enhanced virtual I/O performance. This is particularly useful on a system like the Sun SPARC Enterprise T5440 server, where you can have up to 256 cpu threads and 512GB of memory. Here are some examples of the benefits of using some of the new features of LDoms 1.1 on such a large and powerful system.
Sun SPARC Enterprise T5440 Server Overview
On the Sun SPARC Enterprise T5440 server you can create up to 128 domains, but the configuration you need will mostly depend on the number of cpus and the amount of memory you want to use for each domain. Here are some configuration examples:

Num. of Domains   CPUs/Domain   Memory/Domain   Comments
128               2             4GB             each domain has two cpu threads
32                8             16GB            each domain can have an entire cpu core (1 core has 8 threads)
4                 64            128GB           each domain can have an entire cpu chip (1 chip has 8 cores)
1                 256           512GB           the domain has all 4 cpu chips (4 chips x 8 cores x 8 threads = 256)

Of course, the number of domains and the associated number of cpus and amount of memory can be adjusted depending on your needs.
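The arithmetic behind the table above can be sketched in a few lines of Python. The totals (256 cpu threads, 512GB of memory) come from the text; the function name and the assumption that resources are split evenly are purely illustrative, and nothing here is part of the LDoms tools.

```python
# Sketch: evenly split the resources of a T5440 across N domains.
# Totals are taken from the post; the even split is an assumption.
TOTAL_THREADS = 4 * 8 * 8   # 4 chips x 8 cores x 8 threads = 256
TOTAL_MEMORY_GB = 512

def even_split(num_domains):
    """Return (cpu threads, memory in GB) per domain for an even split."""
    if not 1 <= num_domains <= 128:
        raise ValueError("the T5440 supports between 1 and 128 domains")
    return TOTAL_THREADS // num_domains, TOTAL_MEMORY_GB // num_domains

for n in (128, 32, 4, 1):
    threads, mem = even_split(n)
    print(f"{n:>3} domains: {threads:>3} threads, {mem:>3}GB each")
```

In practice domains do not have to be the same size, so this only reproduces the four rows listed in the table.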

The T5440 also has 4 PCI buses, which means that you can create up to 4 I/O domains. Such domains can have direct access to physical I/O resources, like a network card or a physical disk. So you can effectively split the T5440 into 4 fully independent systems (for example 4 domains, each with 64 cpu threads and 128GB of memory) which have their own physical disks and network interfaces and do not depend on any other domain because they don't need to use virtual I/O.

Domain Migration

With such a large number of domains (up to 128), it is important to be able to easily move a domain from one system to another in case you need to shut down the entire platform (for example during maintenance), or if you have some new hardware available and you want to free some resources on an existing system (for example to allocate more cpus or memory to the existing domains). You can easily do this with the Domain Migration feature of LDoms 1.1. Migration is done with a single command:

        primary# ldm migrate domain_to_migrate system_to_migrate_to
Then the system will automatically select the appropriate type of migration depending on the state of the domain to migrate:
Network NIU Hybrid I/O
To use network NIU hybrid I/O, you need an UltraSPARC-T2 based system, like the T5120 or the T5220, and some 10-Gigabit Ethernet XAUI Adapters. An XAUI adapter is managed by the nxge driver and is assigned to an I/O domain. It can also be associated with a virtual switch (vsw). Then the virtual network interface (vnet) of a guest domain connected to a virtual switch associated with an XAUI adapter can operate in "hybrid" mode. In such a mode, the guest will be able to receive and send network packets to the XAUI adapter without having to go through the virtual switch or the I/O domain. That way a guest domain is able to get the performance of a physical network interface while using a virtual network interface.

The figure below illustrates the difference between virtual I/O and hybrid I/O by showing the path of the network packets in the different modes:


Configuring a virtual network interface in hybrid mode is very simple: just specify "mode=hybrid" when adding or setting a virtual network interface. For example:

        primary# ldm add-vnet mode=hybrid vnet0 primary-vsw0 ldg1
or
        primary# ldm set-vnet mode=hybrid vnet0 ldg1
Note that the setting "mode=hybrid" is just a hint to the system so that it tries to use hybrid I/O. If the system is unable to use the hybrid mode (for example because the virtual switch is not associated with an XAUI adapter) then the system will automatically fall back to the legacy mode and use virtual I/O.

The T5120 and T5220 can have up to 2 XAUI adapters, and each XAUI adapter can be shared 3 times in hybrid mode. That means that you can have up to 6 domains with a virtual network interface using hybrid I/O.

See Raghuram's blog for more details about Network Hybrid I/O.

Virtual Disk Failover and Multipathing
When you have a domain using virtual I/O, one concern is to keep that domain up and running even if the service domain providing virtual I/O services happens to be down. LDoms already provide a solution to that problem for virtual network I/O, by using IP Multipathing (IPMP) in the guest domain on top of two virtual network interfaces associated with virtual switches from different service domains.

In addition, LDoms 1.1 provides a solution to that problem for virtual disk I/O. If multiple service domains have access to the same virtual disk backend (for example a file on an NFS server, or a shared LUN on a SAN) then a virtual disk can be associated with all these service domains and the path to access the virtual disk backend will change depending on the availability of the service domains.

Virtual disk multipathing is configured by putting the vdsdev representing the same virtual disk backend into the same multipathing group (mpgroup). This is done when using the add-vdsdev command. For example, if we have two service domains (primary and alternate), each with a vds service (primary-vds0 and alternate-vds0), and each service domain is able to access the same NFS file /home/domain/ldg1/vdisk0, then we can put that backend file into the same mpgroup "foo".

        primary# ldm add-vdsdev mpgroup=foo /home/domain/ldg1/vdisk0 vdisk0@primary-vds0
        primary# ldm add-vdsdev mpgroup=foo /home/domain/ldg1/vdisk0 vdisk0@alternate-vds0
Finally the backend file can be exported as a virtual disk to the domain ldg1:
        primary# ldm add-vdisk vdisk0 vdisk0@primary-vds0 ldg1
That way the virtual disk will be accessible in domain ldg1, primarily through the primary domain. But if the primary domain goes down then the virtual disk will remain accessible through the alternate domain. This is illustrated in the following figures:

On the T5440, you can have up to 4 I/O domains. So you can easily make guest domains resilient to the failure of a service domain by creating at least 2 I/O domains. Then you can set up guest domains so that they use IPMP on top of their virtual network interfaces, and virtual disk multipathing through the service domains you have. That way, guest domains will remain fully functional and preserve disk and network access even if a service domain is down.

Dec 24 2008, 09:15:48 AM PST Permalink

20080205 Tuesday February 05, 2008
LDoms Virtual Disks
A virtual disk is made of two components: the virtual disk itself as it appears in a guest domain, and the virtual disk backend which is where data will effectively be stored and where virtual I/Os will eventually end up. The virtual disk backend is exported from a service domain by the virtual disk server (vds) driver. The vds driver communicates with the virtual disk client (vdc) driver in the guest domain through the hypervisor using a logical domain channel (ldc). Finally a virtual disk appears as a /dev/[r]dsk/cXdYsZ device in the guest domain.

The virtual disk backend can be a physical disk, a physical disk slice, a file or a volume from a volume management framework (like ZFS, SVM, VxVM...).

A backend is exported from a domain with the command "ldm add-vdsdev" (or "ldm add-vdiskserverdevice"):

    # ldm add-vdsdev <backend> <volume_name>@<service_name>
And it is assigned to another domain with the command "ldm add-vdisk":
    # ldm add-vdisk <disk_name> <volume_name>@<service_name> <domain>
Note that a backend is effectively exported when the domain <domain> is bound.

Virtual Disk Export Options

There are two ways a backend can be exported as a virtual disk, either as a full disk or as a single slice disk. Currently, the way a backend is exported (either as a full disk or as single slice disk) depends on the type of backend (whether it is a disk, a slice, a file or a volume). The next section (Virtual Disk Backend) explains how each type of backend is exported.

Virtual Disk Backend

The virtual disk backend is the location where the data of a virtual disk is effectively stored. This backend can be a physical disk, a physical disk slice, a file or a volume (ZFS, SVM, VxVM...). The way a backend is exported (either as a full disk or as a single slice disk) depends on the type of backend. Remember that it is not possible to install Solaris on a single slice disk.

Here is some additional information about each type of backend:

FAQ


Feb 05 2008, 06:09:50 AM PST Permalink Comments [2]

20070925 Tuesday September 25, 2007
Virtual Disk Label Problem is Fixed
Bug 6544963 was an annoying problem which could cause some virtual disks to lose their disk label when rebinding a domain, and the fcksum script had been provided as a workaround. This problem is now fixed in Solaris 10 8/07 which has recently been released; the fix for this problem is also available in the Solaris 10 patch 120011-14.

Solaris 10 8/07 and patch 120011-14 contain several other fixes for LDoms; you can check Liam's blog for details.


Sep 25 2007, 02:25:57 AM PDT Permalink

20070821 Tuesday August 21, 2007
Devices and PCI Buses on Sun Fire T2000
I have recently described how to set up a split PCI configuration with LDoms on a Sun Fire T2000. But I have just discovered (thanks to my co-worker Jochen Behrens) that the device layout differs between versions of the Sun Fire T2000:

Aug 21 2007, 08:54:18 AM PDT Permalink

20070820 Monday August 20, 2007
Utility to easily use Linux disk images with LDoms
After my first look at Linux with LDoms, I investigated further what is wrong with the disk label of Linux images run with LDoms. The problem is that the disk label created by Linux (GNU Parted Custom label) is not entirely correct: it does not define the number of partitions (which should be 8) and it does not mark the disk vtoc as sane (a sane vtoc is identified with the value 0x600DDEEE). Hence the LDoms virtual disk server does not recognize this disk label as a valid Sun VTOC label.

So I wrote a small program (vdlinux) which corrects these invalid values. It just has to be run on a Linux disk image and then that disk image can be used with LDoms without having to do any fancy tricks (like changing the disk label before and after binding the domain). The utility also applies the workaround for bug 6544963 if this is needed.

# vdlinux bootlinux 
Incorrect number of partition (0).
Updating number of partition to 8.
Incorrect vtoc sanity (0).
Correcting vtoc sanity (600ddeee).
Applying workaround for bug 6544963.
Updating label checksum.
You only need to run the program once on your Linux disk image. So it is best to do it right after you have generated your disk image and unbound the domain. If you execute the program another time, it will just say that the label is correct:
# vdlinux bootlinux 
Label looks correct.
Then you can start your Linux domain as a regular domain without having to care about the label of the Linux disk image.
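To make the two checks more concrete, here is a minimal Python sketch of what inspecting those label fields could look like. It is not the vdlinux source. The field offsets (140 for the number of partitions, 188 for the vtoc sanity word, both big-endian) are an assumption based on the commonly published layout of the 512-byte Sun disk label, and the function name is made up for this illustration. Only the expected values (8 partitions, 0x600DDEEE) come from the text above.

```python
import struct

# Assumed offsets within the 512-byte Sun disk label (big-endian fields).
NPARTS_OFFSET = 140        # 16-bit number of partitions
SANITY_OFFSET = 188        # 32-bit vtoc sanity word
VTOC_SANE = 0x600DDEEE     # value quoted in the post
EXPECTED_NPARTS = 8        # value quoted in the post

def check_label(label):
    """Return a list of problems found in a 512-byte label."""
    if len(label) < 512:
        raise ValueError("a disk label is 512 bytes")
    problems = []
    (nparts,) = struct.unpack_from(">H", label, NPARTS_OFFSET)
    (sanity,) = struct.unpack_from(">I", label, SANITY_OFFSET)
    if nparts != EXPECTED_NPARTS:
        problems.append(f"Incorrect number of partition ({nparts}).")
    if sanity != VTOC_SANE:
        problems.append(f"Incorrect vtoc sanity ({sanity:x}).")
    return problems

# Example with an in-memory label instead of a real disk image.
label = bytearray(512)
print(check_label(label))   # both fields are reported as incorrect
struct.pack_into(">H", label, NPARTS_OFFSET, EXPECTED_NPARTS)
struct.pack_into(">I", label, SANITY_OFFSET, VTOC_SANE)
print(check_label(label))   # no problems reported
```

A real tool would read the first 512 bytes of the disk image, and after changing any field it would also have to update the label checksum, as the vdlinux output shows.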

The vdlinux utility is available here:

The source file can be compiled with the following command:
cc -o vdlinux vdlinux.c

Aug 20 2007, 09:31:11 AM PDT Permalink

20070816 Thursday August 16, 2007
Split PCI with LDoms
With LDoms, you have the ability to partition hardware resources so that you can create multiple domains which have direct access to the hardware of the system. This is very similar to the dynamic system domains feature available on Sun's mid-range and high-end systems such as the Sun Fire E20K/E25K or the new Sun SPARC Enterprise M9000 or M8000. The main difference is that with LDoms the partitioning is done by the software through the hypervisor layer while it is done by the hardware for dynamic system domains.

I/O domains

Logical domains which have direct access to the hardware are called I/O domains. Obviously, you will have at least one I/O domain and this will be the first domain created on the system i.e. the primary domain. Then you can create additional I/O domains by removing some hardware resources from the primary domain and assigning them to another domain. Finally the number of I/O domains you can create depends on the hardware resources available on your system so it eventually depends on the type of system you are using.

PCI buses on Sun Fire T2000

Let's look at the Sun Fire T2000 Server for a concrete example. On this system, the smallest hardware resource you can assign to a domain is an entire PCI bus; and the Sun Fire T2000 has only two PCI buses hence you can create a maximum of two I/O domains.

The two PCI buses of the Sun Fire T2000 server are initially assigned to the primary domain. The buses are identified as pci@780 (or bus_a) and pci@7c0 (or bus_b) and they are connected to the following devices:

As you can see, both buses have two network interfaces, but other resources are not so evenly spread: pci@7c0 (bus_b) has all the internal disks, the DVD-ROM and 4 PCI slots while pci@780 (bus_a) has only one PCI slot.

So creating an I/O domain with bus pci@7c0 (bus_b) is no problem because you get all the basic hardware resources you need (i.e. a disk and a network interface). But when using bus pci@780 (bus_a), you only get some network interfaces and no disk. Hence to create an I/O domain with pci@780 (bus_a) you will have to add a PCI-E card (either a Fibre Channel or a SCSI host adapter) in the PCI-E slot 0 to get access to some storage devices. You also have to ensure that the card you are adding can be used to boot the system.

Configuration of the primary domain

Initially both PCI buses are assigned to the primary domain. You can verify this with the "ldm list-bindings" command:

primary# ldm list-bindings primary
...
    IO:    pci@780 (bus_a)
           pci@7c0 (bus_b)
...
However to be able to split the PCI buses, the primary domain should be using devices from only one PCI bus and, most of the time, you will use devices from bus pci@7c0 (bus_b) because the system disk of the primary domain is an internal disk.

After checking that the primary domain is only using devices from bus pci@7c0 (bus_b), you can remove bus pci@780 (bus_a) from the configuration of the primary domain. This can be done using the "ldm remove-io" command:

primary# ldm remove-io bus_a primary
The reconfiguration is not immediate: you will have to reboot the primary domain for the removal of pci@780 (bus_a) to take effect. After the primary domain is rebooted, you can check that it now only owns bus pci@7c0 (bus_b):
primary# ldm list-bindings primary
...
    IO:    pci@7c0 (bus_b)
...
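If you want to script this check, the IO lines shown above are easy to pick out. The following Python sketch only handles the excerpt as it appears in this post. The full "ldm list-bindings" output contains many other sections and its exact format may differ between releases, so treat the regular expression and the function name as assumptions.

```python
import re

def io_buses(bindings_text):
    """Extract (device, alias) pairs such as ('pci@780', 'bus_a')."""
    return re.findall(r"(pci@[0-9a-f]+)\s+\((\w+)\)", bindings_text)

# Excerpts copied from the outputs shown in this post.
before = """\
    IO:    pci@780 (bus_a)
           pci@7c0 (bus_b)
"""
after = """\
    IO:    pci@7c0 (bus_b)
"""

print(io_buses(before))
print(io_buses(after))

# The primary domain is ready for the split when it owns only bus_b.
aliases = [alias for _device, alias in io_buses(after)]
print("split done:", aliases == ["bus_b"])
```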

Configuration of the alternate I/O domain

Now that PCI bus pci@780 (bus_a) is available, you can assign it to another domain. To do so, you just have to use the "ldm add-io" command while configuring your alternate domain:

primary# ldm create alternate
primary# ldm set-vcpu 4 alternate
primary# ldm set-mem 4G alternate
primary# ldm add-io bus_a alternate
This creates an alternate I/O domain with 4 cpus, 4GB of memory and the PCI bus pci@780 (bus_a). After the alternate domain is configured, it can be started as a regular domain with the "ldm bind" and "ldm start" commands:
primary# ldm bind alternate
primary# ldm start alternate
When the alternate domain is bound, you can check that it is using bus pci@780 (bus_a):
primary# ldm list-bindings alternate
...
    IO:    pci@780 (bus_a)
...
And you can connect to the console of that domain to install it. The installation can be done through the network with a "boot net", just like installing a regular SPARC system.

Differences on Sun Fire T1000

You can set up the same configuration on a Sun Fire T1000 Server. The Sun Fire T1000 has two PCI buses similar to the two PCI buses of the Sun Fire T2000: pci@780 (bus_a) and pci@7c0 (bus_b). But the Sun Fire T1000 has no PCI-E and PCI-X slots on bus pci@7c0 (bus_b). Fortunately it still has PCI-E slot 0 on bus pci@780 (bus_a) which can be used to plug in an FC or SCSI host adapter to connect some storage for the alternate domain.

Virtual I/O Failover

Once you have more than one I/O domain, you can configure virtual I/O failover for guest domains. Check out Narayan's blog for details: Part One and Part Two.


Aug 16 2007, 10:31:30 AM PDT Permalink Comments [2]

20070810 Friday August 10, 2007
Linux with LDoms
I have spent some time playing with the port of Linux to LDoms that Dave Miller and Fabio have done. Dave provided us with a Linux disk image and I initially had a hard time trying to boot from this image: first because it has been a long time since I last played with Linux, especially on SPARC, and second because I was trying to understand the different tricks Dave and Fabio describe to deal with disk labels.

Linux on UltraSPARC-T2

Anyway, it eventually works and it works fine. You can see the result with a demo on Ash's blog. And note that the demo and the tests have been done on a system with an UltraSPARC-T2 processor, so Linux does work with the UltraSPARC-T2. Here is a log of the boot sequence:

{0} ok boot
Boot device: rootdisk  File and args: 
SILO Version 1.4.13
boot: linux.2623
Allocated 8 Megs of memory at 0x40000000 for kernel
Loaded kernel version 2.6.23

Remapping the kernel... done.
OF stdout device is: /virtual-devices@100/console@1
Booting Linux...
[585488.953894] VIO: Adding device channel-devices
[585488.954061] VIO: Adding device vnet-port-0-0
[585488.954202] VIO: Adding device vdc-port-0-0
[585488.954350] VIO: Adding device ds-0
... snip ...
 * Running local boot scripts (/etc/rc.local)      [ OK ]

Ubuntu gutsy (development branch) t2k-linux1 ttyS0

t2k-linux1 login: root
Password:
Last login: Fri Aug 10 10:48:13 2007 on ttyS0
Linux t2k-linux1 2.6.23-rc1 #1 SMP Sun Jul 29 21:19:34 PDT 2007 sparc64

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

root@t2k-linux1:~# uname -a
Linux t2k-linux1 2.6.23-rc1 #1 SMP Sun Jul 29 21:19:34 PDT 2007 sparc64 GNU/Linux

root@t2k-linux1:~# grep CPU /proc/cpuinfo | wc -l
60

root@t2k-linux1:~# cat /proc/cpuinfo
cpu		: UltraSparc T1 (Niagara)
fpu		: UltraSparc T1 integrated FPU
prom		: OBP 4.27.0.build_03***PROTOTYPE BUILD*** 2007/07/27 18:48
type		: sun4v
ncpus probed	: 60
ncpus active	: 60
D$ parity tl1	: 0
I$ parity tl1	: 0
... snip ...
MMU Type	: Hypervisor (sun4v)
State:
CPU0:		online
CPU1:		online
CPU2:		online
CPU3:		online
... snip ...
CPU57:		online
CPU58:		online
CPU59:		online
Linux identifies the processor as an UltraSPARC-T1 but this is really an UltraSPARC-T2. The evidence is that the UltraSPARC-T1 has only 32 threads and here we have a Linux domain running with 60 (yes 60!) cpus. The UltraSPARC-T2 has 64 threads and this system was configured with a primary domain running Solaris with 4 cpus and a guest domain running Linux with 60 cpus. Note that we have to use a Linux 2.6.23 kernel to be able to boot the UltraSPARC-T2 processor.

Linux Domain bind/start Tricks

When you have a Linux disk image, Dave Miller and Fabio mention some tricky steps to be able to boot from that disk image because the LDoms virtual disk server mangles partition tables.

Here is a simpler procedure: let's say you have a Linux disk image /ldoms/disklinux and you have configured the domain linux-domain to use that image. Then if you just do an "ldm bind" and "ldm start" of the linux-domain, Linux will not boot correctly. What you need to do is:

And then you can start Linux.

Why do we need to do that? On the Linux disk image, you will have a fake Sun VTOC disk label that defines 0 partitions. If you directly bind and start the Linux domain with this disk label then the virtual disk server will read the label and, accordingly, it will see that no partition is defined. Then later, when Linux starts, it will ask the virtual disk server to read from slice 2, but as no partition is defined the virtual disk server will return an error and Linux will be unable to read from the disk.

When we erase the disk label and bind the domain, the virtual disk server will create a default partitioning with partition 2 representing the entire disk. After the domain is bound the virtual disk server will not read the disk label again so the original label can be restored. Then when Linux reads from slice 2, there will be no problem because the virtual disk server now knows about slice 2.
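The reasoning above can be captured in a toy model. This is only an illustration of the behaviour described in this post, not of how the vds driver is actually implemented. The class, its methods and the way a label is represented are all made up for the example.

```python
class ToyDiskServer:
    """Toy model: the server looks at the label only when the domain is bound."""

    def __init__(self):
        self.slices = set()

    def bind(self, label_partitions):
        # label_partitions is None when no label is found on the backend,
        # otherwise it is the set of slice numbers defined by the label.
        if label_partitions is None:
            self.slices = {2}   # default partitioning: slice 2 is the whole disk
        else:
            self.slices = set(label_partitions)

    def read(self, slice_number):
        if slice_number not in self.slices:
            raise OSError(f"slice {slice_number} is not defined")
        return "data"

server = ToyDiskServer()

# Case 1: bind with the Linux label in place (it defines 0 partitions).
server.bind(label_partitions=set())
try:
    server.read(2)
except OSError as err:
    print("read failed:", err)

# Case 2: the label is erased before the bind. Restoring it afterwards
# changes nothing because the server does not read the label again.
server.bind(label_partitions=None)
print("read ok:", server.read(2))
```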

This will be improved with some next version of the virtual disk server driver and probably a change in the Linux virtual disk so that none of these tricks are required to start a Linux domain.


Aug 10 2007, 11:20:04 AM PDT Permalink Comments [1]

20070731 Tuesday July 31, 2007
LDoms and Virtual Disks
When the LDoms product was first released, a very annoying problem was that the format(1m) command did not work in a guest domain with virtual disks. So to change the partitioning of a virtual disk, one had to use the fmthard(1m) command, which is not a very user-friendly command.

The good news is that bug 6531557 (format(1m) does not work with virtual disks) has just been fixed in Solaris Nevada and OpenSolaris. So the format(1m) command now works with virtual disks in an LDoms guest domain. The fix for Solaris 10 should come later as a patch.

Note that some format(1m) sub-commands will still not work because such commands only work with SCSI disks and virtual disks do not currently appear as SCSI disks (even when a virtual disk is created from a physical SCSI disk).

The following shows which format(1m) sub-commands work with virtual disks and which do not:

FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
Commands in green work with virtual disks.
Commands in red do not work with virtual disks.

More good news is that this also fixes some underlying problems such as doing I/O using an absolute disk offset, or providing the correct virtual disk size. These problems did not impact the end-user but they caused trouble for developers such as Dave Miller and Fabio, who are working hard to have Linux running with LDoms.

Although this fix does not solve all problems Dave and Fabio are facing, it introduces the foundation for a next fix which hopefully should solve everything by introducing the support for unformatted disk (bug 6575050) and this will avoid the hacks currently required to be able to use a Linux disk image.


Jul 31 2007, 10:00:07 AM PDT Permalink

20070503 Thursday May 03, 2007
Some virtual disks can lose label when rebinding a domain
Logical Domains (LDoms) can export a file as a virtual disk. Unfortunately there is a case where the label of such a disk can be lost. This is a known problem which is already fixed in OpenSolaris and that will be fixed in the next update of Solaris 10. But in the meantime, you will have to be careful when using a file as a virtual disk and you should use the following procedure and script to avoid any problem.

Problem

In some cases, when a file is used as a virtual disk, the label of that virtual disk can be lost when rebinding a domain (ldm bind) using that file (or a copy of that file) as a virtual disk.

For example, if a domain uses a file as a virtual disk and the Solaris system gets installed on that virtual disk (using boot net) then everything will run without any problem as long as the domain is not unbound. If the domain is unbound (using ldm unbind) then the label on the file used as a virtual disk might be lost the next time a domain using that file is bound (using ldm bind). In such a situation, the newly bound domain will be unable to use the system installed on the virtual disk and it will fail to boot with an error like "the file does not appear to be bootable" or it might fail to mount the root filesystem with an error like "vfs_mountroot: cannot mount root".

This problem is referenced as bug 6544963 (vdisk can lose label when rebinding domain)

Workaround

To prevent this problem, you need to use the following script fcksum. This script will check if you are in the case where the label will not be correctly validated during the next "ldm bind", and if this is the case it will change the label and its checksum so that it can be correctly validated. The script should be run on any file that has been used as a virtual disk and for which the disk label or disk partitioning has been changed. The script should be run right after the domain using the virtual disk is unbound (ldm unbind) for the first time.

For example, if file filedisk is used by a domain as a virtual disk and if the Solaris system is being installed onto that virtual disk then you should run the script after doing the first "ldm unbind" on that domain. Note that the script should be run before doing any "ldm bind", otherwise you can lose the disk label.

The syntax to run the script is: ./fcksum filedisk

Note that the script will first back up the existing label of the file in a file named label.file.day_time.

Here is the output you will get if your label is updated:

   $ ./fcksum rootdisk
   Backing up original label in label.rootdisk.070314_201917
                   Changing checksum
   0x1fe:          0xe456  =       0x6456
                   Changing dummy field
   0x1b8:          0       =       0x8000
                   Label checksum has been updated
Otherwise if the label does not need to be updated, you will get:
   $ ./fcksum rootdisk
   Backing up original label in label.rootdisk.070314_201005
                   Label checksum is okay
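The numbers in the first fcksum output above are consistent with an XOR checksum: changing the word at 0x1b8 from 0 to 0x8000 flips the same bit in the checksum at 0x1fe, which goes from 0xe456 to 0x6456. The Python sketch below illustrates that relationship. It assumes that the label checksum is the XOR of all the 16-bit big-endian words that precede it, which is how Sun disk labels are commonly described. It is not the fcksum script, and it does not explain why the script picks this particular field.

```python
import struct

LABEL_SIZE = 512
CKSUM_OFFSET = 0x1FE   # last 16-bit word, as shown in the fcksum output
DUMMY_OFFSET = 0x1B8   # the "dummy field" shown in the fcksum output

def label_checksum(label):
    """XOR of every 16-bit big-endian word before the checksum word."""
    cksum = 0
    for (word,) in struct.iter_unpack(">H", bytes(label[:CKSUM_OFFSET])):
        cksum ^= word
    return cksum

# Build an otherwise empty label whose checksum is 0xe456.
label = bytearray(LABEL_SIZE)
struct.pack_into(">H", label, 0, 0xE456)
print(hex(label_checksum(label)))                    # 0xe456

# Setting the dummy field to 0x8000 flips the top bit of the checksum.
struct.pack_into(">H", label, DUMMY_OFFSET, 0x8000)
print(hex(label_checksum(label)))                    # 0x6456
```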

Download the fcksum script (use the Save Link as... option of your browser)


May 03 2007, 11:21:50 AM PDT Permalink