The HyperTrap
Alexandre Chartre's Weblog
20080205 Tuesday February 05, 2008
LDoms Virtual Disks
A virtual disk is made of two components: the virtual disk itself as it appears in a guest domain, and the virtual disk backend, which is where data is actually stored and where virtual I/Os eventually end up. The virtual disk backend is exported from a service domain by the virtual disk server (vds) driver. The vds driver communicates with the virtual disk client (vdc) driver in the guest domain through the hypervisor using a logical domain channel (ldc). Finally, a virtual disk appears as a /dev/[r]dsk/cXdYsZ device in the guest domain.

The virtual disk backend can be a physical disk, a physical disk slice, a file or a volume from a volume management framework (like ZFS, SVM, VxVM...).

A backend is exported from a domain with the command "ldm add-vdsdev" (or "ldm add-vdiskserverdevice"):

    # ldm add-vdsdev <backend> <volume_name>@<service_name>
And it is assigned to another domain with the command "ldm add-vdisk":
    # ldm add-vdisk <disk_name> <volume_name>@<service_name> <domain>
Note that a backend is actually exported only once the domain <domain> is bound.
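
As a rough illustration of the naming scheme, here is a minimal Python sketch that assembles the two commands. Only the ldm sub-commands and the order of their arguments come from the text above; the function name and the sample backend, volume, service, disk and domain names are hypothetical.

```python
def vdisk_commands(backend, volume, service, disk, domain):
    """Return the two ldm command lines (as argv lists) described above."""
    # A volume is always referred to as <volume_name>@<service_name>.
    volume_id = f"{volume}@{service}"
    export_cmd = ["ldm", "add-vdsdev", backend, volume_id]
    assign_cmd = ["ldm", "add-vdisk", disk, volume_id, domain]
    return [export_cmd, assign_cmd]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for argv in vdisk_commands("/ldoms/disk0.img", "vol0", "primary-vds0",
                               "vdisk0", "guest1"):
        print("# " + " ".join(argv))
```

Building the <volume_name>@<service_name> string once makes it obvious that both commands must refer to the same volume.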

Virtual Disk Export Options

There are two ways a backend can be exported as a virtual disk, either as a full disk or as a single slice disk. Currently, the way a backend is exported (either as a full disk or as single slice disk) depends on the type of backend (whether it is a disk, a slice, a file or a volume). The next section (Virtual Disk Backend) explains how each type of backend is exported.

Virtual Disk Backend

The virtual disk backend is the location where the data of a virtual disk is actually stored. This backend can be a physical disk, a physical disk slice, a file or a volume (ZFS, SVM, VxVM...). The way a backend is exported (either as a full disk or as a single slice disk) depends on the type of backend. Remember that it is not possible to install Solaris on a single slice disk.

Here is some additional information about each type of backend:

FAQ


Feb 05 2008, 06:09:50 AM PST

20070925 Tuesday September 25, 2007
Virtual Disk Label Problem is Fixed
Bug 6544963 was an annoying problem which could cause some virtual disks to lose their disk label when rebinding a domain, and the fcksum script had been provided as a workaround. This problem is now fixed in Solaris 10 8/07 which has recently been released; the fix for this problem is also available in the Solaris 10 patch 120011-14.

Solaris 10 8/07 and patch 120011-14 contain several other fixes for LDoms; you can check Liam's blog for details.


Sep 25 2007, 02:25:57 AM PDT

20070821 Tuesday August 21, 2007
Devices and PCI Buses on Sun Fire T2000
I have recently described how to set up a split PCI configuration with LDoms on a Sun Fire T2000. But I have just discovered (thanks to my co-worker Jochen Behrens) that the device layout differs between versions of the Sun Fire T2000:

Aug 21 2007, 08:54:18 AM PDT

20070820 Monday August 20, 2007
Utility to easily use Linux disk images with LDoms
After my first look at Linux with LDoms, I investigated further what is wrong with the disk label of the Linux images used with LDoms. The problem is that the disk label created by Linux (GNU Parted Custom label) is not entirely correct: it does not define the number of partitions (which should be 8) and it does not mark the disk vtoc as sane (a sane vtoc is identified by the value 0x600DDEEE). Hence the LDoms virtual disk server does not recognize this disk label as a valid Sun VTOC label.

So I wrote a small program (vdlinux) which corrects these invalid values. It just has to be run on a Linux disk image, and then that disk image can be used with LDoms without any fancy tricks (like changing the disk label before and after binding the domain). The utility also applies the workaround for bug 6544963 if this is needed.

# vdlinux bootlinux 
Incorrect number of partition (0).
Updating number of partition to 8.
Incorrect vtoc sanity (0).
Correcting vtoc sanity (600ddeee).
Applying workaround for bug 6544963.
Updating label checksum.
You only need to run the program once on your Linux disk image. So it is best to do it right after you have generated your disk image and unbound the domain. If you execute the program another time, it will just say that the label is correct:
# vdlinux bootlinux 
Label looks correct.
Then you can start your Linux domain as a regular domain without having to care about the label of the Linux disk image.
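
The two checks reported in the output above can be sketched in Python as follows. This is not the vdlinux source, only a minimal illustration: the field offsets and the big-endian byte order are assumptions based on the usual Sun VTOC label layout and are not stated in this post; only the expected values (8 partitions, sanity 0x600DDEEE) come from the text.

```python
import struct

# Offsets within the 512-byte Sun label. These are assumptions based on
# the usual Sun VTOC label layout, not values stated in this post.
NPARTS_OFFSET = 140   # 16-bit number of partitions
SANITY_OFFSET = 188   # 32-bit vtoc sanity value

VTOC_SANE = 0x600DDEEE   # value identifying a sane vtoc (from the post)
EXPECTED_NPARTS = 8      # expected number of partitions (from the post)


def label_problems(label):
    """Return a list describing the problems found in a 512-byte label."""
    if len(label) != 512:
        raise ValueError("a Sun disk label is 512 bytes long")
    # Assumption: label fields are stored in big-endian byte order.
    (nparts,) = struct.unpack_from(">H", label, NPARTS_OFFSET)
    (sanity,) = struct.unpack_from(">I", label, SANITY_OFFSET)
    problems = []
    if nparts != EXPECTED_NPARTS:
        problems.append(f"incorrect number of partitions ({nparts})")
    if sanity != VTOC_SANE:
        problems.append(f"incorrect vtoc sanity ({sanity:x})")
    return problems
```

An empty list means that both fields already hold the expected values. Fixing a label would additionally require updating the label checksum, which this sketch does not do.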

The vdlinux utility is available here:

The source file can be compiled with the following command:
cc -o vdlinux vdlinux.c

Aug 20 2007, 09:31:11 AM PDT

20070816 Thursday August 16, 2007
Split PCI with LDoms
With LDoms, you have the ability to partition hardware resources so that you can create multiple domains which have direct access to the hardware of the system. This is very similar to the dynamic system domains feature available on Sun's mid-range and high-end systems such as the Sun Fire E20K/E25K or the new Sun SPARC Enterprise M9000 or M8000. The main difference is that with LDoms the partitioning is done by the software through the hypervisor layer while it is done by the hardware for dynamic system domains.

I/O domains

Logical domains which have direct access to the hardware are called I/O domains. Obviously, you will have at least one I/O domain and this will be the first domain created on the system i.e. the primary domain. Then you can create additional I/O domains by removing some hardware resources from the primary domain and assigning them to another domain. Finally the number of I/O domains you can create depends on the hardware resources available on your system so it eventually depends on the type of system you are using.

PCI buses on Sun Fire T2000

Let's look at the Sun Fire T2000 Server for a concrete example. On this system, the smallest hardware resource you can assign to a domain is an entire PCI bus; and the Sun Fire T2000 has only two PCI buses hence you can create a maximum of two I/O domains.

The two PCI buses of the Sun Fire T2000 server are initially assigned to the primary domain. The buses are identified as pci@780 (or bus_a) and pci@7c0 (or bus_b) and they are connected to the following devices:

As you can see, both buses have two network interfaces, but other resources are not so evenly spread: pci@7c0 (bus_b) has all the internal disks, the DVD-ROM and 4 PCI slots while pci@780 (bus_a) has only one PCI slot.

So creating an I/O domain with bus pci@7c0 (bus_b) is not a problem because you get all the basic hardware resources you need (i.e. a disk and a network interface). But when using bus pci@780 (bus_a), you only get some network interfaces and no disk. Hence, to create an I/O domain with pci@780 (bus_a), you will have to add a PCI-E card (either a Fibre Channel or a SCSI host adapter) in PCI-E slot 0 to get access to some storage devices. You also have to ensure that the card you are adding can be used to boot the system.

Configuration of the primary domain

Initially both PCI buses are assigned to the primary domain. You can verify this with the "ldm list-bindings" command:

primary# ldm list-bindings primary
...
    IO:    pci@780 (bus_a)
           pci@7c0 (bus_b)
...
However to be able to split the PCI buses, the primary domain should be using devices from only one PCI bus and, most of the time, you will use devices from bus pci@7c0 (bus_b) because the system disk of the primary domain is an internal disk.
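
One way to check which buses are in use is to look at the physical device paths of the devices the domain relies on and group them by their top-level bus node. Here is a minimal Python sketch of that idea; the function name is hypothetical, and it assumes that each physical device path starts with the bus node (pci@780 or pci@7c0).

```python
def buses_in_use(device_paths):
    """Return the set of top-level PCI bus nodes used by the given paths."""
    buses = set()
    for path in device_paths:
        # The first path component names the bus, e.g. "pci@7c0".
        first = path.lstrip("/").split("/", 1)[0]
        if first.startswith("pci@"):
            buses.add(first)
    return buses
```

If the result for the devices used by the primary domain contains a single bus, the other bus is a candidate for removal.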

After checking that the primary domain is only using devices from bus pci@7c0 (bus_b), you can remove bus pci@780 (bus_a) from the configuration of the primary domain. This can be done using the "ldm remove-io" command:

primary# ldm remove-io bus_a primary
The reconfiguration is not immediate and you will have to reboot the primary domain for the removal of pci@780 (bus_a) to take effect. After the primary domain is rebooted, you can check that it now only owns bus pci@7c0 (bus_b):
primary# ldm list-bindings primary
...
    IO:    pci@7c0 (bus_b)
...

Configuration of the alternate I/O domain

Now that PCI bus pci@780 (bus_a) is available, you can assign it to another domain. To do so, you just have to use the "ldm add-io" command while configuring your alternate domain:

primary# ldm create alternate
primary# ldm set-vcpu 4 alternate
primary# ldm set-mem 4G alternate
primary# ldm add-io bus_a alternate
This creates an alternate I/O domain with 4 cpus, 4GB of memory and the PCI bus pci@780 (bus_a). After the alternate domain is configured, it can be started as a regular domain with the "ldm bind" and "ldm start" commands:
primary# ldm bind alternate
primary# ldm start alternate
When the alternate domain is bound, you can check that it is using bus pci@780 (bus_a):
primary# ldm list-bindings alternate
...
    IO:    pci@780 (bus_a)
...
And you can connect to the console of that domain to install it. The installation can be done through the network with a "boot net", as when installing a regular SPARC system.

Differences on Sun Fire T1000

You can set up the same configuration on a Sun Fire T1000 Server. The Sun Fire T1000 has two PCI buses similar to the two PCI buses of the Sun Fire T2000: pci@780 (bus_a) and pci@7c0 (bus_b). But the Sun Fire T1000 has no PCI-E and PCI-X slots on bus pci@7c0 (bus_b). Fortunately it still has PCI-E slot 0 on bus pci@780 (bus_a), which can be used to plug in an FC or SCSI host adapter to connect some storage for the alternate domain.

Virtual I/O Failover

Once you have more than one I/O domain, you can configure virtual I/O failover for guest domains. Check out Narayan's blog for details: Part One and Part Two.


Aug 16 2007, 10:31:30 AM PDT

20070810 Friday August 10, 2007
Linux with LDoms
I have spent some time playing with the port of Linux to LDoms that Dave Miller and Fabio have done. Dave has provided us with a Linux disk image and I initially had a hard time trying to boot from this image: first because it has been a long time since I last played with Linux, especially on SPARC, and second because I was trying to understand the different tricks Dave and Fabio describe to deal with disk labels.

Linux on UltraSPARC-T2

Anyway, it eventually works and it works fine. You can see the result with a demo on Ash's blog. And note that the demo and the tests have been done on a system with an UltraSPARC-T2 processor, so Linux does work with the UltraSPARC-T2. Here is a log of the boot sequence:

{0} ok boot
Boot device: rootdisk  File and args: 
SILO Version 1.4.13
boot: linux.2623
Allocated 8 Megs of memory at 0x40000000 for kernel
Loaded kernel version 2.6.23

Remapping the kernel... done.
OF stdout device is: /virtual-devices@100/console@1
Booting Linux...
[585488.953894] VIO: Adding device channel-devices
[585488.954061] VIO: Adding device vnet-port-0-0
[585488.954202] VIO: Adding device vdc-port-0-0
[585488.954350] VIO: Adding device ds-0
... snip ...
 * Running local boot scripts (/etc/rc.local)      [ OK ]

Ubuntu gutsy (development branch) t2k-linux1 ttyS0

t2k-linux1 login: root
Password:
Last login: Fri Aug 10 10:48:13 2007 on ttyS0
Linux t2k-linux1 2.6.23-rc1 #1 SMP Sun Jul 29 21:19:34 PDT 2007 sparc64

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

root@t2k-linux1:~# uname -a
Linux t2k-linux1 2.6.23-rc1 #1 SMP Sun Jul 29 21:19:34 PDT 2007 sparc64 GNU/Linux

root@t2k-linux1:~# grep CPU /proc/cpuinfo | wc -l
60

root@t2k-linux1:~# cat /proc/cpuinfo
cpu		: UltraSparc T1 (Niagara)
fpu		: UltraSparc T1 integrated FPU
prom		: OBP 4.27.0.build_03***PROTOTYPE BUILD*** 2007/07/27 18:48
type		: sun4v
ncpus probed	: 60
ncpus active	: 60
D$ parity tl1	: 0
I$ parity tl1	: 0
... snip ...
MMU Type	: Hypervisor (sun4v)
State:
CPU0:		online
CPU1:		online
CPU2:		online
CPU3:		online
... snip ...
CPU57:		online
CPU58:		online
CPU59:		online
Linux identifies the processor as an UltraSPARC-T1 but this is really an UltraSPARC-T2. The evidence is that the UltraSPARC-T1 has only 32 threads and here we have a Linux domain running with 60 (yes 60!) cpus. The UltraSPARC-T2 has 64 threads and this system was configured with a primary domain running Solaris with 4 cpus and a guest domain running Linux with 60 cpus. Note that we have to use a Linux 2.6.23 kernel to be able to boot on the UltraSPARC-T2 processor.

Linux Domain bind/start Tricks

When you have a Linux disk image, Dave Miller and Fabio mention some tricky steps to be able to boot from that disk image because the LDoms virtual disk server mangles partition tables.

Here is a simpler procedure: let's say you have a Linux disk image /ldoms/disklinux and you have configured the domain linux-domain to use that image. Then if you just do an "ldm bind" and "ldm start" of the linux-domain, Linux will not boot correctly. What you need to do is:

And then you can start Linux.

Why do we need to do that? On the Linux disk image, you will have a fake Sun VTOC disk label that defines 0 partitions. If you directly bind and start the Linux domain with this disk label then the virtual disk server will read the label and, accordingly, it will see that no partition is defined. Then later, when Linux starts, it will request the virtual disk server to read from slice 2, but as no partition is defined the virtual disk server will return an error and Linux will be unable to read from the disk.

When we erase the disk label and bind the domain, the virtual disk server will create a default partitioning with partition 2 representing the entire disk. After the domain is bound the virtual disk server will not read the disk label again so the original label can be restored. Then when Linux reads from slice 2, there will be no problem because the virtual disk server now knows about slice 2.

This will be improved in a future version of the virtual disk server driver, and probably with a change in the Linux virtual disk, so that none of these tricks are required to start a Linux domain.


Aug 10 2007, 11:20:04 AM PDT

20070731 Tuesday July 31, 2007
LDoms and Virtual Disks
When the LDoms product was first released, a very annoying problem was that the format(1m) command did not work in a guest domain with virtual disks. So anyone who wanted to change the partitioning of a virtual disk had to use the fmthard(1m) command, which is not very user friendly.

The good news is that bug 6531557 (format(1m) does not work with virtual disks) has just been fixed in Solaris Nevada and OpenSolaris. So the format(1m) command now works with virtual disks in an LDoms guest domain. The fix for Solaris 10 should come later as a patch.

Note that some format(1m) sub-commands will still not work because they only work with SCSI disks, and virtual disks do not currently appear as SCSI disks (even when a virtual disk is created from a physical SCSI disk).

The following shows which format(1m) sub-commands work with virtual disks and which do not:

FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
Commands in green work with virtual disks.
Commands in red do not work with virtual disks.

More good news is that this also fixes some underlying problems, such as doing I/O using an absolute disk offset or providing the correct virtual disk size. These problems did not affect end users, but they were causing trouble for developers such as Dave Miller and Fabio, who are working hard to get Linux running with LDoms.

Although this fix does not solve all the problems Dave and Fabio are facing, it lays the foundation for a subsequent fix which should hopefully solve everything by introducing support for unformatted disks (bug 6575050). This will avoid the hacks currently required to use a Linux disk image.


Jul 31 2007, 10:00:07 AM PDT

20070503 Thursday May 03, 2007
Some virtual disks can lose label when rebinding a domain
Logical Domains (LDoms) can export a file as a virtual disk. Unfortunately there is a case where the label of such a disk can be lost. This is a known problem which is already fixed in OpenSolaris and that will be fixed in the next update of Solaris 10. But in the meantime, you will have to be careful when using a file as a virtual disk and you should use the following procedure and script to avoid any problem.

Problem

In some cases, when a file is used as a virtual disk, the label of that virtual disk can be lost when rebinding a domain (ldm bind) using that file (or a copy of that file) as a virtual disk.

For example, if a domain uses a file as a virtual disk and the Solaris system gets installed on that virtual disk (using boot net) then everything will run without any problem as long as the domain is not unbound. If the domain is unbound (using ldm unbind) then the label on the file used as a virtual disk might be lost the next time a domain using that file is bound (using ldm bind). In such a situation, the newly bound domain will be unable to use the system installed on the virtual disk and it will fail to boot with an error like "the file does not appear to be bootable" or it might fail to mount the root filesystem with an error like "vfs_mountroot: cannot mount root".

This problem is referenced as bug 6544963 (vdisk can lose label when rebinding domain).

Workaround

To prevent this problem, you need to use the following script fcksum. This script will check if you are in the case where the label will not be correctly validated during the next "ldm bind", and if this is the case it will change the label and its checksum so that it can be correctly validated. The script should be run on any file that has been used as a virtual disk and for which the disk label or disk partitioning has been changed. The script should be run right after the domain using the virtual disk is unbound (ldm unbind) for the first time.

For example, if file filedisk is used by a domain as a virtual disk and if the Solaris system is being installed onto that virtual disk then you should run the script after doing the first "ldm unbind" on that domain. Note that the script should be run before doing any "ldm bind", otherwise you can lose the disk label.

The syntax to run the script is: ./fcksum filedisk

Note that the script will first back up the existing label of the file in a file named label.file.day_time.

Here is the output you will get if your label is updated:

   $ ./fcksum rootdisk
   Backing up original label in label.rootdisk.070314_201917
                   Changing checksum
   0x1fe:          0xe456  =       0x6456
                   Changing dummy field
   0x1b8:          0       =       0x8000
                   Label checksum has been updated
Otherwise if the label does not need to be updated, you will get:
   $ ./fcksum rootdisk
   Backing up original label in label.rootdisk.070314_201005
                   Label checksum is okay
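
The values in the first output are consistent with the usual Sun label checksum, which is the XOR of the 16-bit words of the label: setting bit 0x8000 in the word at 0x1b8 flips the same bit in the checksum stored at 0x1fe (0xe456 XOR 0x8000 = 0x6456). Here is a minimal Python sketch of that arithmetic. It is not the fcksum script, and it assumes an XOR checksum over big-endian 16-bit words; the function names are hypothetical.

```python
import struct

CKSUM_OFFSET = 0x1FE   # checksum word, as shown in the output above
DUMMY_OFFSET = 0x1B8   # "dummy field" word, as shown in the output above


def label_checksum(label):
    """XOR of the 255 big-endian 16-bit words preceding the checksum."""
    if len(label) != 512:
        raise ValueError("a Sun disk label is 512 bytes long")
    checksum = 0
    for (word,) in struct.iter_unpack(">H", label[:CKSUM_OFFSET]):
        checksum ^= word
    return checksum


def set_dummy_field(label, value):
    """Return a copy of the label with the dummy field set to value and
    the checksum recomputed to match."""
    updated = bytearray(label)
    struct.pack_into(">H", updated, DUMMY_OFFSET, value)
    struct.pack_into(">H", updated, CKSUM_OFFSET,
                     label_checksum(bytes(updated)))
    return bytes(updated)
```

Because the checksum is a plain XOR, changing one word of the label changes the checksum by exactly the bits that changed in that word.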

Download the fcksum script (use the Save Link as... option of your browser)


May 03 2007, 11:21:50 AM PDT

From Bad Trap to Hyper Trap
I am doing the yearly update of my blog :-) Hopefully I will update it more frequently now. Actually I have a lot of things to blog about and I really need to take the time to do it.

So first I am changing the name of my blog from "The Bad Trap" to "The Hyper Trap" because I am not dealing with Solaris "Bad Trap" bugs anymore, instead I am now working with the hyper traps of the new SPARC processors. Concretely this means that I have moved from the Solaris Sustaining group to the Niagara/Rock Software Engineering group and I am now part of the Logical Domains development team.

Logical Domains (LDoms) is a brand new product which was released a few days ago. This is a virtualization solution for SPARC systems which allows you to create up to 32 domains on a Sun Fire T1000 or T2000 server. Each domain can then run its own operating system. The virtualization is done using a thin hypervisor layer which is directly implemented in the firmware of the system and which uses some special features of the UltraSPARC-T1 processor (aka Niagara).

With this position and job change, I also moved from France to Menlo Park in California, and I arrived in the US just four weeks ago. The first weeks were very busy with settling in, but things are now getting quieter so I should have more time to blog.


May 03 2007, 10:01:07 AM PDT

20060320 Monday March 20, 2006
Introduction
I am finally taking some time to start my blog. So let's start with a short introduction: my name is Alexandre Chartre, I am based in Grenoble (France) and I am working for the Solaris sustaining team. In short, this means that I am an engineer and that I spend most of my time troubleshooting problems and fixing bugs!

The main product I work on is Sun Cluster. This software offers application service continuity by providing high-availability at the application level: that means that an application can run on several different machines and if a machine becomes unavailable then the application will be automatically switched over to another system with minimal service downtime (well, that's a quick one line summary).

My favorite topics are the Solaris kernel, debugging and troubleshooting. Sun Cluster is therefore a very interesting piece of software to work on because a major part of it is implemented in the kernel and it interacts with many kernel components. So this makes my job very interesting and challenging.


Mar 20 2006, 08:22:47 AM PST