Tuesday February 05, 2008
The virtual disk backend can be a physical disk, a physical disk slice, a file or a volume from a volume management framework (like ZFS, SVM, VxVM...).
A backend is exported from a domain with the command "ldm add-vdsdev" (or "ldm add-vdiskserverdevice"):
# ldm add-vdsdev <backend> <volume_name>@<service_name>
And it is assigned to another domain with the command "ldm add-vdisk":
# ldm add-vdisk <disk_name> <volume_name>@<service_name> <domain>
Note that a backend is only actually exported once the domain <domain> is bound.
Virtual Disk Export Options
There are two ways a backend can be exported as a virtual disk, either as a full disk or as a single slice disk. Currently, the way a backend is exported (either as a full disk or as single slice disk) depends on the type of backend (whether it is a disk, a slice, a file or a volume). The next section (Virtual Disk Backend) explains how each type of backend is exported.
When a backend is exported to a domain as a full disk, it will appear in that domain as a regular disk with 8 slices (s0 to s7). Such a disk is visible with the format(1m) command and its partition table can be changed using either the fmthard(1m) or format(1m) command.
A full disk will also be visible from the Solaris installer and can be selected as a disk device on which Solaris can be installed.
When a backend is exported to a domain as a single slice disk, it will appear in that domain as a disk with a single partition (s0). Such a disk is not visible with the format(1m) command and its partition table can not be changed.
A single slice disk will not be visible from the Solaris installer and can not be selected as a disk device on which Solaris can be installed.
The virtual disk backend is the location where the data of a virtual disk are actually stored. This backend can be a physical disk, a physical disk slice, a file or a volume (ZFS, SVM, VxVM...). The way a backend is exported (either as a full disk or as single slice disk) depends on the type of backend. Remember that it is not possible to install Solaris on a single slice disk.
| Backend | Export | Solaris Installation |
|---|---|---|
| Physical Disk | Full Disk | Possible |
| Physical Disk Slice | Single Slice Disk | Not Possible |
| File | Full Disk | Possible |
| Volume (ZFS, SVM, VxVM...) | Single Slice Disk* | Not Possible* |
(*) This will change once bug 6514091 (vDisk server should export volumes as full disks) is fixed.
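The table above can be read as a simple lookup on the backend path. Here is a minimal sketch of that lookup. The function export_type is a hypothetical helper, not part of ldm, and it only looks at the path string. It assumes that SVM and VxVM volumes live under /dev/md/dsk and /dev/vx/dsk, and that anything outside /dev/dsk and the volume directories is a regular file.

```shell
#!/bin/sh
# Hypothetical helper (not an ldm command): guess how a backend will be
# exported, following the table above. Only the path string is inspected.
export_type() {
    case "$1" in
        /dev/zvol/dsk/*|/dev/md/dsk/*|/dev/vx/dsk/*)
            echo "single slice disk" ;;   # ZFS, SVM or VxVM volume
        /dev/dsk/*s2)
            echo "full disk" ;;           # slice 2 stands for the whole physical disk
        /dev/dsk/*s[0-9]*)
            echo "single slice disk" ;;   # any other physical disk slice
        *)
            echo "full disk" ;;           # assumed to be a regular file
    esac
}

export_type /dev/dsk/c1t48d0s2
export_type /dev/dsk/c1t57d0s0
export_type /ldoms/domain/test/fdisk0
export_type /dev/zvol/dsk/ldoms/domain/test/zdisk0
```

Only backends reported as "full disk" can be used to install Solaris.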
Here is some additional information about each type of backend:
A physical disk is exported as a full disk. In that case, virtual disk drivers (vds and vdc) forward I/Os from the virtual disk and act as a pass-through to the physical disk.
A physical disk can be exported by exporting the slice 2 (s2) of the disk.
Example: exporting a physical disk as a virtual disk
To export the physical disk c1t48d0 as a virtual disk, we have to export the slice 2 of that disk (c1t48d0s2):
# ldm add-vdsdev /dev/dsk/c1t48d0s2 c1t48d0@primary-vds0
Once the disk is exported, it can be assigned to a domain. Here it is
assigned to the domain "test":
# ldm add-vdisk pdisk c1t48d0@primary-vds0 test
Finally the disk is accessible from the guest domain "test" as a full
disk (i.e. a regular disk with 8 slices); here the disk is accessible as c0d1:
# ls -1 /dev/dsk/c0d1s*
/dev/dsk/c0d1s0
/dev/dsk/c0d1s1
/dev/dsk/c0d1s2
/dev/dsk/c0d1s3
/dev/dsk/c0d1s4
/dev/dsk/c0d1s5
/dev/dsk/c0d1s6
/dev/dsk/c0d1s7
A physical disk slice is exported as a single slice disk. In that case, virtual disk drivers (vds and vdc) forward I/Os from the virtual disk and act as a pass-through to the physical disk slice.
Example: exporting a physical disk slice as a virtual disk
To export the slice 0 of the physical disk c1t57d0 as a virtual disk, we have to export the device corresponding to that slice (c1t57d0s0):
# ldm add-vdsdev /dev/dsk/c1t57d0s0 c1t57d0s0@primary-vds0
Once the disk is exported, it can be assigned to a domain. Here it is
assigned to the domain "test":
# ldm add-vdisk pslice c1t57d0s0@primary-vds0 test
Finally the disk is accessible from the guest domain "test" as a single
slice disk (i.e. a disk with only 1 slice: s0); here the disk is accessible
as c0d13:
# ls -1 /dev/dsk/c0d13s*
/dev/dsk/c0d13s0
A file is exported as a full disk. In that case, virtual disk drivers (vds and vdc) forward I/Os from the virtual disk and manage the partitioning of the virtual disk. The file is effectively a disk image storing the data of all slices of the virtual disk.
When a file is exported as a virtual disk and no partitioning information is stored in that file, the system will automatically write a default disk label into the file and define a default partitioning with two slices (0 and 2) covering the entire disk. Note that this behavior will change once bug 6575050 (vds should support unformatted disks) is fixed.
Example: exporting a file as a virtual disk
To export the file /ldoms/domain/test/fdisk0 as a virtual disk, we first have to create it. The size of the file will define the size of the virtual disk. Here we create a 100 MB blank file to get a 100 MB virtual disk:
# mkfile 100m /ldoms/domain/test/fdisk0
Then the file can be directly exported as a virtual disk:
# ldm add-vdsdev /ldoms/domain/test/fdisk0 fdisk0@primary-vds0
Once the file is exported, it can be assigned to a domain. Here it is
assigned to the domain "test":
# ldm add-vdisk fdisk fdisk0@primary-vds0 test
Finally the disk is accessible from the guest domain "test" as a full
disk (i.e. a regular disk with 8 slices); here the disk is accessible as c0d5:
# ls -1 /dev/dsk/c0d5s*
/dev/dsk/c0d5s0
/dev/dsk/c0d5s1
/dev/dsk/c0d5s2
/dev/dsk/c0d5s3
/dev/dsk/c0d5s4
/dev/dsk/c0d5s5
/dev/dsk/c0d5s6
/dev/dsk/c0d5s7
A volume is exported as a single slice disk. In that case, virtual disk drivers (vds and vdc) forward I/Os from the virtual disk and act as a pass-through to the volume.
Example: exporting a ZFS volume as a virtual disk
To export the ZFS volume zdisk0 as a virtual disk, we first have to create it. The size of the volume will define the size of the virtual disk. Here we create a 100 MB volume to get a 100 MB virtual disk:
# zfs create -V 100m ldoms/domain/test/zdisk0
Then we have to export the device corresponding to that ZFS volume:
# ldm add-vdsdev /dev/zvol/dsk/ldoms/domain/test/zdisk0 zdisk0@primary-vds0
Once the volume is exported, it can be assigned to a domain. Here it is
assigned to the domain "test":
# ldm add-vdisk zdisk0 zdisk0@primary-vds0 test
Finally the disk is accessible from the guest domain "test" as a single slice disk
(i.e. a disk with only 1 slice: s0); here the disk is accessible as c0d9:
# ls -1 /dev/dsk/c0d9s*
/dev/dsk/c0d9s0
FAQ
You have probably only exported single slice disks (disk slices or volume). The Solaris installer does not handle single slice disks so it thinks that the system has no disk and the installation fails.
You need to export a full disk (a physical disk or a file) and start the installation again.
You can export a physical CDROM/DVD like you export a physical disk by exporting the slice 2 (s2) of the CDROM/DVD. However the exported CDROM/DVD will be seen as a regular disk and not as a CDROM/DVD, and you can access the content of the CDROM/DVD from Solaris but you can not boot from that CDROM/DVD.
As a consequence you can export a Solaris CDROM/DVD but you can not use it to install Solaris by booting the exported CDROM/DVD. So you can not install a guest domain from a CDROM/DVD. This will be improved when bug 6434615 (vDisk needs to support booting/installing from DVDs) is fixed.
You are probably exporting as a virtual disk a backend that is not accessible or that can not be exported (for example a file that does not exist). On the service domain, check the /var/adm/messages file for any error messages from the vds driver. This should give you some hints about what is wrong with which backend.
For example, a message like this one:
vds: [ID 877446 kern.info] vd_setup_vd(): /ldoms/domain/test/fdisk/fdisk01 is currently inaccessible (error 2)
means that /ldoms/domain/test/fdisk/fdisk01 can not be exported because it does not exist (error 2 = ENOENT = No such file or directory, see "man -s2 intro").
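To spot such messages quickly, the backend path and the error number can be extracted with sed. This is only a sketch: vds_inaccessible is a hypothetical helper name, and the pattern is derived from the single message quoted above, so other vds messages may be worded differently.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print "<backend> <errno>"
# for every vds "is currently inaccessible" message.
vds_inaccessible() {
    sed -n 's/.*vd_setup_vd(): \(.*\) is currently inaccessible (error \([0-9][0-9]*\)).*/\1 \2/p'
}

# Sample line copied from the message shown above.
msg='vds: [ID 877446 kern.info] vd_setup_vd(): /ldoms/domain/test/fdisk/fdisk01 is currently inaccessible (error 2)'
echo "$msg" | vds_inaccessible
```

On the service domain, the same function could be fed from the log file, for example with "vds_inaccessible < /var/adm/messages".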
When you export a file as a virtual disk, you may want to access the content of the disk image when the guest domain is down and you may want to mount one of the slices defined in the disk image (i.e. in the file). Unfortunately this is currently not possible.
lofi is currently not able to deal with a disk image, and it will present the disk as a single slice. If a slice is defined at the beginning of the file (i.e. at offset 0 of the virtual disk) then you may be able to access that slice using lofi, but any other slice will be inaccessible. This should be improved once bug 4765069 (RFE: lofiadm should be VTOC aware) is fixed.
Currently the only way to access any slice of a disk image (file) is to create a guest domain, export the file as a virtual disk to that domain and access the corresponding slice of the virtual disk from the guest domain.
Solaris 10 8/07 and patch 120011-14 contain several other fixes for LDoms, you can check Liam's blog for details.
So if you want to set up a split PCI configuration on this newer Sun Fire T2000 you have
to add either a Fiber Channel or a SCSI host adapter in one of the PCI-E or PCI-X slots
on bus pci@7c0 (bus_b) like this:
So I wrote a small program (vdlinux) which corrects these invalid values. It just has to be run on a Linux disk image and then that disk image can be used with LDoms without having to do any fancy tricks (like changing the disk label before and after binding the domain). The utility also applies the workaround for bug 6544963 if needed.
# vdlinux bootlinux
Incorrect number of partition (0).
Updating number of partition to 8.
Incorrect vtoc sanity (0).
Correcting vtoc sanity (600ddeee).
Applying workaround for bug 6544963.
Updating label checksum.
You only need to run the program once on your Linux disk image, so it is best to do it right after you have generated your disk image and unbound the domain. If you run the program again, it will just say that the label is correct:
# vdlinux bootlinux
Label looks correct.
Then you can start your Linux domain as a regular domain without having to care about the label of the Linux disk image.
The vdlinux utility is available here:
The source file can be compiled with the following command:
cc -o vdlinux vdlinux.c
I/O domains
Logical domains which have direct access to the hardware are called I/O domains. You will always have at least one I/O domain: the first domain created on the system, i.e. the primary domain. You can then create additional I/O domains by removing some hardware resources from the primary domain and assigning them to another domain. The number of I/O domains you can create depends on the hardware resources available on your system, so it ultimately depends on the type of system you are using.
PCI buses on Sun Fire T2000
Let's look at the Sun Fire T2000 Server for a concrete example. On this system, the smallest hardware resource you can assign to a domain is an entire PCI bus; and the Sun Fire T2000 has only two PCI buses hence you can create a maximum of two I/O domains.
The two PCI buses of the Sun Fire T2000 server are initially assigned to the primary domain. The buses are identified as pci@780 (or bus_a) and pci@7c0 (or bus_b) and they are connected to the following devices:
As you can see, both buses have two network interfaces, but other resources are not so evenly spread: pci@7c0 (bus_b) has all the internal disks, the DVD-ROM and 4 PCI slots while pci@780 (bus_a) has only one PCI slot.
So creating an I/O domain with bus pci@7c0 (bus_b) is not a problem, because you get all the basic hardware resources you need (i.e. a disk and a network interface). But when using bus pci@780 (bus_a), you only get some network interfaces and no disk. Hence to create an I/O domain with pci@780 (bus_a) you will have to add a PCI-E card (either a Fiber Channel or a SCSI host adapter) in the PCI-E slot 0 to get access to some storage devices. You also have to ensure that the card you are adding can be used to boot the system.
Configuration of the primary domain
Initially both PCI buses are assigned to the primary domain. You can verify this with the "ldm list-bindings" command:
primary# ldm list-bindings primary
...
IO: pci@780 (bus_a)
pci@7c0 (bus_b)
...
However to be able to split the PCI buses, the primary domain should be using
devices from only one PCI bus and, most of the time, you will use devices from bus
pci@7c0 (bus_b) because the system disk of the primary domain is an internal disk.
You can check the disks used by looking at the path of the disk devices:
primary# ls -l /dev/dsk
...
lrwxrwxrwx 1 root root 65 Feb 2 17:19 /dev/dsk/c1t0d0s0 -> ../../devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@0,0:a
...
You have to ensure that the system disk is on bus pci@7c0 (bus_b) and that any disk on bus pci@780 (bus_a) is not being used by the primary domain.
You also have to check that the network interfaces used by the primary domain are also on bus pci@7c0 (bus_b). To do so, look at the path of the e1000g interfaces:
primary# ls -l /dev/e1000g*
You have to ensure that the network interfaces you are using (especially the primary network interface) are on bus pci@7c0 (bus_b).
If your primary network interface (for example e1000g0) is not on bus pci@7c0 (bus_b) then you will have to reconfigure your system so that it uses another interface (for example e1000g2) which has to be on bus pci@7c0 (bus_b). If you have to change the network interface, don't forget to correctly reconnect the network cables (for example move the network cable from e1000g0 to e1000g2).
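Both checks come down to reading the first pci@... component of the /devices path behind a device link. The following sketch does that with sed. The helper name pci_bus_of is hypothetical, and the sample path is the disk link target shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: print the root PCI bus (e.g. pci@780 or pci@7c0)
# found in a /devices path, such as the target of a /dev/dsk symlink.
pci_bus_of() {
    echo "$1" | sed -n 's|.*/devices/\(pci@[0-9a-f]*\)/.*|\1|p'
}

pci_bus_of '../../devices/pci@7c0/pci@0/pci@1/pci@0,2/LSILogic,sas@2/sd@0,0:a'
```

Because the pattern ignores anything before /devices/, a whole "ls -l" output line for a disk or e1000g link can be passed as the argument. Any device that reports pci@780 is on bus_a and must not be in use by the primary domain before that bus is removed.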
After checking that the primary domain is only using devices from bus pci@7c0 (bus_b), you can remove bus pci@780 (bus_a) from the configuration of the primary domain. This can be done using the "ldm remove-io" command:
primary# ldm remove-io bus_a primary
The reconfiguration is not immediate and you will have to reboot the primary domain for the removal of pci@780 (bus_a) to take effect. After the primary domain is rebooted, you can check that it now only owns bus pci@7c0 (bus_b):
primary# ldm list-bindings primary
...
IO: pci@7c0 (bus_b)
...
Configuration of the alternate I/O domain
Now that PCI bus pci@780 (bus_a) is available, you can assign it to another domain. To do so, you just have to use the "ldm add-io" command while configuring your alternate domain:
primary# ldm create alternate
primary# ldm set-vcpu 4 alternate
primary# ldm set-mem 4G alternate
primary# ldm add-io bus_a alternate
This creates an alternate I/O domain with 4 cpus, 4GB of memory and the PCI bus pci@780 (bus_a). After the alternate domain is configured, it can be started as a regular domain with the "ldm bind" and "ldm start" commands:
primary# ldm bind alternate
primary# ldm start alternate
When the alternate domain is bound, you can check that it is using bus pci@780 (bus_a):
primary# ldm list-bindings alternate
...
IO: pci@780 (bus_a)
...
You can then connect to the console of that domain to install it. The installation can be
done through the network with a "boot net", as when installing a regular SPARC system.
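The whole sequence for the alternate domain can be summarized as a dry run that only prints the commands. In this sketch, plan_io_domain is a hypothetical helper. It runs nothing and assumes the bus has already been removed from the primary domain and that the primary domain has been rebooted.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not run) the ldm commands used above
# to set up an alternate I/O domain.
# Arguments: domain name, number of vcpus, memory size, PCI bus name.
plan_io_domain() {
    dom=$1
    vcpus=$2
    mem=$3
    bus=$4
    echo "ldm create $dom"
    echo "ldm set-vcpu $vcpus $dom"
    echo "ldm set-mem $mem $dom"
    echo "ldm add-io $bus $dom"
    echo "ldm bind $dom"
    echo "ldm start $dom"
}

plan_io_domain alternate 4 4G bus_a
```

Printing the plan first makes it easy to review the domain name and the bus before typing the commands on the primary domain.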
Differences on Sun Fire T1000
You can set up the same configuration on a Sun Fire T1000 Server. The Sun Fire T1000 has two PCI buses similar to the two PCI buses of the Sun Fire T2000: pci@780 (bus_a) and pci@7c0 (bus_b). But the Sun Fire T1000 has no PCI-E and PCI-X slots on bus pci@7c0 (bus_b). Fortunately it still has PCI-E slot 0 on bus pci@780 (bus_a), which can be used to plug in an FC or SCSI host adapter to connect some storage for the alternate domain.
Virtual I/O Failover
Once you have more than one I/O domain, you can configure virtual I/O failover for guest domains. Check out Narayan's blog for details: Part One and Part Two.
Linux on UltraSPARC-T2
Anyway, it eventually works and it works fine. You can see the result with a demo on Ash's blog. And note that the demo and the tests have been done on a system with an UltraSPARC-T2 processor, so Linux does work with the UltraSPARC-T2. Here is a log of the boot sequence:
{0} ok boot
Boot device: rootdisk File and args:
SILO Version 1.4.13
boot: linux.2623
Allocated 8 Megs of memory at 0x40000000 for kernel
Loaded kernel version 2.6.23
Remapping the kernel... done.
OF stdout device is: /virtual-devices@100/console@1
Booting Linux...
[585488.953894] VIO: Adding device channel-devices
[585488.954061] VIO: Adding device vnet-port-0-0
[585488.954202] VIO: Adding device vdc-port-0-0
[585488.954350] VIO: Adding device ds-0
... snip ...
* Running local boot scripts (/etc/rc.local) [ OK ]
Ubuntu gutsy (development branch) t2k-linux1 ttyS0
t2k-linux1 login: root
Password:
Last login: Fri Aug 10 10:48:13 2007 on ttyS0
Linux t2k-linux1 2.6.23-rc1 #1 SMP Sun Jul 29 21:19:34 PDT 2007 sparc64
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
root@t2k-linux1:~# uname -a
Linux t2k-linux1 2.6.23-rc1 #1 SMP Sun Jul 29 21:19:34 PDT 2007 sparc64 GNU/Linux
root@t2k-linux1:~# grep CPU /proc/cpuinfo | wc -l
60
root@t2k-linux1:~# cat /proc/cpuinfo
cpu : UltraSparc T1 (Niagara)
fpu : UltraSparc T1 integrated FPU
prom : OBP 4.27.0.build_03***PROTOTYPE BUILD*** 2007/07/27 18:48
type : sun4v
ncpus probed : 60
ncpus active : 60
D$ parity tl1 : 0
I$ parity tl1 : 0
... snip ...
MMU Type : Hypervisor (sun4v)
State:
CPU0: online
CPU1: online
CPU2: online
CPU3: online
... snip ...
CPU57: online
CPU58: online
CPU59: online
Linux identifies the processor as an UltraSPARC-T1 but this is really an UltraSPARC-T2.
The evidence is that the UltraSPARC-T1 has only
32 threads and here we have a Linux domain running with 60 (yes 60!) cpus. The
UltraSPARC-T2 has 64 threads and this system was configured with a primary domain
running Solaris with 4 cpus and a guest domain running Linux with 60 cpus. Note
that a Linux 2.6.23 kernel is required to boot on the UltraSPARC-T2 processor.
Linux Domain bind/start Tricks
When you have a Linux disk image, Dave Miller and Fabio mention some tricky steps to be able to boot from that disk image because the LDoms virtual disk server mangles partition tables.
Here is a simpler procedure: let's say you have a Linux disk image /ldoms/disklinux and you have configured the domain linux-domain to use that image. If you just do an "ldm bind" and "ldm start" of the linux-domain, Linux will not boot correctly. What you need to do is:
# dd if=/ldoms/disklinux of=/ldoms/labellinux count=1
# dd if=/dev/zero of=/ldoms/disklinux count=1 conv=notrunc
# ldm bind linux-domain
# dd if=/ldoms/labellinux of=/ldoms/disklinux count=1 conv=notrunc
# ldm start linux-domain
And then you can start Linux.
Why do we need to do that? On the Linux disk image, you will have a fake Sun VTOC disk label that defines 0 partitions. If you directly bind and start the Linux domain with this disk label then the virtual disk server will read the label and, accordingly, it will see that no partition is defined. Then later, when Linux starts, it will ask the virtual disk server to read from slice 2, but as no partition is defined the virtual disk server will return an error and Linux will be unable to read from the disk.
When we erase the disk label and bind the domain, the virtual disk server will create a default partitioning with partition 2 representing the entire disk. After the domain is bound the virtual disk server will not read the disk label again, so the original label can be restored. Then, when Linux reads from slice 2, there will be no problem because the virtual disk server now knows about slice 2.
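The save, erase and restore steps can be tried safely on a scratch file before touching a real disk image. The sketch below runs no ldm command. The file names come from mktemp, and the "FAKE-LABEL" marker is just a stand-in for the real label in the first 512-byte block.

```shell
#!/bin/sh
# Scratch-file demonstration of the label save/erase/restore steps.
# "img" merely stands in for the Linux disk image.
img=$(mktemp); label=$(mktemp); orig=$(mktemp)

# Build a 4-block image and put a recognisable marker in the first block.
dd if=/dev/zero of="$img" count=4 2>/dev/null
printf 'FAKE-LABEL' | dd of="$img" conv=notrunc 2>/dev/null
cp "$img" "$orig"

dd if="$img" of="$label" count=1 2>/dev/null                 # save the label
dd if=/dev/zero of="$img" count=1 conv=notrunc 2>/dev/null   # erase the label
# ... "ldm bind" would happen here, while the label is blank ...
dd if="$label" of="$img" count=1 conv=notrunc 2>/dev/null    # restore the label

# The image must be byte-for-byte identical to what we started with.
if cmp -s "$img" "$orig"; then status=restored; else status=differs; fi
echo "$status"
rm -f "$img" "$label" "$orig"
```

The conv=notrunc option is what keeps the rest of the image intact: without it, dd would truncate the file to the single block being written.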
This will be improved in a future version of the virtual disk server driver, and probably with a change in the Linux virtual disk driver, so that none of these tricks are required to start a Linux domain.
The good news is that bug 6531557 (format(1m) does not work with virtual disks) has just been fixed in Solaris Nevada and OpenSolaris. So the format(1m) command now works with virtual disks in an LDoms guest domain. The fix for Solaris 10 should come later as a patch.
Note that some format(1m) sub-commands will still not work because they only work with SCSI disks, and virtual disks do not currently appear as SCSI disks (even when a virtual disk is created from a physical SCSI disk).
The following shows which format(1m) sub-commands work with virtual disks and which do not:
FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
Commands in green work with virtual disks.
More good news is that this also fixes some underlying problems, such as doing I/O using an absolute disk offset or providing the correct virtual disk size. These problems did not affect end users, but they were causing trouble for developers such as Dave Miller and Fabio, who are working hard to get Linux running with LDoms.
Although this fix does not solve all the problems Dave and Fabio are facing, it lays the foundation for a future fix which should hopefully solve everything by introducing support for unformatted disks (bug 6575050). This will avoid the hacks currently required to use a Linux disk image.
Problem
In some cases, when a file is used as a virtual disk, the label of that virtual disk can be lost when rebinding a domain (ldm bind) using that file (or a copy of that file) as a virtual disk.
For example, if a domain uses a file as a virtual disk and Solaris is installed on that virtual disk (using boot net), then everything will run without any problem as long as the domain is not unbound. If the domain is unbound (using ldm unbind) then the label on the file used as a virtual disk might be lost the next time a domain using that file is bound (using ldm bind). In such a situation, the newly bound domain will be unable to use the system installed on the virtual disk: it will fail to boot with an error like "the file does not appear to be bootable", or it might fail to mount the root filesystem with an error like "vfs_mountroot: cannot mount root".
This problem is referenced as bug 6544963 (vdisk can lose label when rebinding domain).
Workaround
To prevent this problem, you need to use the following script fcksum. This script will check if you are in the case where the label will not be correctly validated during the next "ldm bind", and if this is the case it will change the label and its checksum so that it can be correctly validated. The script should be run on any file that has been used as a virtual disk and for which the disk label or disk partitioning has been changed. The script should be run right after the domain using the virtual disk is unbound (ldm unbind) for the first time.
For example, if the file filedisk is used by a domain as a virtual disk and Solaris is being installed onto that virtual disk, then you should run the script after doing the first "ldm unbind" on that domain. Note that the script should be run before doing any "ldm bind", otherwise you can lose the disk label.
The syntax to run the script is: ./fcksum filedisk
Note that the script will first back up the existing label of the file in a file named label.file.day_time.
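A backup like this can also be made by hand before experimenting with a label. The following is only a sketch of such a backup step, not the fcksum source. It assumes that the label is the first 512-byte block of the file and that the day_time suffix has the form yymmdd_HHMMSS. The function name backup_label is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a label backup; this is NOT the fcksum script.
# Copies the first 512-byte block of a file to
# label.<file name>.<yymmdd_HHMMSS> in the current directory.
backup_label() {
    file=$1
    backup="label.$(basename "$file").$(date +%y%m%d_%H%M%S)"
    dd if="$file" of="$backup" bs=512 count=1 2>/dev/null && echo "$backup"
}
```

Calling "backup_label rootdisk" prints the name of the backup file it created. The label can later be put back with dd and conv=notrunc, so that the rest of the disk image is left untouched.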
Here is the output you will get if your label is updated:
$ ./fcksum rootdisk
Backing up original label in label.rootdisk.070314_201917
Changing checksum
0x1fe: 0xe456 = 0x6456
Changing dummy field
0x1b8: 0 = 0x8000
Label checksum has been updated
Otherwise if the label does not need to be updated, you will get:
$ ./fcksum rootdisk
Backing up original label in label.rootdisk.070314_201005
Label checksum is okay
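The two changes reported in the first output are consistent with how a Sun disk label checksum works: it is the XOR of all the 16-bit words of the 512-byte label, and it is stored in the last two bytes (offset 0x1fe). Setting bit 0x8000 in another word of the label therefore flips the same bit in the checksum. The arithmetic below uses the values from that output. Why the script wants that particular bit changed is not covered here.

```shell
#!/bin/sh
# Values taken from the fcksum output above.
old_cksum=0xe456
dummy_delta=0x8000     # the dummy field at 0x1b8 changed from 0 to 0x8000

# XOR-ing the change into the old checksum gives the new checksum.
new_cksum=$(printf '0x%x' $(( $old_cksum ^ $dummy_delta )))
echo "$new_cksum"
```

This prints 0x6456, which matches the new checksum value reported by the script.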
Download the fcksum script (use the Save Link as... option of your browser)
Hopefully I will update it more frequently now. Actually I have a lot of things to blog about and I really need to take the time to do it.
So first I am changing the name of my blog from "The Bad Trap" to "The Hyper Trap" because I am not dealing with Solaris "Bad Trap" bugs anymore, instead I am now working with the hyper traps of the new SPARC processors. Concretely this means that I have moved from the Solaris Sustaining group to the Niagara/Rock Software Engineering group and I am now part of the Logical Domains development team.
Logical Domains (LDoms) is a brand new product which was released a few days ago. It is a virtualization solution for SPARC systems which allows you to create up to 32 domains on a Sun Fire T1000 or T2000 server. Each domain can then run its own operating system. The virtualization is done using a thin hypervisor layer which is directly implemented in the firmware of the system and which uses some special features of the UltraSPARC-T1 processor (aka Niagara).
With this change of position and job, I also moved from France to Menlo Park in California; I arrived in the US just four weeks ago. These first weeks were very busy with settling in, but things are now getting quieter, so I should have more time to blog.
The main product I work on is Sun Cluster. This software offers application service continuity by providing high-availability at the application level: that means that an application can run on several different machines and if a machine becomes unavailable then the application will be automatically switched over to another system with minimal service downtime (well, that's a quick one line summary).
My favorite topics are the Solaris kernel, debugging and troubleshooting. Sun Cluster is therefore very interesting software to work on because a major part of it is implemented in the kernel and it interacts with many kernel components. This makes my job very interesting and challenging.