
Friday October 09, 2009
What's New in Solaris 10 10/09
Solaris 10 10/09 (u8) is now available for download at
http://sun.com/solaris/get.jsp. DVD ISO images (full and segments that can be reassembled after download) are available for both SPARC and x86.
Here are a few of the new features in this release that caught my attention.
Packaging and Patching
Improved performance of SVR4 package commands: Improvements have been made in the SVR4 package commands (pkgadd, pkgrm, pkginfo, et al.). The impact of these can be seen in drastically reduced zone installation time. How much of an improvement, you ask? (You know I have to answer with some data, right?)
# cat /etc/release; uname -a
Solaris 10 5/09 s10x_u7wos_08 X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 30 March 2009
SunOS chapterhouse 5.10 Generic_141415-09 i86pc i386 i86pc
# time zoneadm -z zone1 install
Preparing to install zone <zone1>.
Creating list of files to copy from the global zone.
Copying <2905> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1453> packages on the zone.
Initialized <1453> packages on zone.
Zone is initialized.
Installation of these packages generated errors:
The file contains a log of the zone installation.
real 5m48.476s
user 0m45.538s
sys 2m9.222s
# cat /etc/release; uname -a
Solaris 10 10/09 s10x_u8wos_08a X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 16 September 2009
SunOS corrin 5.10 Generic_141445-09 i86pc i386 i86pc
# time zoneadm -z zone1 install
Preparing to install zone <zone1>.
Creating list of files to copy from the global zone.
Copying <2915> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1432> packages on the zone.
Initialized <1432> packages on zone.
Zone is initialized.
Installation of these packages generated errors:
The file contains a log of the zone installation.
real 3m4.677s
user 0m44.593s
sys 0m48.003s
OK, that's pretty impressive. A zone installation on Solaris 10 10/09 takes about half the time it does on Solaris 10 5/09 (roughly 3m05s vs 5m48s of elapsed time). It is also worth noting the rather large reduction in system time (48 seconds vs 129 seconds).
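To put numbers on that comparison, here is a minimal sketch that converts the real and sys figures printed by time above into seconds and reports the ratio between the two releases. The helper names to_seconds and ratio are hypothetical and exist only for this illustration; the sketch uses nothing beyond the shell and awk.

```shell
# Convert the "XmY.YYYs" format printed by time into plain seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# ratio OLD NEW: elapsed time of NEW as a fraction of OLD.
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", b / a }'
}

ratio 5m48.476s 3m4.677s     # real time, 10/09 relative to 5/09
ratio 2m9.222s 0m48.003s     # system time, 10/09 relative to 5/09
```

With the timings from the two runs above, the real time ratio works out to roughly 0.53 and the system time ratio to roughly 0.37.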
Zones parallel patching: Before Solaris 10 10/09 the patching process was single-threaded, which could lead to prolonged patching times on a system with several non-global zones. Starting with this update you can specify the number of threads used to patch a system with zones. Enable this feature by assigning a value to num_proc in /etc/patch/pdo.conf. The number of threads is capped at 1.5 times the number of on-line CPUs; a lower value of num_proc limits it further.
This feature is also available by applying Solaris patches 119254-66 (SPARC) or 119255-66 (x86).
For more information on the effects of zone parallel patching, see Container Guru Jeff Victor's excellent Patching Zones Goes Zoom.
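As a rough illustration of how the num_proc setting and the CPU-based cap interact, the sketch below computes the number of patching threads that would be in effect. The function effective_threads is hypothetical, and rounding the 1.5x cap down to a whole number is my assumption; the text above only states the 1.5x limit.

```shell
# effective_threads NUM_PROC ONLINE_CPUS
# Prints the smaller of num_proc and 1.5 x the number of on-line CPUs.
# Assumption: the cap is rounded down to a whole number.
effective_threads() {
    num_proc=$1
    ncpu=$2
    cap=$(( ncpu * 3 / 2 ))
    if [ "$num_proc" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$num_proc"
    fi
}

effective_threads 8 4      # 4 CPUs: the cap of 6 applies
effective_threads 4 16     # 16 CPUs: num_proc is the limit
```

On a live system the on-line CPU count could probably be taken from psrinfo output, but that step is left out here so the sketch runs anywhere.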
ZFS Enhancements
Flash archive install into a ZFS root filesystem: ZFS support for the root file system was introduced in Solaris 10 10/08 but the install tools did not work with flash archives. Solaris 10 10/09 provides the ability to install a flash archive created from an existing ZFS root system. This capability is also provided by patches 119534-15 + 124630-26 (SPARC) or 119535-15 + 124631-27 (x86) that can be applied to a Solaris 10 10/08 or later system. There are still a few limitations: the flash source must be from a ZFS root system and you cannot use differential archives. More information can be found in Installing a ZFS Root File System (Flash Archive Installation).
Set ZFS properties on the initial zpool file system: Prior to Solaris 10 10/09, ZFS file system properties could only be set after the pool's initial file system had been created. This made it impossible to create a pool with the same name as an existing mounted file system, or to have replication or compression in effect from the moment the pool was created. In Solaris 10 10/09 you can set any ZFS file system property at pool creation time using zpool create -O, repeating the option once for each property.
# zpool create -O mountpoint=/data -O copies=3 -O compression=on datapool c1t1d0 c1t2d0
ZFS Read Cache (L2ARC): You now have the ability to add second-level read cache devices to a ZFS zpool. This can improve the read performance of ZFS as well as reduce the ZFS memory footprint.
L2ARC devices are added as cache vdevs to a pool. In the following example we will create a pool of 2 mirrored devices, 2 cache devices and a spare.
# zpool create datapool mirror c1t1d0 c1t2d0 cache c1t3d0 c1t4d0 spare c1t5d0
# zpool status datapool
  pool: datapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        cache
          c1t3d0    ONLINE       0     0     0
          c1t4d0    ONLINE       0     0     0
        spares
          c1t5d0    AVAIL

errors: No known data errors
So what do ZFS cache devices do? Rather than go into a lengthy explanation of the L2ARC, I would rather refer you to Fishworks developer Brendan Gregg's excellent treatment of the subject.
Unlike the intent log (ZIL), L2ARC cache devices can be added and removed dynamically.
# zpool remove datapool c1t3d0
# zpool remove datapool c1t4d0
# zpool status datapool
  pool: datapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        spares
          c1t5d0    AVAIL

errors: No known data errors
# zpool add datapool cache c1t3d0
# zpool status datapool
  pool: datapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        cache
          c1t3d0    ONLINE       0     0     0
        spares
          c1t5d0    AVAIL

errors: No known data errors
New cache control properties: Two new ZFS properties are introduced with Solaris 10 10/09. These control what is stored (nothing, data + metadata, or metadata only) in the ARC (memory) and L2ARC (external) caches. These new properties are
- primarycache - controls what is stored in the memory resident ARC cache
- secondarycache - controls what is stored in the L2ARC
and they can take the values
- none - the caches are not used
- metadata - only file system metadata is cached
- all - both file system data and metadata are stored in the associated cache
# zpool create -O primarycache=metadata -O secondarycache=all datapool c1t1d0 c1t2d0 cache c1t3d0
There are workloads such as databases that perform better or make more efficient use of memory if the system is not competing with the caches that the applications are maintaining themselves.
User and group quotas: ZFS has always had quotas and reservations but they were applied at the file system level. Achieving user or group quotas would require creating additional file systems, which might make administration more complex. Starting with Solaris 10 10/09 you can apply both user and group quotas to a file system much like you would with UFS. The zpool must be at version 15 or later and the ZFS file system must be at version 4 or later.
Let's create a file system and see if we are at the proper versions to set quotas.
# zfs create rpool/newdata
# chown bobn:local /rpool/newdata
# zpool get version rpool
NAME PROPERTY VALUE SOURCE
rpool version 18 default
# zpool upgrade -v
This system is currently running ZFS pool version 18.
The following versions are supported:
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 snapshot user holds
For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/N
Where 'N' is the version number.
# zfs get version rpool/newdata
NAME PROPERTY VALUE SOURCE
rpool/newdata version 4
# zfs upgrade -v
The following filesystem versions are supported:
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS filesystem version
2 Enhanced directory entries
3 Case insensitive and File system unique identifier (FUID)
4 userquota, groupquota properties
For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/zpl/N
Where 'N' is the version number.
Excellent. Now let's set a user and group quota and see what happens. We'll set a group quota of 1GB and a user quota at 2GB.
# zfs set groupquota@local=1g rpool/newdata
# zfs set userquota@bobn=2g rpool/newdata
# su - bobn
% mkfile 500M /rpool/newdata/file1
% mkfile 500M /rpool/newdata/file2
% mkfile 500M /rpool/newdata/file3
file3: initialized 40370176 of 524288000 bytes: Disc quota exceeded
As expected, we have exceeded our group quota. Let's change the group of the existing files and see if we can proceed to our user quota.
% rm /rpool/newdata/file3
% chgrp sales /rpool/newdata/file1 /rpool/newdata/file2
% mkfile 500m /rpool/newdata/file3
Could not open /rpool/newdata/file3: Disc quota exceeded
Whoa! What's going on here? Relax - ZFS does things asynchronously unless told otherwise. We should have noticed this earlier, when the first mkfile for file3 actually started writing before it ran into the quota. ZFS hadn't quite caught up with the current usage. A good sync should do the trick.
% sync
% mkfile 500M /rpool/newdata/file3
% mkfile 500M /rpool/newdata/file4
% mkfile 500M /rpool/newdata/file5
/rpool/newdata/file5: initialized 140247040 of 524288000 bytes: Disc quota exceeded
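The arithmetic behind both failures is easy to check. The sketch below converts the sizes used in the mkfile runs above to bytes and tests whether a given number of files fits under a quota. The helpers to_bytes and fits are hypothetical and only do arithmetic; they do not query ZFS.

```shell
# to_bytes SIZE: convert sizes such as 500M or 1G to bytes (powers of 1024).
to_bytes() {
    echo "$1" | awk '{
        n = $1 + 0                               # leading numeric part
        u = toupper(substr($1, length($1)))      # trailing unit letter, if any
        if (u == "K") n *= 1024
        else if (u == "M") n *= 1024 * 1024
        else if (u == "G") n *= 1024 * 1024 * 1024
        printf "%.0f\n", n
    }'
}

# fits QUOTA FILE_SIZE COUNT: do COUNT files of FILE_SIZE fit under QUOTA?
fits() {
    awk -v q="$(to_bytes "$1")" -v s="$(to_bytes "$2")" -v c="$3" \
        'BEGIN { if (s * c <= q) print "yes"; else print "no" }'
}

to_bytes 500M       # the file size mkfile reports in its error message
fits 1G 500M 2      # two files fit under the 1G group quota
fits 1G 500M 3      # the third one does not
fits 2G 500M 5      # and the fifth crosses the 2G user quota
```

Two 500M files leave less than 500M of headroom under the 1G group quota, so the third write has to fail partway through; the same reasoning explains why the fifth file trips the 2G user quota.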
Great. We now have user and group quotas. How can I find out what I have used against my quota ?
There are two new ZFS properties, userused and groupused that will show what the group or user is currently consuming.
% zfs get userquota@bobn,userused@bobn rpool/newdata
NAME PROPERTY VALUE SOURCE
rpool/newdata userquota@bobn 2G local
rpool/newdata userused@bobn 1.95G local
% zfs get groupquota@local,groupused@local rpool/newdata
NAME PROPERTY VALUE SOURCE
rpool/newdata groupquota@local 1G local
rpool/newdata groupused@local 1000M local
% zfs get groupquota@sales,groupused@sales rpool/newdata
NAME PROPERTY VALUE SOURCE
rpool/newdata groupquota@sales none local
rpool/newdata groupused@sales 1000M local
% zfs get groupquota@scooby,groupused@scooby rpool/newdata
NAME PROPERTY VALUE SOURCE
rpool/newdata groupquota@scooby - -
rpool/newdata groupused@scooby -
New space usage properties: Four new usage properties have been added to ZFS file systems.
- usedbychildren (usedchild) - this is the amount of space that is used by all of the children of the specified dataset
- usedbydataset (usedds) - this is the amount of space used by the dataset itself, which would be freed if the dataset were destroyed (after first destroying its snapshots and removing any refreservation)
- usedbyrefreservation (usedrefreserv) - this is the amount of space that would be freed if the dataset's refreservation were removed
- usedbysnapshots (usedsnap) - the total amount of space that would be freed if all of the snapshots of this dataset were deleted
# zfs get all datapool | grep used
datapool used 5.39G -
datapool usedbysnapshots 19K -
datapool usedbydataset 26K -
datapool usedbychildren 5.39G -
datapool usedbyrefreservation 0 -
These new properties can also be viewed in a nice tabular form using zfs list -o space.
# zfs list -r -o space datapool
NAME           AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
datapool        480M  5.39G       19K     26K              0      5.39G
datapool@now       -    19K         -       -              -          -
datapool/fs1    480M   400M         0    400M              0          0
datapool/fs2   1.47G  1.00G         0   1.00G              0          0
datapool/fs3    480M    21K         0     21K              0          0
datapool/fs4   2.47G      0         0       0              0          0
datapool/vol1  1.47G     1G         0     16K          1024M          0
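The four usedby* properties decompose the used property: used is the sum of usedbysnapshots, usedbydataset, usedbyrefreservation and usedbychildren. The sketch below checks that against two rows of the table above. The helpers to_bytes and used_total are hypothetical, and because zfs list rounds its output, sums built from the displayed values only approximate the USED column.

```shell
# to_bytes SIZE: convert sizes such as 16K, 1024M or 1G to bytes.
to_bytes() {
    echo "$1" | awk '{
        n = $1 + 0
        u = toupper(substr($1, length($1)))
        if (u == "K") n *= 1024
        else if (u == "M") n *= 1024 * 1024
        else if (u == "G") n *= 1024 * 1024 * 1024
        printf "%.0f\n", n
    }'
}

# used_total USEDSNAP USEDDS USEDREFRESERV USEDCHILD: sum the parts in bytes.
used_total() {
    total=0
    for size in "$@"; do
        total=$(( total + $(to_bytes "$size") ))
    done
    echo "$total"
}

used_total 0 400M 0 0       # datapool/fs1: all of it is the dataset itself
used_total 0 16K 1024M 0    # datapool/vol1: almost all of it is refreservation
to_bytes 1G                 # the USED value shown for datapool/vol1
```

For datapool/vol1 the parts add up to 16K more than 1G, which zfs list displays as 1G after rounding.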
Miscellaneous
Support for 2TB boot disks: Solaris 10 10/09 supports a disk Volume Table of Contents (VTOC) of up to 2TB in size. The previous maximum VTOC size was 1TB. On x86 systems you must be running Solaris with a 64-bit kernel and have at least 1GB of memory to use a VTOC larger than 1TB.
pcitool: A new command for Solaris that can assign interrupts to specific threads or display the current interrupt routing. This command is available for both SPARC and x86.
New iSCSI initiator SMF service: svc:/network/iscsi/initiator:default is a new Service Management Facility (SMF) service to control discovery and enumeration of iSCSI devices early in the boot process. Other boot services that may require iSCSI services can add dependencies to ensure that the devices are available before being needed.
Device Drivers
The following device drivers are either new to Solaris or have had some new features or chipsets added.
- MPxIO support for the LSI 6180 Controller
- LSI MPT 2.0 SAS 2.0 controllers (mpt_sas)
- Broadcom NetXtreme II gigabit Ethernet (bcm5716c and bcm5716s) controllers
- Interrupt remapping for Intel VT-d enabled processors
- Support for SATA AHCI tape
- Sun StorageTek 6Gb/s SAS RAID controller and LSI MegaRAID 92xx (mr_sas)
- Intel 82598 and 82599 10Gb/s PCIe Ethernet controller
Open Source Software Updates
The following open source packages have been updated for Solaris 10 10/09.
- NTP 4.2.5
- PostgreSQL versions 8.1.17, 8.2.13 and 8.3.7
- Samba 3.0.35
For more information
A complete list of new features and changes can be found in the Solaris 10 10/09 Release Notes and the What's New in Solaris 10 10/09 documentation at docs.sun.com.

Thursday May 21, 2009
Getting Rid of Pesky Live Upgrade Boot Environments
As we discussed earlier, Live Upgrade can solve most of the problems associated with patching and upgrading your Solaris system. I'm not quite ready to post the next installment in the LU series yet, but from some of the comments and email I have received, there are two problems that I would like to help you work around.
Oh where oh where did that file system go ?
One thing you can do to stop Live Upgrade in its tracks is to remove a file system that it thinks another boot environment needs. This does fall into the category of user error, but you are more likely to run into this in a ZFS world where file systems can be created and destroyed with great ease. You will also run into a variant of this if you change your zone configurations without recreating your boot environment, but I'll save that for a later day.
Here is our simple test case:
- Create a ZFS file system.
- Create a new boot environment.
- Delete the ZFS file system.
- Watch Live Upgrade fail.
# zfs create arrakis/temp
# lucreate -n test
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment <s10u7-baseline> file systems with the
file system(s) you specified for the new boot environment. Determining
which file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <test>.
Source boot environment is <s10u7-baseline>.
Creating boot environment <test>.
Cloning file systems from boot environment <s10u7-baseline> to create boot environment <test>.
Creating snapshot for <rpool/ROOT/s10u7-baseline> on <rpool/ROOT/s10u7-baseline@test>.
Creating clone for <rpool/ROOT/s10u7-baseline@test> on <rpool/ROOT/test>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/test>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <s10u6_baseline> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <test> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <nv114> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <route66> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <nv95> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <test> in GRUB menu
Population of boot environment <test> successful.
Creation of boot environment <test> successful.
# zfs destroy arrakis/temp
# luupgrade -t -s /export/patches/10_x86_Recommended-2009-05-14 -O "-d" -n test
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
Validating the contents of the media </export/patches/10_x86_Recommended-2009-05-14>.
The media contains 143 software patches that can be added.
All 143 patches will be added because you did not specify any specific patches to add.
Mounting the BE <test>.
ERROR: Read-only file system: cannot create mount point </.alt.tmp.b-59c.mnt/arrakis/temp>
ERROR: failed to create mount point </.alt.tmp.b-59c.mnt/arrakis/temp> for file system </arrakis/temp>
ERROR: unmounting partially mounted boot environment file systems
ERROR: cannot mount boot environment by icf file </etc/lu/ICF.5>
ERROR: Unable to mount ABE <test>: cannot complete lumk_iconf
Adding patches to the BE <test>.
Validating patches...
Loading patches installed on the system...
Cannot check name /a/var/sadm/pkg.
Unmounting the BE <test>.
The patch add to the BE <test> failed (with result code <1>).
The proper Live Upgrade solution to this problem would be to destroy and recreate the boot environment, or just recreate the missing file system (I'm sure that most of you have figured the latter part out on your own). The rationale is that the alternate boot environment no longer matches the storage configuration of its source. This was fine in a UFS world, but perhaps a bit constraining when ZFS rules the landscape. What if you really wanted the file system to be gone forever?
With a little more understanding of the internals of Live Upgrade, we can fix this rather easily.
Important note: We are about to modify undocumented Live Upgrade configuration files. The formats, names, and contents are subject to change without notice and any errors made while doing this can render your Live Upgrade configuration unusable.
The file system configurations for each boot environment are kept in a set of Internal Configuration Files (ICF) in /etc/lu named ICF.n, where n is the boot environment number. From the error message above we see that /etc/lu/ICF.5 is the one that is causing the problem. Let's take a look.
# cat /etc/lu/ICF.5
test:-:/dev/dsk/c5d0s1:swap:4225095
test:-:/dev/zvol/dsk/rpool/swap:swap:8435712
test:/:rpool/ROOT/test:zfs:0
test:/archives:/dev/dsk/c1t0d0s2:ufs:327645675
test:/arrakis:arrakis:zfs:0
test:/arrakis/misc:arrakis/misc:zfs:0
test:/arrakis/misc2:arrakis/misc2:zfs:0
test:/arrakis/stuff:arrakis/stuff:zfs:0
test:/arrakis/temp:arrakis/temp:zfs:0
test:/audio:arrakis/audio:zfs:0
test:/backups:arrakis/backups:zfs:0
test:/export:arrakis/export:zfs:0
test:/export/home:arrakis/home:zfs:0
test:/export/iso:arrakis/iso:zfs:0
test:/export/linux:arrakis/linux:zfs:0
test:/rpool:rpool:zfs:0
test:/rpool/ROOT:rpool/ROOT:zfs:0
test:/usr/local:arrakis/local:zfs:0
test:/vbox:arrakis/vbox:zfs:0
test:/vbox/fedora8:arrakis/vbox/fedora8:zfs:0
test:/video:arrakis/video:zfs:0
test:/workshop:arrakis/workshop:zfs:0
test:/xp:/dev/dsk/c2d0s7:ufs:70396830
test:/xvm:arrakis/xvm:zfs:0
test:/xvm/fedora8:arrakis/xvm/fedora8:zfs:0
test:/xvm/newfs:arrakis/xvm/newfs:zfs:0
test:/xvm/nv113:arrakis/xvm/nv113:zfs:0
test:/xvm/opensolaris:arrakis/xvm/opensolaris:zfs:0
test:/xvm/s10u5:arrakis/xvm/s10u5:zfs:0
test:/xvm/ub710:arrakis/xvm/ub710:zfs:0
The first step is to clean up the mess left by the failing luupgrade attempt. At the very least we will need to unmount the alternate boot environment root. It is also very likely that we will have to unmount a few temporary directories, such as /tmp and /var/run. Since this is ZFS we will also have to remove the directories created when these file systems were mounted.
# df -k | tail -3
rpool/ROOT/test 49545216 6879597 7546183 48% /.alt.tmp.b-Fx.mnt
swap 4695136 0 4695136 0% /a/var/run
swap 4695136 0 4695136 0% /a/tmp
# luumount test
# umount /a/var/run
# umount /a/tmp
# rmdir /a/var/run /a/var /a/tmp
Next we need to remove the missing file system entry from the current copy of the ICF file. Use whatever method you prefer (vi, perl, grep). Once we have corrected our local copy of the ICF file we must propagate it to the alternate boot environment we are about to patch. You can skip the propagation if you are going to delete the boot environment without doing any other maintenance activities. The normal Live Upgrade operations will take care of propagating the ICF files to the other boot environments, so we should not have to worry about them at this time.
# mv /etc/lu/ICF.5 /tmp/ICF.5
# grep -v arrakis/temp /tmp/ICF.5 > /etc/lu/ICF.5
# cp /etc/lu/ICF.5 `lumount test`/etc/lu/ICF.5
# luumount test
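One caution about the grep -v above: it drops every line that contains the string arrakis/temp, so a dataset with a longer name that starts the same way would be removed too. A slightly more careful sketch matches the third colon-separated field exactly. The helper drop_icf_entry is hypothetical, the field layout is only what can be inferred from the listing above (the format is undocumented), and the arrakis/temp2 entry in the sample is made up to show the difference.

```shell
# drop_icf_entry DATASET < ICF-file
# Copies an ICF file from stdin to stdout, omitting any line whose third
# colon-separated field (the device or dataset) is exactly DATASET.
drop_icf_entry() {
    awk -F: -v ds="$1" '$3 != ds'
}

# Sample lines in the layout shown above; arrakis/temp2 is hypothetical.
drop_icf_entry arrakis/temp <<'EOF'
test:/arrakis/stuff:arrakis/stuff:zfs:0
test:/arrakis/temp:arrakis/temp:zfs:0
test:/arrakis/temp2:arrakis/temp2:zfs:0
EOF
```

Used in place of the grep, it would read from the saved copy and write the new file, for example drop_icf_entry arrakis/temp < /tmp/ICF.5 > /etc/lu/ICF.5. The same warning about editing undocumented files applies.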
At this point we should be good to go. Let's try the luupgrade again.
# luupgrade -t -n test -O "-d" -s /export/patches/10_x86_Recommended-2009-05-14
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
Validating the contents of the media </export/patches/10_x86_Recommended-2009-05-14>.
The media contains 143 software patches that can be added.
All 143 patches will be added because you did not specify any specific patches to add.
Mounting the BE <test>.
Adding patches to the BE <test>.
Validating patches...
Loading patches installed on the system...
Done!
Loading patches requested to install.
Approved patches will be installed in this order:
118668-19 118669-19 119214-19 123591-10 123896-10 125556-03 139100-02
Checking installed patches...
Verifying sufficient filesystem capacity (dry run method)...
Installing patch packages...
Patch 118668-19 has been successfully installed.
Patch 118669-19 has been successfully installed.
Patch 119214-19 has been successfully installed.
Patch 123591-10 has been successfully installed.
Patch 123896-10 has been successfully installed.
Patch 125556-03 has been successfully installed.
Patch 139100-02 has been successfully installed.
Unmounting the BE <test>.
The patch add to the BE <test> completed.
Now that the alternate boot environment has been patched, we can activate it at our convenience.
I keep deleting and deleting and still can't get rid of those pesky boot environments
This is an interesting corner case where the Live Upgrade configuration files get so scrambled that even simple tasks like deleting a boot environment are not possible. Every time I have gotten myself into this situation I can trace it back to some ill-advised shortcut that seemed harmless at the time, but I won't rule out bugs and environment as possible causes.
Here is our simple test case: turn our boot environment from the previous example into a zombie - something that is neither alive nor dead but just takes up space and causes a mild annoyance.
Important note: Don't try this on a production system. This is for demonstration purposes only.
# dd if=/dev/random of=/etc/lu/ICF.5 bs=2048 count=2
0+2 records in
0+2 records out
# ludelete -f test
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
ERROR: The mount point </.alt.tmp.b-fxc.mnt> is not a valid ABE mount point (no /etc directory found).
ERROR: The mount point </.alt.tmp.b-fxc.mnt> provided by the <-m> option is not a valid ABE mount point.
Usage: lurootspec [-l error_log] [-o outfile] [-m mntpt]
ERROR: Cannot determine root specification for BE <test>.
ERROR: boot environment <test> is not mounted
Unable to delete boot environment.
Our first task is to make sure that any partially mounted boot environment is cleaned up. A df should help us here.
# df -k | tail -5
arrakis/xvm/opensolaris 350945280 19 17448377 1% /xvm/opensolaris
arrakis/xvm/s10u5 350945280 19 17448377 1% /xvm/s10u5
arrakis/xvm/ub710 350945280 19 17448377 1% /xvm/ub710
swap 4549680 0 4549680 0% /.alt.tmp.b-fxc.mnt/var/run
swap 4549680 0 4549680 0% /.alt.tmp.b-fxc.mnt/tmp
# umount /.alt.tmp.b-fxc.mnt/tmp
# umount /.alt.tmp.b-fxc.mnt/var/run
Ordinarily you would use lufslist(1M) to try to determine which file systems are in use by the boot environment you are trying to delete. In this worst case scenario that is not possible. A bit of forensic investigation and a bit more courage will help us figure this out.
The first place we will look is /etc/lutab. This is the configuration file that lists all boot environments known to Live Upgrade. There is a man page for this in section 4, so it is somewhat of a public interface, but please take note of the warning:
The lutab file must not be edited by hand. Any user modification to this file will result in the incorrect operation of the Live Upgrade feature.
This is very good advice and failing to follow it has led to some of my most spectacular Live Upgrade meltdowns. But in this case Live Upgrade is already broken and it may be possible to undo the damage and restore proper operation. So let's see what we can find out.
# cat /etc/lutab
# DO NOT EDIT THIS FILE BY HAND. This file is not a public interface.
# The format and contents of this file are subject to change.
# Any user modification to this file may result in the incorrect
# operation of Live Upgrade.
3:s10u5_baseline:C:0
3:/:/dev/dsk/c2d0s0:1
3:boot-device:/dev/dsk/c2d0s0:2
1:s10u5_lu:C:0
1:/:/dev/dsk/c5d0s0:1
1:boot-device:/dev/dsk/c5d0s0:2
2:s10u6_ufs:C:0
2:/:/dev/dsk/c4d0s0:1
2:boot-device:/dev/dsk/c4d0s0:2
4:s10u6_baseline:C:0
4:/:rpool/ROOT/s10u6_baseline:1
4:boot-device:/dev/dsk/c4d0s3:2
10:route66:C:0
10:/:rpool/ROOT/route66:1
10:boot-device:/dev/dsk/c4d0s3:2
11:nv95:C:0
11:/:rpool/ROOT/nv95:1
11:boot-device:/dev/dsk/c4d0s3:2
6:s10u7-baseline:C:0
6:/:rpool/ROOT/s10u7-baseline:1
6:boot-device:/dev/dsk/c4d0s3:2
7:nv114:C:0
7:/:rpool/ROOT/nv114:1
7:boot-device:/dev/dsk/c4d0s3:2
5:test:C:0
5:/:rpool/ROOT/test:1
5:boot-device:/dev/dsk/c4d0s3:2
We can see that the boot environment named test is (still) BE #5 and has its root file system at rpool/ROOT/test. This is the default dataset name and indicates that the boot environment has not been renamed. Consider the following example for a more complicated configuration.
# lucreate -n scooby
# lufslist scooby | grep ROOT
rpool/ROOT/scooby zfs 241152 / -
rpool/ROOT zfs 39284664832 /rpool/ROOT -
# lurename -e scooby -n doo
# lufslist doo | grep ROOT
rpool/ROOT/scooby zfs 241152 / -
rpool/ROOT zfs 39284664832 /rpool/ROOT -
The point is that we have to trust the contents of /etc/lutab but it does not hurt to do a bit of sanity checking before we start deleting ZFS datasets. To remove boot environment test from the view of Live Upgrade, delete the three lines in /etc/lutab starting with 5 (in this example). We should also remove its Internal Configuration File (ICF), /etc/lu/ICF.5.
# mv -f /etc/lutab /etc/lutab.old
# grep -v ^5: /etc/lutab.old > /etc/lutab
# rm -f /etc/lu/ICF.5
# lustatus
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u5_baseline yes no no yes -
s10u5_lu yes no no yes -
s10u6_ufs yes no no yes -
s10u6_baseline yes no no yes -
route66 yes no no yes -
nv95 yes yes yes no -
s10u7-baseline yes no no yes -
nv114 yes no no yes -
If the boot environment being deleted is in UFS then we are done. Well, not exactly - but pretty close. We still need to propagate the updated configuration files to the remaining boot environments. This will be done during the next Live Upgrade operation (lucreate, lumake, ludelete, luactivate) and I would recommend that you let Live Upgrade handle this part. The exception to this will be if you boot directly into another boot environment without activating it first. This isn't a recommended practice and has been the source of some of my most frustrating mistakes.
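Before removing lines from /etc/lutab by hand, it can help to confirm which ID and root dataset are recorded for a boot environment name. Here is a minimal sketch, assuming the three-lines-per-BE layout seen in the lutab listing above; the format is not a public interface and may change. The helper be_info is hypothetical and only reads its input.

```shell
# be_info NAME < lutab
# Prints "ID ROOT" for the named boot environment, where ROOT is the
# device or dataset recorded on that BE's "/" line.
be_info() {
    awk -F: -v name="$1" '
        /^#/ { next }                                  # skip the header comments
        $2 == name && $4 == 0 { id = $1 }              # name line, e.g. 5:test:C:0
        id != "" && $1 == id && $2 == "/" { print id, $3; exit }
    '
}

# Sample input copied from the lutab listing above.
be_info test <<'EOF'
# DO NOT EDIT THIS FILE BY HAND. This file is not a public interface.
7:nv114:C:0
7:/:rpool/ROOT/nv114:1
7:boot-device:/dev/dsk/c4d0s3:2
5:test:C:0
5:/:rpool/ROOT/test:1
5:boot-device:/dev/dsk/c4d0s3:2
EOF
```

Against the real file this would be run as be_info test < /etc/lutab, ideally before the lines are removed. An empty result means the name was not found.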
If the exorcised boot environment is in ZFS then we still have a little bit of work to do. We need to delete the old root datasets and any snapshots that they may have been cloned from. In our example the root dataset was rpool/ROOT/test. We need to look for any children as well as the originating snapshot, if present.
# zfs list -r rpool/ROOT/test
NAME USED AVAIL REFER MOUNTPOINT
rpool/ROOT/test 234K 6.47G 8.79G /.alt.test
rpool/ROOT/test/var 18K 6.47G 18K /.alt.test/var
# zfs get -r origin rpool/ROOT/test
NAME PROPERTY VALUE SOURCE
rpool/ROOT/test origin rpool/ROOT/nv95@test -
rpool/ROOT/test/var origin rpool/ROOT/nv95/var@test
# zfs destroy rpool/ROOT/test/var
# zfs destroy rpool/ROOT/nv95/var@test
# zfs destroy rpool/ROOT/test
# zfs destroy rpool/ROOT/nv95@test
Important note: luactivate will promote the newly activated root dataset so that snapshots used to create alternate boot environments should be easy to delete. If you are switching between boot environments without activating them first (which I have already warned you about doing), you may have to manually promote a different dataset so that the snapshots can be deleted.
To BE or not to BE - how about no BE ?
You may find yourself in a situation where you have things so scrambled up that you want to start all over again. We can use what we have just learned to unwind Live Upgrade and start from a clean configuration. Specifically we want to delete /etc/lutab, the ICF and related files, all of the temporary files in /etc/lu/tmp and a few files that hold environment variables for some of the lu scripts. And if using ZFS we will also have to delete any datasets and snapshots that are no longer needed.
# rm -f /etc/lutab
# rm -f /etc/lu/ICF.* /etc/lu/INODE.* /etc/lu/vtoc.*
# rm -f /etc/lu/.??*
# rm -f /etc/lu/tmp/*
# lustatus
ERROR: No boot environments are configured on this system
ERROR: cannot determine list of all boot environment names
# lucreate -c scooby -n doo
Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <scooby>.
Creating initial configuration for primary boot environment <scooby>.
The device </dev/dsk/c4d0s3> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <scooby> PBE Boot Device </dev/dsk/c4d0s3>.
Comparing source boot environment <scooby> file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <doo>.
Source boot environment is <scooby>.
Creating boot environment <doo>.
Cloning file systems from boot environment <scooby> to create boot environment <doo>.
Creating snapshot for <rpool/ROOT/scooby> on <rpool/ROOT/scooby@doo>.
Creating clone for <rpool/ROOT/scooby@doo> on <rpool/ROOT/doo>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/doo>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <doo> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <doo> in GRUB menu
Population of boot environment <doo> successful.
Creation of boot environment <doo> successful.
# luactivate doo
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE
File deletion successful
File deletion successful
File deletion successful
Activation of boot environment successful.
# lustatus
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
scooby yes yes no no -
doo yes no yes no -
Pretty cool, eh ?
There are still a few more interesting corner cases, but we will deal with those in one of the next articles. In the meantime, please remember to
- Check Infodoc 206844 for Live Upgrade patch requirements
- Keep your patching and package utilities updated
- Use luactivate to switch between boot environments
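Before rebooting, it is worth a quick check that the boot environment you expect is the one marked active on reboot. Here is a minimal sketch that reads lustatus-style output on stdin and prints the name of that boot environment. The function name next_be is my own invention, and the column positions are assumed from the listing above.

```shell
#!/bin/sh
# next_be: read lustatus(1M)-style output on stdin and print the name of
# the boot environment whose "Active On Reboot" column is "yes".
# Assumes the column layout shown in the listing above: name, complete,
# active now, active on reboot, can delete, copy status.
next_be() {
    awk '$4 == "yes" { print $1 }'
}
```

On a real system this would be used as `lustatus | next_be`, which for the listing above should report doo.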

Tuesday March 24, 2009
Adobe releases an x86 version of Acroread 9.1 for Solaris
Great Googly Moogly!!! Our friends at Adobe have finally released a new x86 version of Acroread for Solaris. Download Acroread 9.1 from Adobe.com and say goodbye to evince, xpdf, and the especially interesting Acroread out of the Linux branded zone trick.

Monday March 23, 2009
Dr. Live Upgrade - Or How I Learned to Stop Worrying and Love Solaris Patching
Who loves to patch or upgrade a system ?
That's right, nobody. Or if
you do perhaps we should start a local support group to help you come
to terms with this unusual fascination. Patching, and to a lesser extent
upgrades (which can be thought of as patches delivered more efficiently
through package replacement), is the most common complaint that I
hear when meeting with system administrators and their management.
Most of the difficulties seem to fit into one of the following
categories.
- Analysis: What patches need to be applied to my system ?
- Effort: What do I have to do to perform the required maintenance ?
- Outage: How long will the system be down to perform the maintenance ?
- Recovery: What happens when something goes wrong ?
And if a single system gives you a headache, adding a few containers into
the mix will bring on a full migraine. And without some relief you may be left
with the impression that containers aren't worth the effort. That's
unfortunate because containers don't have to be troublesome and patching doesn't
have to be hard. But it does take getting to know one of the most important
and sadly least used features in Solaris:
Live Upgrade
Before we look at Live Upgrade, let's start with a definition. A
boot environment is
the set of all file systems and devices that are unique to an instance of Solaris on a system.
If you have several boot environments then some data will be shared (non svr4 package installed
applications, data, local home directories) and some will be
exclusive to one boot environment. Not making this more complicated than it needs to be,
a boot environment is generally your root (including /usr and /etc), /var (frequently split out on
a separate file system), and /opt. Swap may or may not be
a part of a boot environment - it is your choice. I prefer to share swap, but there are some
operational situations where this may not be feasible. There may be additional items, but
generally everything else is shared. Network mounted file systems and removable media are assumed
to be shared.
With this definition behind us, let's proceed.
Analysis: What patches need to be applied to my system ?
For all of the assistance that Live Upgrade offers,
it doesn't do anything to help with the analysis phase. Fortunately there
are plenty of tools that can help with this phase. Some of them work nicely
with Live Upgrade, others take a bit more effort.
smpatch(1M) has an analyze capability that can determine which patches need
to be applied to your system. It will get a list of patches from an update
server, most likely one at Sun, and match up the dependencies and requirements
with your system. smpatch can be used to download these patches for future
application or it can apply them for you. smpatch works nicely with Live Upgrade,
so from a single command you can upgrade an alternate boot environment. With containers!
The
Sun Update Manager is a simple to use graphical front end for smpatch. It
gives you a little more flexibility during the inspection phase by allowing
you to look at individual patch README files. It is also much easier to see
what collection a patch belongs to (recommended, security, none) and if the
application of that patch will require a reboot. For all of that additional
flexibility you lose the integration with Live Upgrade. Not for lack of
trying, but I have not found a good way to make Update Manager and Live
Upgrade play together.
Sun xVM Ops Center has a much more sophisticated patch analysis system that
uses additional knowledge engines beyond those used by smpatch and Update
Manager. The result is a higher quality patch bundle tailored for each
individual system, automated deployment of the patch bundle, detailed auditing
of what was done and simple backout should problems occur. And it basically
does the same for Windows and Linux. It is this last feature that makes things
interesting. Neither Windows nor Linux has anything like Live Upgrade and
the least common denominator approach of Ops Center in its current state means
that it doesn't work with Live Upgrade. Fortunately this will change in the
not too distant future, and when it does I will be shouting about this feature
from rooftops (OK, what I really mean is I'll post a blog and a tweet about
it). If I can coax Ops Center into doing the analysis and download pieces then
I can manually bolt it onto Live Upgrade for a best of both worlds solution.
These are our offerings and there are others. Some of them are quite good and in use
in many places.
Patch Check Advanced (PCA) is one of the more common tools in
use. It operates on a patch dependency cross reference file and does a good job with the
dependency analysis (this is obsoleted by that, etc). It can be used to maintain an
alternate boot environment and in simple cases that would be fine. If the alternate
boot environment contains any containers then I would use Live Upgrade's luupgrade
instead of PCA's patchadd -R approach. If I were familiar with PCA then I would still
use it for the analysis and download feature. Just let luupgrade apply the patches.
You might have to uncompress the patches downloaded by PCA before handing them over
to luupgrade, but that is a minor implementation detail.
In summary, use an analysis tool appropriate to the task (based on familiarity, budget
and complexity) to figure out what patches are needed. Then use Live Upgrade (luupgrade)
to deploy the desired patches.
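As a rough illustration of that last step, here is a small sketch of handing a directory of unpacked patches to luupgrade. The boot environment name, the patch directory and the patch IDs are all made up for the example, and the run wrapper only prints the command unless DRYRUN=0 is set, so it is safe to try anywhere.

```shell
#!/bin/sh
# run: print the command by default; execute it only when DRYRUN=0.
run() {
    if [ "${DRYRUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# apply_patches BE PATCH_DIR PATCH_ID...
# Apply the named patches from PATCH_DIR to the alternate boot
# environment BE using luupgrade -t.
apply_patches() {
    be=$1
    patch_dir=$2
    shift 2
    run luupgrade -t -n "$be" -s "$patch_dir" "$@"
}

# Hypothetical boot environment, directory and patch IDs.
apply_patches doo /var/tmp/patches 123456-01 123457-02
```

The analysis tool decides what goes into the patch directory; luupgrade only needs the boot environment name, the directory and the list of patch IDs.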
Effort: What does it take to perform the required maintenance ?
This is a big topic and I could write pages on the subject. Even if I use an analysis
tool like smpatch or pca to save me hours of trolling through READMEs drawing dependency
graphs, there is still a lot of work to do in order to survive the ordeal of applying
patches. Some of the more common techniques include ....
Backing up your boot environment.
I should not have to mention this, but there are some
operational considerations unique to system maintenance. Even though tiny, there is
a greater chance that you will render your system non-bootable during system maintenance
than any other operational task. Even with mature processes, human factors can come into
play and bad things can happen (oops - that was my fallback boot environment that I just
ran newfs(1M) on).
This is why automation and time tested scripting become so important.
Should you do the unthinkable and render a system nonfunctional, rapid restoration of the
boot environment is important. And getting it back to the last known good state is just
as important. A fresh backup that can be restored by utilities from install media or
jumpstart miniroot is a very good idea. Flash archives (see flarcreate(1M)) are even
better, although complications with containers make this less interesting now than in
previous releases of Solaris. How many of you take a backup before applying patches ? Probably
about the same number as replace batteries in your RAID controllers or change out
your UPS systems after their expiration date.
Split Mirrors
One interesting technique is to split mirrors instead of backups. Of course this only works
if you mirror your boot environment (a recommended practice for those systems with adequate
disk space). Break your mirror, apply patches to the non-running half, cut over the
updated boot environment during the next maintenance window and see how this goes. At
first glance this seems like a good idea, but there are two catches.
- Do you synchronize dynamic boot environment elements ? Things like /etc/passwd,
/etc/shadow, /var/adm/messages, print and mail queues are constantly changing. It is
possible that these have changed between the mirror split and subsequent activation.
- How long are you willing to run without your boot environment being mirrored ?
This may cause you to certify the new boot environment too quickly. You want to
reestablish your mirror, but if that is your fallback in case of trouble you have
a conundrum. And if you are the sort that seems to have a black cloud following you
through life, you will discover a problem shortly after you started the mirror resync.
Pez disks ?
OK, the mirror split thing can be solved by swinging in another disk. Operationally a
bit more complex and you have at least one disk that you can't use for other purposes
(like hosting a few containers), but it can be done. I wouldn't do it (mainly because I
know where this story is heading) but many of you do.
Better living through Live Upgrade
Everything we do to try to make it better adds complexity, or another hundred lines of
scripting. It doesn't need to be this way, and if you become one with the LU commands
it won't for you either. Live Upgrade will take care of building and updating multiple
boot environments. It will check to make sure the disks being used are bootable
and not part of another boot environment. It works with the Solaris Volume
Manager, Veritas encapsulated root devices, and starting with Solaris 10 10/08
(update 6) ZFS. It also takes care of the synchronization
problem. Starting with Solaris 10 8/07 (update 4), Live Upgrade also works
with containers, both native and branded (and with Solaris 10 10/08 your zoneroots
can be in a ZFS pool).
Outage: How long will my system be down for the maintenance?
Or perhaps more to the point, how long will my applications be unavailable ? The
proper reply is it depends on how big the patch bundle is and how many containers
you have. And if a kernel patch is involved, double or triple your estimate.
This can be a big problem and cause you to take short cuts like installing only some
patches now and others later when it is more convenient. Our good friend
Bart Smaalders has a nice discussion on the implications of this approach and
what we are doing in OpenSolaris to solve this. That solution will eventually work its
way into the Next Solaris, but in the meantime we have a problem to solve.
There is a large set (not really large, but more than one) of patches that require
a quiescent system to be properly applied. An example would be a kernel patch
that causes a change to libc. It is sort of hard to rip out libc on a running
system (new processes get the new libc, which may have issues with the running
kernel, old processes get the old libc and tend to be fine, until they do a
fork(2) and exec(2)). So we developed a brilliant solution to this problem -
deferred activation patching. If you apply one of these troublesome patches
then we will throw it in a queue to be applied the next time the system is
quiesced (a fancy term for the next time we're in single user mode). This solves the
current system stability concerns but may make the next reboot take a bit longer.
And if you forgot you have deferred patches in your queue, don't get anxious and
interrupt the shutdown or next boot. Grab a noncaffeinated beverage and
put some Bobby McFerrin on your iPod. Don't Worry, Be Happy.
So deferred activation patching seems like a good way to deal with the situation
where everything goes well. And some brilliant engineers are working on applying
patches in parallel (where applicable) which will make this even better. But what happens
when things go wrong ? This is when you realize that patchrm(1M) is not your friend. It has
never been your friend, nor will it ever be. I have an almost paralyzing fear of dentists,
but would rather visit one than start
down a path where patchrm is involved. Well tested tools and some automation can reduce this to simple
anxiety, but if I could eliminate patchrm altogether I would be much happier.
For all that Live Upgrade can do to ease system maintenance, it is the areas of outage and
recovery that make it special. And when speaking about Solaris, either in
training or evangelism events, this is why I urge attendees to drop whatever they are doing
and adopt Live Upgrade immediately.
Since Live Upgrade (lucreate, lumake, luupgrade) operates on an alternate boot environment,
the currently running set of applications are not affected. The system stays up, applications
stay running and nothing is changing underneath them so there is no cause for concern. The only
impact is some additional load by the live upgrade operations. If that is a concern then run
live upgrade in a project and cap resource consumption to that project.
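Here is a minimal sketch of what that might look like. The project name lupatch and the 50% CPU cap are assumptions picked for the example, and the project.cpu-cap resource control needs a Solaris 10 release that supports CPU caps. The run wrapper only prints the commands unless DRYRUN=0 is set.

```shell
#!/bin/sh
# run: print the command by default; execute it only when DRYRUN=0.
run() {
    if [ "${DRYRUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# capped_lu PROJECT CPU_CAP COMMAND...
# Create a project with a CPU cap (percent of one CPU) and run the given
# Live Upgrade command inside it with newtask(1).
capped_lu() {
    project=$1
    cap=$2
    shift 2
    run projadd -K "project.cpu-cap=(privileged,$cap,deny)" "$project"
    run newtask -p "$project" "$@"
}

# Hypothetical project name, cap and patch ID.
capped_lu lupatch 50 luupgrade -t -n doo -s /var/tmp/patches 123456-01
```

The project only needs to be created once; after that, any Live Upgrade command started through newtask in that project inherits the cap.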
An interesting implication of Live Upgrade is that the operational sanity of each step is no
longer required. All that matters is the end state. This gives us more freedom to apply patches
in a more efficient fashion than would be possible on a running boot environment. This is
especially noticeable on a system with containers. The time that the upgrade runs is significantly
reduced, and all the while applications are running. No more deferred activation patches, no more
single user mode patching. And if all goes poorly after activating the new boot environment you
still have your old one to fall back on. Cue Bobby McFerrin for another round of "Don't Worry, Be Happy".
This brings up another feature of Live Upgrade - the synchronization of system files in flight
between boot environments. After a boot environment is activated, a synchronization process
is queued as a K0 script to be run during shutdown. Live Upgrade will catch a lot of private
files that we know about and the obvious public ones (/etc/passwd, /etc/shadow, /var/adm/messages,
mail queues). It also provides a place (/etc/lu/synclist) for you to include things we might not
have thought about or are unique to your applications.
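Entries in the synclist are a path followed by an action (OVERWRITE, APPEND or PREPEND). Here is a small sketch of a helper that adds an entry only if the path is not already listed. The helper name and the example path are hypothetical, and the file is passed in as an argument so you can try it on a scratch copy before touching /etc/lu/synclist.

```shell
#!/bin/sh
# add_sync_entry FILE PATH ACTION
# Append "PATH<TAB>ACTION" to FILE unless PATH is already listed.
# ACTION must be one of OVERWRITE, APPEND or PREPEND.
add_sync_entry() {
    synclist=$1
    path=$2
    action=$3
    case $action in
        OVERWRITE|APPEND|PREPEND) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    # Nothing to do if the path already has an entry.
    if awk -v p="$path" '$1 == p { found = 1 } END { exit !found }' \
        "$synclist" 2>/dev/null; then
        return 0
    fi
    printf '%s\t%s\n' "$path" "$action" >> "$synclist"
}
```

For example, `add_sync_entry /etc/lu/synclist /etc/myapp/myapp.conf OVERWRITE` would ask Live Upgrade to carry that (made up) application file across to the newly activated boot environment.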
When using Live Upgrade applications are only unavailable for the amount of time it takes
to shut down the system (the synchronization process) and boot the new boot environment.
This may include some minor SMF manifest importing but that should not add much
to the new boot time. You only have to complete the restart during a maintenance window,
not the entire upgrade. While vampires are all the rage for teenagers these days,
system administrators can now come out into the light and work regular hours.
Recovery: What happens when something goes wrong?
This is when you will fully appreciate Live Upgrade. After activation of a new
boot environment, now called the Primary Boot Environment (PBE), your old boot
environment, now called an Alternate Boot Environment (ABE) can still be called upon in case of trouble.
Just activate it and shut down the system. Applications will be down for a short period
(the K0 sync and subsequent start up), but there will be no more wringing of the hands,
reaching for beverages with too much caffeine and vitamin B12, trying to remember where you kept your bottle
of Tums. Cue Bobby McFerrin one more time and "Don't Worry, Be Happy". You will be back to your previous operational
state in a matter of a few minutes (longer if you have a large server with many disks).
Then you can mount up your ABE and troll through the logs trying to determine what went wrong.
If you have a service contract then we will troll through the logs with you.
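The fallback itself is short. Here is a minimal sketch, using scooby as a stand-in for the name of the old boot environment and a run wrapper that only prints the commands unless DRYRUN=0 is set. Note the use of init rather than reboot or halt: the shutdown scripts, including the Live Upgrade synchronization, need a chance to run.

```shell
#!/bin/sh
# run: print the command by default; execute it only when DRYRUN=0.
run() {
    if [ "${DRYRUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# fall_back BE
# Reactivate the previous boot environment and restart through init(1M)
# so that the shutdown scripts are run.
fall_back() {
    run luactivate "$1"
    run init 6
}

# Hypothetical name of the old boot environment.
fall_back scooby
```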
I neglected to mention earlier, disks that comprise boot environments can be mirrored,
so there is no rush to certification. Everything can be mirrored, at all times. Which is a
very good thing. You still need to back up your boot environments, but you will find yourself
reaching for the backup media much less often when using Live Upgrade.
All that is left are a few simple examples of how to use Live Upgrade. I'll save that for next
time.

Tuesday November 11, 2008
OpenSolaris 2008.11 Release Candidate 1B (nv101a) is now available for testing
The initial release candidate (rc1b) for OpenSolaris 2008.11 (based on nv101a) is now
available
for download and testing. Additional (larger) images are available for non-English locales
as well as USB images for faster installs. If you have not played with a USB
image you will be dazzled at the speed of the installation. Amazing what happens
when you eliminate all those slow seeks.
The new release candidate has quite a few interesting features and updates. The
items that caught my attention were
- IPS Package Manager
- Automatically cloning root file system (beadm clone) during image update
- GNOME 2.24
- Evolution 2.24 for those of us that are stubborn enough to continue using it
- OpenOffice 3.0
- Songbird - an iTunes-like media player. Still needs lots of codecs (like the free Fluendo MP3 decoder) to be really useful
- Brasero - a Nero-like media burner
Our own Dan Roberts has more to say on the subject
in this video podcast.
Using the graphical package manager it only took a few minutes to set up the installation
plan for a nice web based development system including Netbeans, a web stack (including
Glassfish), and a Xen based virtualization system.
OpenSolaris 2008.11 is shaping up to be quite a nice release. Now that I have figured out how
to make it play nicely in a root zpool with other Solaris releases, I will be spending a lot more
time with it as the daily driver.
Download it, play with it, and please remember to
file bugs
when you run into things that don't work.

Tuesday November 04, 2008
Solaris and OpenSolaris coexistence in the same root zpool
Some time ago, my buddy
Jeff Victor gave us
FrankenZone. An idea that is disturbingly brilliant. It has taken me a while, but I offer for your consideration VirtualBox as a V2P platform for
OpenSolaris. Nowhere near as brilliant, but at least as unusual. And you know that you have to try this out at home.
Note: This is totally a science experiment. I fully expect to see the two guys from
Myth Busters showing up at any moment. It also requires at least build 100 of OpenSolaris on both the host and guest operating system to work around the hostid difficulties.
With the caveats out of the way, let me set the back story
to explain how I got here.
Until virtualization technologies become ubiquitous and nothing more than BIOS extensions, multi-boot
configurations will continue to be an important capability. And for those working with [Open]Solaris
there are several limitations that complicate this unnecessarily. Rather than lamenting these, the
possibility of leveraging ZFS root pools, now in
Solaris 10 10/08, should offer up
some interesting solutions.
What I want to do is simple - have a single Solaris fdisk partition that can have multiple versions
of Solaris all bootable with access to all of my data. This doesn't seem like much of a request, but as of yet this
has been nearly impossible to accomplish in anything close to a supportable configuration.
As it turns out the essential limitation is in the
installer -
all other issues can be handled if we can figure out how to install
OpenSolaris into an existing pool.
What we will do is use our friend
VirtualBox to work around the
installer issues. After installing OpenSolaris in a virtual machine we take a ZFS snapshot,
send it to the bare metal Solaris host and restore it in the root pool. Finally we fix up a few configuration files
to make everything work and we will be left with a single root pool that can boot Solaris 10,
Solaris Express Community Edition (nevada), and OpenSolaris.
How cool is that :-) Yeah, it is that cool. Let's proceed.
Prepare the host system
The host system is running a fresh install of Solaris 10 10/08 with a single large root
zpool. In this example the root zpool is named panroot. There is also a separate zpool
that contains data that needs to be preserved in case a re-installation of Solaris is required.
That zpool is named pandora, but it doesn't matter - it will be automatically imported
in our new OpenSolaris installation if all goes well.
# lustatus
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u6_baseline yes no no yes -
s10u6 yes no no yes -
nv95 yes yes yes no -
nv101a yes no no yes -
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
pandora 64.5G 56.9G 7.61G 88% ONLINE -
panroot 40G 26.7G 13.3G 66% ONLINE -
One challenge that came up was the less than stellar performance of ssh over the VirtualBox NAT
interface. So rather than fight this I set up a shared NFS file system in the root
pool to stage the ZFS backup file. This made the process go much faster.
In the host Solaris system
# zfs create -o sharenfs=rw,anon=0 -o mountpoint=/share panroot/share
Prepare the OpenSolaris virtual machine
If you have not already done so,
get a copy
of VirtualBox, install it and set up a virtual machine for OpenSolaris.
Important note: Do not install the VirtualBox guest additions. This will install some SMF services that will fail when booted on bare metal.
Send a ZFS snapshot to the host OS root zpool
Let's take a look around the freshly installed OpenSolaris system to see what we want to send.
Inside the OpenSolaris virtual machine
bash-3.2$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 6.13G 9.50G 46K /rpool
rpool/ROOT 2.56G 9.50G 18K legacy
rpool/ROOT/opensolaris 2.56G 9.50G 2.49G /
rpool/dump 511M 9.50G 511M -
rpool/export 2.57G 9.50G 2.57G /export
rpool/export/home 604K 9.50G 19K /export/home
rpool/export/home/bob 585K 9.50G 585K /export/home/bob
rpool/swap 512M 9.82G 176M -
My host system root zpool (panroot) already has swap and dump, so these won't be needed. And it also has an /export hierarchy for home directories. I will
recreate my OpenSolaris Primary System Administrator user once on bare metal, so it appears the only thing I need to bring over is the root dataset itself.
Inside the OpenSolaris virtual machine
bash-3.2$ pfexec zfs snapshot rpool/ROOT/opensolaris@scooby
bash-3.2$ pfexec zfs send rpool/ROOT/opensolaris@scooby > /net/10.0.2.2/share/scooby.zfs
We are now done with the virtual machine. It can be shut down and the storage reclaimed for other purposes.
Restore the ZFS dataset in the host system root pool
In addition to restoring the OpenSolaris root dataset, the canmount property should be set to noauto. I also destroy the NFS shared file system since it will no longer be needed.
# zfs receive panroot/ROOT/scooby < /share/scooby.zfs
# zfs set canmount=noauto panroot/ROOT/scooby
# zfs destroy panroot/share
Now mount the new OpenSolaris root filesystem and fix up a few configuration files. Specifically
- /etc/zfs/zpool.cache so that all boot environments have the same view of available ZFS pools
- /etc/hostid to keep all of the boot environments using the same hostid. This is extremely important and failure to do this will leave some of your boot environments unbootable - which isn't very useful. /etc/hostid is new to build 100 and later.
Rebuild the OpenSolaris boot archive and we will be done with that filesystem.
# zfs set canmount=noauto panroot/ROOT/scooby
# zfs set mountpoint=/mnt panroot/ROOT/scooby
# zfs mount panroot/ROOT/scooby
# cp /etc/zfs/zpool.cache /mnt/etc/zfs
# cp /etc/hostid /mnt/etc/hostid
# bootadm update-archive -f -R /mnt
Creating boot_archive for /mnt
updating /mnt/platform/i86pc/amd64/boot_archive
updating /mnt/platform/i86pc/boot_archive
# umount /mnt
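Since a mismatched hostid is the thing most likely to leave a boot environment unbootable, a quick comparison before unmounting is cheap insurance. Here is a minimal sketch. The function name is my own, and it simply compares etc/hostid under two root directories.

```shell
#!/bin/sh
# check_hostid ROOT_A ROOT_B
# Succeed only if etc/hostid exists under both roots and is identical.
check_hostid() {
    if cmp -s "$1/etc/hostid" "$2/etc/hostid"; then
        echo "hostid matches"
    else
        echo "hostid differs or is missing" >&2
        return 1
    fi
}
```

With the new root still mounted on /mnt, `check_hostid / /mnt` would confirm that the copy took.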
Make a home directory for your OpenSolaris administrator user (in this example the user is named admin). Also add a GRUB stanza so that OpenSolaris can be booted.
# mkdir -p /export/home/admin
# chown admin:admin /export/home/admin
# cat >> /panroot/boot/grub/menu.lst <<'DOO'
title Scooby
root (hd0,3,a)
bootfs panroot/ROOT/scooby
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive
DOO
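If you expect to do this for more than one boot environment, here is a small sketch of a helper that prints the stanza to stdout so it can be reviewed before being appended to menu.lst. The function name is hypothetical. The single quotes matter: GRUB needs to see $ISADIR and $ZFS-BOOTFS literally, so the shell must not expand them.

```shell
#!/bin/sh
# grub_stanza TITLE DATASET ROOT
# Print a GRUB stanza for a ZFS boot environment. The kernel$ and module$
# lines are single-quoted so that $ISADIR and $ZFS-BOOTFS reach GRUB
# unexpanded.
grub_stanza() {
    printf 'title %s\n' "$1"
    printf 'root %s\n' "$3"
    printf 'bootfs %s\n' "$2"
    printf '%s\n' 'kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS'
    printf '%s\n' 'module$ /platform/i86pc/$ISADIR/boot_archive'
}

grub_stanza Scooby panroot/ROOT/scooby '(hd0,3,a)'
```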
At this point we are done. Reboot the system and you should see a new GRUB stanza
for our new OpenSolaris installation (scooby). Cue large audience applause track.
Live Upgrade and OpenSolaris Boot Environment Administration
One interesting side effect, on the positive side, is the healthy interaction of
Live Upgrade and beadm(1M). For your Solaris and nevada based installations you
can continue to use lucreate(1M), luupgrade(1M), and luactivate(1M). On the
OpenSolaris side you can see all of your Live Upgrade boot environments as well as
your OpenSolaris boot environments. Note that we can create and activate new
boot environments as needed.
When in OpenSolaris
# beadm list
BE Active Mountpoint Space Policy Created
-- ------ ---------- ----- ------ -------
nv101a - - 18.17G static 2008-11-04 00:03
nv95 - - 122.07M static 2008-11-03 12:47
opensolaris - - 2.83G static 2008-11-03 16:23
opensolaris-2008.11-baseline R - 2.49G static 2008-11-04 11:16
s10u6 - - 97.22M static 2008-11-03 12:03
s10x_u6wos_07b - - 205.48M static 2008-11-01 20:51
scooby N / 2.61G static 2008-11-04 10:29
# beadm create doo
# beadm activate doo
# beadm list
BE Active Mountpoint Space Policy Created
-- ------ ---------- ----- ------ -------
doo R - 5.37G static 2008-11-04 16:23
nv101a - - 18.17G static 2008-11-04 00:03
nv95 - - 122.07M static 2008-11-03 12:47
opensolaris - - 25.5K static 2008-11-03 16:23
opensolaris-2008.11-baseline - - 105.0K static 2008-11-04 11:16
s10u6 - - 97.22M static 2008-11-03 12:03
s10x_u6wos_07b - - 205.48M static 2008-11-01 20:51
scooby N / 2.61G static 2008-11-04 10:29
For the first time I have a single Solaris disk environment that can boot Solaris 10, Solaris Express Community Edition (nevada) or OpenSolaris and have access to all of my data. I did have to add a mount for my shared FAT32 file system (I have an iPhone
and several iPods - so Windows do occasionally get opened), but that is about it. Now off to the repository to start playing with
all of the new OpenSolaris goodies like Songbird, Brasero, Bluefish and the Xen bits.

Monday February 18, 2008
ZFS and FMA - Two great tastes .....
Our good friend
Isaac Rozenfeld talks about the Multiplicity of
Solaris. When talking about Solaris I will use the phrase "The Vastness of Solaris".
If you have attended a Solaris Boot Camp or Tech Day in the last few years you get
an idea of what we are talking about - when we go on about Solaris hour after hour
after hour.
But the key point in Isaac's multiplicity discussion is how the cornucopia of
Solaris features work together to do some pretty spectacular (and competitively
differentiating) things. In the past we've looked at combinations such as
ZFS and Zones or
Service Management, Role Based Access Control (RBAC) and Least Privilege. Based on
a conversation last week in St. Louis, let's consider how ZFS and Solaris
Fault Management (FMA) play together.
Preparation
Let's begin by creating some fake devices that we can play with. I don't have enough disks
on this particular system, but I'm not going to let that slow me down. If you have sufficient
real hot swappable disks, feel free to use them instead.
# mkfile 1g /dev/dsk/disk1
# mkfile 1g /dev/dsk/disk2
# mkfile 512m /dev/dsk/disk3
# mkfile 512m /dev/dsk/disk4
# mkfile 1g /dev/dsk/disk5
Now let's create a couple of zpools using the fake devices.
pool1 will be a 1GB mirrored pool using disk1 and disk2. pool2 will be a 512MB mirrored
pool using disk3 and disk4. Device spare1 will spare both pools in case of a problem -
which we are about to inflict upon the pools.
# zpool create pool1 mirror disk1 disk2 spare spare1
# zpool create pool2 mirror disk3 disk4 spare spare1
# zpool status
pool: pool1
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
pool: pool2
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk3 ONLINE 0 0 0
disk4 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
So far so good. If we were to run a scrub on either pool, it would complete immediately.
Remember that unlike hardware RAID disk replacement, ZFS scrubbing and resilvering only
touch blocks that contain actual data. Since there is no data in these pools (yet),
there is little for the scrubbing process to do.
# zpool scrub pool1
# zpool scrub pool2
# zpool status
pool: pool1
state: ONLINE
scrub: scrub completed with 0 errors on Mon Feb 18 09:24:16 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
pool: pool2
state: ONLINE
scrub: scrub completed with 0 errors on Mon Feb 18 09:24:17 2008
config:
NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk3 ONLINE 0 0 0
disk4 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
Let's populate both pools with some data. I happen to have a directory of
scenic images that I use as screen backgrounds - that will work nicely.
# cd /export/pub/pix
# find scenic -print | cpio -pdum /pool1
# find scenic -print | cpio -pdum /pool2
# df -k | grep pool
pool1 1007616 248925 758539 25% /pool1
pool2 483328 248921 234204 52% /pool2
And yes, cp -r would have been just as good.
Problem 1: Simple data corruption
Time to inflict some harm upon the pool. First, some simple corruption.
Writing some zeros over half of the mirror should do quite nicely.
# dd if=/dev/zero of=/dev/dsk/disk1 bs=8192 count=10000 conv=notrunc
10000+0 records in
10000+0 records out
At this point we are unaware that anything has happened to our data. So let's
try accessing some of the data to see if we can observe ZFS self healing in action.
If your system has plenty of memory and is relatively idle, accessing the data may
not be sufficient. If you still end up with no errors after the cpio, try a
zpool scrub - that will catch all errors in the data.
# cd /pool1
# find . -print | cpio -ov > /dev/null
416027 blocks
Let's ask our friend fmstat(1M) if anything is wrong.
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.1 0 0 0 0 0 0
disk-transport 0 0 0.0 366.5 0 0 0 0 32b 0
eft 0 0 0.0 2.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 1 0 0.0 0.2 0 0 0 0 0 0
io-retire 0 0 0.0 1.1 0 0 0 0 0 0
snmp-trapgen 1 0 0.0 16.0 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 620.3 0 0 0 0 0 0
syslog-msgs 1 0 0.0 9.7 0 0 0 0 0 0
zfs-diagnosis 162 162 0.0 1.5 0 0 1 0 168b 140b
zfs-retire 1 1 0.0 112.3 0 0 0 0 0 0
As the guys in the Guinness commercial say, "Brilliant!" The important thing to note
here is that the zfs-diagnosis engine has run several times indicating that there is
a problem somewhere in one of my pools. I'm also running this on Nevada so the
zfs-retire engine has also run, kicking in a hot spare due to excessive errors.
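If you want to keep an eye on this from a script, here is a minimal sketch that pulls the ev_recv count for a single module out of fmstat-style output. The function name is made up, and the column position is assumed from the listing above.

```shell
#!/bin/sh
# module_events MODULE
# Read fmstat(1M)-style output on stdin and print the ev_recv count
# (second column) for the named module.
module_events() {
    awk -v m="$1" '$1 == m { print $2 }'
}
```

On a live system this would be run as `fmstat | module_events zfs-diagnosis`. Anything other than zero is a hint to go and look at fmadm faulty.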
So which pool is having the problems ? We continue our FMA investigation
to find out.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH Major
Fault class : fault.fs.zfs.vdev.checksum
Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://sun.com/msg/ZFS-8000-GH for more information.
Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
# zpool status -x
pool: pool1
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress, 44.83% done, 0h0m to go
config:
NAME STATE READ WRITE CKSUM
pool1 DEGRADED 0 0 0
mirror DEGRADED 0 0 0
spare DEGRADED 0 0 0
disk1 DEGRADED 0 0 162 too many errors
spare1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 INUSE currently in use
errors: No known data errors
This tells us all that we need to know. The device disk1 was found to have
quite a few checksum errors - so many in fact that it was replaced automatically
by a hot spare. The spare was resilvering and a full complement of data replicas
would be available soon. The entire process was automatic and completely observable.
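Picking the sick device out of zpool status can also be scripted. The following is a rough sketch, assuming the config table layout shown above (a NAME STATE READ WRITE CKSUM header, with the device rows ending at the spares or errors line). The function name zpool_unhealthy_devices is hypothetical.

```shell
# List devices from `zpool status` output that are unavailable or have
# nonzero error counters. Reads the status text on stdin.
zpool_unhealthy_devices() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The spares list and the errors summary end the device table.
        $1 == "spares" || $1 == "errors:" { in_config = 0 }
        # Report a row if the device is gone or any counter is not "0".
        in_config && NF >= 5 &&
            ($2 ~ /^(UNAVAIL|FAULTED|OFFLINE|REMOVED)$/ ||
             $3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $2, "cksum=" $5
        }
    '
}

# Usage on a live system (not run here):
#   zpool status -x | zpool_unhealthy_devices
```

The counters are compared as strings because zpool abbreviates large values (2.60K, for example).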
Since we inflicted harm upon the (fake) disk device ourselves, we know that it is in fact quite
healthy. So we can restore our pool to its original configuration rather simply - by detaching
the spare and clearing the error. We should also clear the FMA counters and repair the
ZFS vdev so that we can tell if anything else is misbehaving in either this or another pool.
# zpool detach pool1 spare1
# zpool clear pool1
# zpool status pool1
pool: pool1
state: ONLINE
scrub: resilver completed with 0 errors on Mon Feb 18 10:25:26 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
# fmadm reset zfs-diagnosis
# fmadm reset zfs-retire
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 223.5 0 0 0 0 32b 0
eft 1 0 0.0 4.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 4 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 1.1 0 0 0 0 0 0
snmp-trapgen 4 0 0.0 8.8 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 372.7 0 0 0 0 0 0
syslog-msgs 4 0 0.0 5.4 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 1.4 0 0 0 0 0 0
zfs-retire 0 0 0.0 0.0 0 0 0 0 0 0
# fmdump -v -u d82d1716-c920-6243-e899-b7ddd386902e
TIME UUID SUNW-MSG-ID
Feb 18 09:51:49.3025 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
100% fault.fs.zfs.vdev.checksum
Problem in:
Affects: zfs://pool=pool1/vdev=449a3328bc444732
FRU: -
Location: -
# fmadm repair zfs://pool=pool1/vdev=449a3328bc444732
fmadm: recorded repair to zfs://pool=pool1/vdev=449a3328bc444732
# fmadm faulty
Problem 2: Device failure
Time to do a little more harm. In this case I will simulate the failure of
a device by removing the fake device. Again we will access the pool and then
consult fmstat to see what is happening (are you noticing a pattern here????).
# rm -f /dev/dsk/disk2
# cd /pool1
# find . -print | cpio -oc > /dev/null
416027 blocks
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 214.2 0 0 0 0 32b 0
eft 1 0 0.0 4.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 4 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 1.1 0 0 0 0 0 0
snmp-trapgen 4 0 0.0 8.8 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 372.7 0 0 0 0 0 0
syslog-msgs 4 0 0.0 5.4 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 1.4 0 0 0 0 0 0
zfs-retire 0 0 0.0 0.0 0 0 0 0 0 0
Rats, the find ran totally out of cache from the last example. As before, should
this happen, proceed directly to zpool scrub.
# zpool scrub pool1
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 190.5 0 0 0 0 32b 0
eft 1 0 0.0 4.1 0 0 0 0 1.4M 0
fmd-self-diagnosis 5 0 0.0 0.5 0 0 0 0 0 0
io-retire 1 0 0.0 1.0 0 0 0 0 0 0
snmp-trapgen 6 0 0.0 7.4 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 329.0 0 0 0 0 0 0
syslog-msgs 6 0 0.0 4.6 0 0 0 0 0 0
zfs-diagnosis 16 1 0.0 70.3 0 0 1 1 168b 140b
zfs-retire 1 0 0.0 509.8 0 0 0 0 0 0
Again, hot sparing has kicked in automatically. The evidence of this is the
zfs-retire engine running.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 18 11:07:29 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3 Major
Feb 18 11:16:43 06bfe323-2570-46e8-f1a2-e00d8970ed0d
Fault class : fault.fs.zfs.device
Description : A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for
more information.
Response : No automated response will occur.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
# zpool status -x
pool: pool1
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: resilver in progress, 4.94% done, 0h0m to go
config:
NAME STATE READ WRITE CKSUM
pool1 DEGRADED 0 0 0
mirror DEGRADED 0 0 0
disk1 ONLINE 0 0 0
spare DEGRADED 0 0 0
disk2 UNAVAIL 0 0 0 cannot open
spare1 ONLINE 0 0 0
spares
spare1 INUSE currently in use
errors: No known data errors
As before, this tells us all that we need to know. A device (disk2) has failed and
is no longer in operation. Sufficient spares existed and one was automatically
attached to the damaged pool. Resilvering completed successfully and the data is
once again fully mirrored.
But here's the magic. Let's repair the device - again simulated with our fake
device.
# mkfile 1g /dev/dsk/disk2
# zpool replace pool1 disk2
# zpool status pool1
pool: pool1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 4.86% done, 0h1m to go
config:
NAME STATE READ WRITE CKSUM
pool1 DEGRADED 0 0 0
mirror DEGRADED 0 0 0
disk1 ONLINE 0 0 0
spare DEGRADED 0 0 0
replacing DEGRADED 0 0 0
disk2/old UNAVAIL 0 0 0 cannot open
disk2 ONLINE 0 0 0
spare1 ONLINE 0 0 0
spares
spare1 INUSE currently in use
errors: No known data errors
Get a cup of coffee while the resilvering process runs.
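If you would rather have a script wait for you, the scrub: line of zpool status is enough to tell when the resilver is finished. This is a sketch only, assuming the scrub: line wording seen in the transcripts in this article ("in progress" and "completed"). The function name scrub_state is made up for this example.

```shell
# Classify the "scrub:" line of `zpool status` output read from stdin.
# Prints one of: running, done, none, unknown (no scrub: line seen).
scrub_state() {
    awk '
        $1 == "scrub:" {
            if ($0 ~ /in progress/)    state = "running"
            else if ($0 ~ /completed/) state = "done"
            else                       state = "none"
        }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Usage on a live system (not run here):
#   while [ "$(zpool status pool1 | scrub_state)" = "running" ]; do
#       sleep 5
#   done
```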
# zpool status
pool: pool1
state: ONLINE
scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
# fmadm faulty
Notice the nice integration with FMA. Not only was the new device resilvered, but
the hot spare was detached and the FMA fault was cleared. The fmstat counters still
show that there was a problem and the fault report still exists in the fault log for later
interrogation.
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 171.5 0 0 0 0 32b 0
eft 1 0 0.0 3.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 6 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 0.9 0 0 0 0 0 0
snmp-trapgen 6 0 0.0 6.8 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 294.3 0 0 0 0 0 0
syslog-msgs 6 0 0.0 4.2 0 0 0 0 0 0
zfs-diagnosis 36 1 0.0 51.6 0 0 0 1 0 0
zfs-retire 1 0 0.0 170.0 0 0 0 0 0 0
# fmdump
TIME UUID SUNW-MSG-ID
Feb 16 11:38:16.0976 48935791-ff83-e622-fbe1-d54c20385afc ZFS-8000-GH
Feb 16 11:38:30.8519 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH
Feb 18 09:51:49.3025 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713 ZFS-8000-GH
Feb 18 09:56:24.8029 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
Feb 18 10:23:07.2228 7c04a6f7-d22a-e467-c44d-80810f27b711 ZFS-8000-GH
Feb 18 10:25:14.6429 faca0639-b82b-c8e8-c8d4-fc085bc03caa ZFS-8000-GH
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
Feb 18 11:16:44.2497 06bfe323-2570-46e8-f1a2-e00d8970ed0d ZFS-8000-D3
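A fault log like the one above gets long quickly, so a per-message-ID tally can be handy. Here is a small sketch, assuming the five-field fmdump layout shown above (month, day, time, UUID, SUNW-MSG-ID). The function name fault_counts is hypothetical.

```shell
# Count fault log entries per SUNW-MSG-ID.
# Reads default fmdump(1M) output on stdin; the header line starts with TIME.
fault_counts() {
    awk '$1 != "TIME" && NF >= 5 { n[$5]++ }
         END { for (id in n) print id, n[id] }' | sort
}

# Usage on a live system (not run here):
#   fmdump | fault_counts
```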
# fmdump -V -u 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
TIME UUID SUNW-MSG-ID
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
TIME CLASS ENA
Feb 18 11:07:27.8476 ereport.fs.zfs.vdev.open_failed 0xb22406c635500401
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
code = ZFS-8000-D3
diag-time = 1203354449 236999
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = Dimension XPS
chassis-id = 7XQPV21
server-id = arrakis
(end authority)
mod-name = zfs-diagnosis
mod-version = 1.0
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3a2ca6bebd96cfe3
vdev = 0xedef914b5d9eae8d
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3a2ca6bebd96cfe3
vdev = 0xedef914b5d9eae8d
(end resource)
(end fault-list[0])
fault-status = 0x3
__ttl = 0x1
__tod = 0x47b9bb51 0x1ef7b430
# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset
# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset
Problem 3: Unrecoverable corruption
As those of you who have attended one of my Boot Camps or Solaris Best Practices training classes know,
House is one of my favorite TV shows - the only one that I watch regularly. And this next example would make a perfect episode. Is it likely to happen? No, but it is so cool when it does :-)
Remember our second pool, pool2? It has the same contents as pool1. Now, let's do the unthinkable - let's corrupt both halves of the mirror. Surely data loss will follow, but the fact that Solaris stays up and running and can report what happened is pretty spectacular. But it gets so much better than that.
# dd if=/dev/zero of=/dev/dsk/disk3 bs=8192 count=10000 conv=notrunc
# dd if=/dev/zero of=/dev/dsk/disk4 bs=8192 count=10000 conv=notrunc
# zpool scrub pool2
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 166.0 0 0 0 0 32b 0
eft 1 0 0.0 3.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 6 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 0.9 0 0 0 0 0 0
snmp-trapgen 8 0 0.0 6.3 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 294.3 0 0 0 0 0 0
syslog-msgs 8 0 0.0 3.9 0 0 0 0 0 0
zfs-diagnosis 1032 1028 0.6 39.7 0 0 93 2 15K 13K
zfs-retire 2 0 0.0 158.5 0 0 0 0 0 0
As before, lots of zfs-diagnosis activity. And two hits to zfs-retire. But we
only have one spare - this should be interesting. Let's see what is happening.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH Major
Feb 18 13:18:42 c3889bf1-8551-6956-acd4-914474093cd7
Fault class : fault.fs.zfs.vdev.checksum
Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://sun.com/msg/ZFS-8000-GH for more information.
Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 16 11:38:30 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH Major
Feb 18 09:51:49 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713
Feb 18 10:23:07 7c04a6f7-d22a-e467-c44d-80810f27b711
Feb 18 13:18:42 0a1bf156-6968-4956-d015-cc121a866790
Fault class : fault.fs.zfs.vdev.checksum
Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://sun.com/msg/ZFS-8000-GH for more information.
Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
# zpool status -x
pool: pool2
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:
NAME STATE READ WRITE CKSUM
pool2 DEGRADED 0 0 2.60K
mirror DEGRADED 0 0 2.60K
spare DEGRADED 0 0 2.43K
disk3 DEGRADED 0 0 5.19K too many errors
spare1 DEGRADED 0 0 2.43K too many errors
disk4 DEGRADED 0 0 5.19K too many errors
spares
spare1 INUSE currently in use
errors: 247 data errors, use '-v' for a list
So ZFS tried to bring in a hot spare, but there were insufficient replicas to
be able to reconstruct all of the data. But here is where it gets interesting.
Let's see what zpool status -v says about things.
# zpool status -v
pool: pool1
state: ONLINE
scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 INUSE in use by pool 'pool2'
errors: No known data errors
pool: pool2
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:
NAME STATE READ WRITE CKSUM
pool2 DEGRADED 0 0 2.60K
mirror DEGRADED 0 0 2.60K
spare DEGRADED 0 0 2.43K
disk3 DEGRADED 0 0 5.19K too many errors
spare1 DEGRADED 0 0 2.43K too many errors
disk4 DEGRADED 0 0 5.19K too many errors
spares
spare1 INUSE currently in use
errors: Permanent errors have been detected in the following files:
/pool2/scenic/cider mill crowds.jpg
/pool2/scenic/Cleywindmill.jpg
/pool2/scenic/csg_Landscapes001_GrandTetonNationalPark,Wyoming.jpg
/pool2/scenic/csg_Landscapes002_ElowahFalls,Oregon.jpg
/pool2/scenic/csg_Landscapes003_MonoLake,California.jpg
/pool2/scenic/csg_Landscapes005_TurretArch,Utah.jpg
/pool2/scenic/csg_Landscapes004_Wildflowers_MountRainer,Washington.jpg
/pool2/scenic/csg_Landscapes!idx011.jpg
/pool2/scenic/csg_Landscapes127_GreatSmokeyMountains-NorthCarolina.jpg
/pool2/scenic/csg_Landscapes129_AcadiaNationalPark-Maine.jpg
/pool2/scenic/csg_Landscapes130_GettysburgNationalPark-Pennsylvania.jpg
/pool2/scenic/csg_Landscapes131_DeadHorseMill,CrystalRiver-Colorado.jpg
/pool2/scenic/csg_Landscapes132_GladeCreekGristmill,BabcockStatePark-WestVirginia.jpg
/pool2/scenic/csg_Landscapes133_BlackwaterFallsStatePark-WestVirginia.jpg
/pool2/scenic/csg_Landscapes134_GrandCanyonNationalPark-Arizona.jpg
/pool2/scenic/decisions decisions.jpg
/pool2/scenic/csg_Landscapes135_BigSur-California.jpg
/pool2/scenic/csg_Landscapes151_WataugaCounty-NorthCarolina.jpg
/pool2/scenic/csg_Landscapes150_LakeInTheMedicineBowMountains-Wyoming.jpg
/pool2/scenic/csg_Landscapes152_WinterPassage,PondMountain-Tennessee.jpg
/pool2/scenic/csg_Landscapes154_StormAftermath,OconeeCounty-Georgia.jpg
/pool2/scenic/Brig_Of_Dee.gif
/pool2/scenic/pvnature14.gif
/pool2/scenic/pvnature22.gif
/pool2/scenic/pvnature7.gif
/pool2/scenic/guadalupe.jpg
/pool2/scenic/ernst-tinaja.jpg
/pool2/scenic/pipes.gif
/pool2/scenic/boat.jpg
/pool2/scenic/pvhawaii.gif
/pool2/scenic/cribgoch.jpg
/pool2/scenic/sun1.gif
/pool2/scenic/sun1.jpg
/pool2/scenic/sun2.jpg
/pool2/scenic/andes.jpg
/pool2/scenic/treesky.gif
/pool2/scenic/sailboatm.gif
/pool2/scenic/Arizona1.jpg
/pool2/scenic/Arizona2.jpg
/pool2/scenic/Fence.jpg
/pool2/scenic/Rockwood.jpg
/pool2/scenic/sawtooth.jpg
/pool2/scenic/pvaptr04.gif
/pool2/scenic/pvaptr07.gif
/pool2/scenic/pvaptr11.gif
/pool2/scenic/pvntrr01.jpg
/pool2/scenic/Millport.jpg
/pool2/scenic/bryce2.jpg
/pool2/scenic/bryce3.jpg
/pool2/scenic/monument.jpg
/pool2/scenic/rainier1.gif
/pool2/scenic/arch.gif
/pool2/scenic/pv-anzab.gif
/pool2/scenic/pvnatr15.gif
/pool2/scenic/pvocean3.gif
/pool2/scenic/pvorngwv.gif
/pool2/scenic/pvrmp001.gif
/pool2/scenic/pvscen07.gif
/pool2/scenic/pvsltd04.gif
/pool2/scenic/banhall28600-04.JPG
/pool2/scenic/pvwlnd01.gif
/pool2/scenic/pvnature08.gif
/pool2/scenic/pvnature13.gif
/pool2/scenic/nokomis.jpg
/pool2/scenic/lighthouse1.gif
/pool2/scenic/lush.gif
/pool2/scenic/oldmill.gif
/pool2/scenic/gc1.jpg
/pool2/scenic/gc2.jpg
/pool2/scenic/canoe.gif
/pool2/scenic/Donaldson-River.jpg
/pool2/scenic/beach.gif
/pool2/scenic/janloop.jpg
/pool2/scenic/grobacro.jpg
/pool2/scenic/fnlgld.jpg
/pool2/scenic/bells.gif
/pool2/scenic/Eilean_Donan.gif
/pool2/scenic/Kilchurn_Castle.gif
/pool2/scenic/Plockton.gif
/pool2/scenic/Tantallon_Castle.gif
/pool2/scenic/SouthStockholm.jpg
/pool2/scenic/BlackRock_Cottage.jpg
/pool2/scenic/seward.jpg
/pool2/scenic/canadian_rockies_csg110_EmeraldBay.jpg
/pool2/scenic/canadian_rockies_csg111_RedRockCanyon.jpg
/pool2/scenic/canadian_rockies_csg112_WatertonNationalPark.jpg
/pool2/scenic/canadian_rockies_csg113_WatertonLakes.jpg
/pool2/scenic/canadian_rockies_csg114_PrinceOfWalesHotel.jpg
/pool2/scenic/canadian_rockies_csg116_CameronLake.jpg
/pool2/scenic/Castilla_Spain.jpg
/pool2/scenic/Central-Park-Walk.jpg
/pool2/scenic/CHANNEL.JPG
In my best Hugh Laurie voice trying to sound very Northeastern American, that is so cool! But we're not even
done yet. Let's take this list of files and restore them - in this case, from pool1. Operationally this
would be from a backup tape or nearline backup cache, but for our purposes, the contents in pool1 will
do nicely.
First, let's clear the zpool error counters and return the spare disk. We want to make sure
that our restore works as desired. Oh, and clear the FMA stats while we're at it.
# zpool clear pool2
# zpool detach pool2 spare1
# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset
# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset
Now individually restore the files that have errors in them and check again. You can even export and reimport
the pool and you will find a very nice, happy, and thoroughly error free ZFS pool. Some rather unpleasant gnashing of
zpool status -v output with awk has been omitted for sanity's sake.
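For the curious, here is roughly what that kind of restore could look like. This is a sketch under assumptions, not the omitted original: it assumes the damaged files appear one per line as absolute paths after the "Permanent errors" line, and that a good copy lives at the same relative path under another root (pool1 here). The function names are made up, and the shell features used need ksh or bash rather than the old /bin/sh.

```shell
# Print the file paths listed under "Permanent errors" in
# `zpool status -v` output read from stdin, one path per line.
permanent_error_files() {
    awk '
        /Permanent errors have been detected/ { listing = 1; next }
        $1 == "pool:" { listing = 0 }          # next pool starts, list is over
        listing && /^[ \t]*\// {               # absolute path, maybe indented
            sub(/^[ \t]+/, "")
            print
        }
    '
}

# Copy each listed file back from a known-good tree.
# Arguments: root of the damaged pool, root of the good copy.
# Reads the file list on stdin; file names may contain spaces.
restore_files() {
    bad_root=$1
    good_root=$2
    while IFS= read -r path; do
        rel=${path#"$bad_root"/}               # path relative to the pool root
        cp -p "$good_root/$rel" "$path"
    done
}

# Usage on a live system (not run here):
#   zpool status -v pool2 | permanent_error_files | restore_files /pool2 /pool1
```

Entries that zpool reports as dataset and object numbers rather than paths are skipped by this sketch and would need separate handling.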
# zpool scrub pool2
# zpool status pool2
pool: pool2
state: ONLINE
scrub: scrub completed with 0 errors on Mon Feb 18 14:04:56 2008
config:
NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk3 ONLINE 0 0 0
disk4 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
# zpool export pool2
# zpool import pool2
# dircmp -s /pool1 /pool2
Conclusions and Review
So what have we learned ? ZFS and FMA are two great tastes that taste great together. No, that's chocolate
and peanut butter, but you get the idea. One more great example of Isaac's Multiplicity of Solaris.
That, and I have finally found a good lab exercise for the FMA training materials. Ever since Christine Tran put
the FMA workshop together, we have been looking for some good FMA lab exercises. The materials reference a synthetic
fault generator that is not available in public (for obvious reasons). I haven't explored the FMA test harness
enough to know if there is anything in there that would make a good lab. But this exercise that we have just
explored seems to tie a number of key pieces together.
And of course, one more reason why Roxy says,
"You should run Solaris."

Wednesday October 03, 2007
Live Upgrade from Solaris 10 11/06 to 8/07 without nonglobal zones
Live Upgrade is one of the most useful Solaris features, yet in my travels around the US I still don't see it used as much as I would like. I can think of several reasons for this - not all of them totally valid:
- I tried it once a long time ago and a patch or package that wasn't LU aware messed up my current boot environment. Not valid for Solaris components although we do see the occasional partner product with this problem. The last one I saw was the NVidia driver, and the good folks from NVidia fixed it very quickly once reported.
- The documentation can be a bit intimidating. Valid with a capital V. But Live Upgrade is an amazingly flexible feature, so at some point you do have to describe these capabilities. As a guide through this documentation, several folks have blogged manageable howto guides. You can find mine back in March 2007, although I've recently updated it. And there are other good blogs with plenty of examples. There is a very good Blueprint on Live Upgrade.
- It doesn't work with the Veritas Volume Manager.
- I didn't know about Live Upgrade. Well, you do now. But I have noticed that a lot of the Solaris conversation is focused on new features, like ZFS, Zones, SMF, DTrace, and some of the older features like Flash archives and Live Upgrade don't receive the attention they deserve. The simple fact is that Live Upgrade takes all of the pain out of the patching process, at least once you know what to patch.
And I'm sure there are other reasons, but these are the ones I hear most often.
Let's turn our attention to the topic at hand, upgrading a Solaris 10 11/06 system to 8/07, without zones. This example will be on an x64
system, but the SPARC approach is similar.
If you have read my earlier blog on Live Upgrade, you will recall the process is:
- Read Infodoc 72099 and install any required patches
- Install the LU packages SUNWluu SUNWlur and SUNWlucfg (if present) from the installation media
- lurename(1m) if you want to change the name of your new boot environment
- lumake(1m) or ludelete(1m) + lucreate(1m) to repopulate the target boot environment with the proper software and configuration files
- luupgrade(1m) to upgrade the target boot environment
- luactivate(1m) to activate the new boot environment
- init 0 to perform the file synchronization and conversions, create the new boot archive and update your GRUB menu
So I fire up my web browser and run over to SunSolve to pick up Infodoc 72099 and see a rather large set of patches.
And there are two lists, one for systems with non-global zones and one without. Since we're looking at a system without
non-global zones we will start with the shorter of the two lists (the next article will cover systems with nonglobal zones).
Apparently we need patches
Solaris 10 x86 118816-03 or higher nawk patch
Solaris 10 x86 120901-03 or higher libzonecfg patch
Solaris 10 x86 121334-04 or higher SUNWzoneu required patch
Solaris 10 x86 119255-42 or higher patchadd/patchrm patches
Solaris 10 x86 119318-01 or higher SVr4 Packaging Commands (usr) Patch
Solaris 10 x86 117435-02 or higher biosdev patch for GRUB Boot
Reboot after installation
Solaris 10 x86 120236-01 or higher SUNWluzone required patches
Solaris 10 x86 121429-08 or higher SUNWluzone required patches
Solaris 10 x86 121003-03 or higher pax patch
Solaris 10 x86 123122-02 or higher prodreg patch
Solaris 10 x86 121005-03 sh patch
Solaris 10 x86 119043-10 /usr/sbin/svccfg patch
Solaris 10 x86 121902-02 i.manifest r.manifest class action script patch
Solaris 10 x86 120901-03 libzonecfg patch
Solaris 10 x86 120069-03 telnet security patch
Solaris 10 x86 120070-02 cpio patch
Solaris 10 x86 123333-01 tftp patch
Hmmm, seems like a lot of patches and a required reboot! So I fire up our new friend updatemanager to patch my system.
I see that there is a new updatemanager patch available (121119-13), so I installed that one all by itself
and restarted updatemanager.
I soon realize that my choice of patching tools is making this a bit challenging. Users of patch tools such
as Patch Check Advanced (PCA) may have an easier time, but I
was determined to do this with updatemanager, with occasional help from the patch READMEs in SunSolve.
The list of patches required for this upgrade applies to any release of Solaris 10. A fresh install of a Solaris 10 11/06 system only needed the following four patches - which is a lot better than I first thought.
119255-42
121429-08
126539-01 as it replaces the required 121902-02
125419-01 as it replaces the required 120069-03
The difficulty with updatemanager was the set of obsoleted patches. Working out that something like the required
121902-02 had been obsoleted by 126539-01, which was installed, took a bit of manual trolling through patch READMEs.
So I'll save you the research - it came down to only the four patches above.
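The obsoleted-patch detective work can be partly automated. What follows is a sketch under assumptions: it assumes that showrev -p prints one line per installed patch of the form "Patch: ID Obsoletes: ... Requires: ... Incompatibles: ... Packages: ...", and that a required patch counts as satisfied when the same patch is installed at an equal or higher revision or when an installed patch lists it under Obsoletes. The function name patch_satisfied is made up for this example.

```shell
# Report whether a required patch (for example 121902-02) is covered by the
# installed patches. Reads `showrev -p` style lines on stdin.
patch_satisfied() {
    awk -v want="$1" '
        BEGIN { split(want, w, "-"); result = want " missing" }
        $1 == "Patch:" {
            # Same patch ID at an equal or higher revision.
            split($2, p, "-")
            if (p[1] == w[1] && p[2] + 0 >= w[2] + 0)
                result = want " satisfied by " $2
            # Otherwise look through the Obsoletes: section of this line.
            in_obs = 0
            for (i = 3; i <= NF; i++) {
                if ($i == "Obsoletes:") { in_obs = 1; continue }
                if ($i ~ /:$/)          { in_obs = 0; continue }
                if (!in_obs) continue
                id = $i
                sub(/,$/, "", id)              # entries are comma separated
                split(id, o, "-")
                if (o[1] == w[1] && o[2] + 0 >= w[2] + 0)
                    result = want " obsoleted by " $2
            }
        }
        END { print result }
    '
}

# Usage on a live system (not run here):
#   showrev -p | patch_satisfied 121902-02
```

Treat the answer as a hint and confirm against the patch README before skipping anything on the required list.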
One important note: the required reboot after patch 117435-02 wasn't needed after all - so I'll try to save
all of you Solaris 10 11/06 users one reboot. While I have your attention, it is a good idea, if not a best practice, to
install the patch utility and packaging patches separately from everything else.
Feeling a lot better about this process, I proceed and install the four required patches using updatemanager in two steps
(119255-42 and then the other three patches) and all succeeded, as expected. All that was left to do was finish the
standard procedure
# mount -o ro -F hsfs `lofiadm -a /export/iso/s10u4/solarisdvd.iso` /mnt
# pkgadd -d /mnt/Solaris_10/Product SUNWlur SUNWluu SUNWlucfg
# lurename -e nv71 -n s10u4
# lumake -n s10u4
# luupgrade -u -s /mnt -n s10u4
# luactivate s10u4
# init 0
And all went as expected. Next time I will tackle the longer list of patches and examine the same upgrade path, but with nonglobal zones.

Monday March 26, 2007
Securing MySQL using SMF - the Ultimate Manifest
The best way to learn the Solaris Service Management Facility (SMF) is to migrate a legacy
service. The version of MySQL that comes with Solaris is an ideal application. It is
relatively simple, has few dependencies, and can be done in just a few quick edits of
an existing manifest (utmp would be a good starting template). We cover the basic process
in the SMF Deep Dive and various people have contributed manifests to OpenSolaris and Blastwave.
While these are good illustrations of how easy the process is, few show what SMF can really do. The motivation for this how-to came from a
recent Solaris Bootcamp attendee who asked "what was wrong with the RC scripts the way they were ?".
Without skipping a beat.....
- Easy support of multiple service instances
- Deterministic location of service log files
- Timeouts on the start and stop methods to prevent system boots from hanging indefinitely
- Quickly observable service state
- Flexible service dependencies
- Automatic restarting of the service upon failure
Upon closer inspection, recognizing when the service terminated and
restarting it automatically isn't that special for mysql. The mysqld_safe daemon actually performs that step, restarting
the database server if it fails. Yes, this is unique to mysql and may not exist for other services. Certainly, if
the mysqld_safe parent actually fails, SMF does provide an additional capability by automatically restarting it. But we need more.
Most of the service migration demonstrations are single instance with no downstream application dependencies - so we still need more.
The mysql service start script runs through a set of configuration files, setting variables and starting a
detached daemon, so it's highly unlikely that it will ever get stuck. Sure, it can get hacked and have bad things happen to it, but as delivered it is relatively safe. So we still need more.
The answer to the question lies in security.
SMF provides a rich set of security features that demonstrate the power of Solaris Role Based Access
Control (RBAC) and least privilege. Contrary to what you might think, these features are quite
easy to use - once you learn a few simple concepts. This is how we will answer the question
"what was wrong with the RC scripts the way they were?".
Authorizations
One of the most useful applications of RBAC is to create administration and operations roles.
While the details of these roles will vary from customer to customer, the common theme is
that operator roles should be able to start and stop a service in a safe manner and an
administrative role should be able to modify service properties (some of which may
control whether the service is started or stopped).
Historically this has been accomplished by third party security software inserting itself
all over the kernel (sometimes in a manner that makes upgrades or maintenance difficult) or custom
scripts that make use of setuid(2). Solaris 10 can perform many of these functions with just a
few entries to some configuration files, and SMF makes this process extremely easy.
You can get lots of valuable information on Solaris Security features (roles, profiles, auths,
privileges) at the OpenSolaris Security Community. As you navigate the wealth of white papers,
ARC cases, and how-to examples, think of Solaris authorizations as the magic that makes this
possible (or more precisely simple).
In a sentence, auths are labels that a privileged application uses to restrict access to its features.
In our case the privileged applications are svcadm(1M) and svccfg(1M). If you read the
smf_security(5) man page (which is excellent reading) you will see that SMF provides several authorizations.
- solaris.smf.manage - ability to start and stop any SMF managed service (good - but not what we are looking for)
- action_authorization (in the general property group) - allows a non-root user to run the methods (start, stop, and refresh)
- value_authorization (any property group) - change properties in the property group (such as general/enabled)
- modify_authorization (any property group) - add or delete properties in the property group
Now this is getting interesting. So it appears that we can use either the action or modify authorization
for the operator role. So which one do we use ?
The action_authorization would only allow running the method but not modifying any of the properties. The
implication is that you can do
# svcadm enable -t mysql
but not
# svcadm enable mysql
The difference between the two commands is that enable without -t will try to set the property general/enabled to true in addition to running the start method. This would require the value_authorization. But value_authorization will allow you to change (almost) any property in the
property group (in this case the general property group), so let's see what else value_authorization will
let you do.
# svcprop -p general ssh
general/enabled boolean true
general/action_authorization astring solaris.smf.manage.ssh
general/entity_stability astring Unstable
general/single_instance boolean true
Hmmm, the only properties that might be abused would be the authorizations, but those require additional authorizations (solaris.smf.modify) to change. So it would seem that value_authorization would be safe for an operator role - unorthodox perhaps, but safe.
modify_authorization would allow the creation of other service properties, and if limited to the general
property group might be confusing, but relatively harmless - unless of course we add a new general property later.
For this reason, modify_authorization would not be a good candidate for an operator role.
So which authorization to use ? Use action_authorization if you want a user (or role) to be able to start and
stop the service, but not make the change permanent. This is the most common case. Use value_authorization in the general property group if you want that user or role to be able to permanently turn a service on or off - this
is generally an administrative role.
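To see which of these two authorizations a service actually carries, the svcprop output can be boiled down to just the properties that matter here. This is a minimal sketch, assuming the three-column layout (property, type, value) shown in the svcprop -p general ssh example above. The function name service_auths is made up for this example.

```shell
# Summarize the action and value authorizations of a service.
# Reads `svcprop -p general <service>` output on stdin.
service_auths() {
    awk '
        $1 == "general/action_authorization" { action = $3 }
        $1 == "general/value_authorization"  { value  = $3 }
        END {
            printf "action_authorization=%s\n", (action == "" ? "(not set)" : action)
            printf "value_authorization=%s\n",  (value  == "" ? "(not set)" : value)
        }
    '
}

# Usage on a live system (not run here):
#   svcprop -p general mysql | service_auths
```

If value_authorization comes back as not set, a user holding only the action authorization can run svcadm enable -t but should expect a permanent svcadm enable to be refused.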
Let's put this all together.
Start with your existing SMF manifest for MySQL. If you don't have one, you can use mine at
http://blogs.sun.com/resources/bobn/mysql.xml or Keith Lawson's contributed MySQL manifest at the
OpenSolaris SMF Contributed Manifests and Methods page.
Add the following section
<property_group name='general' type='framework'>
<propval name='action_authorization' type='astring' value='mysql.operator' />
<propval name='value_authorization' type='astring' value='mysql.administrator' />
</property_group>
Import the new manifest by the method of your choice (svccfg import, /lib/svc/method/manifest-import, or reboot)
and your new MySQL service can be managed by auths. So how do we get those auths assigned to users (or roles) ?
Authorizations are granted to users and roles by the configuration file /etc/user_attr. You can read the
user_attr(4) man page for all of the details, but the process is to add auths=mysql.operator to the user
or role entry. For example
# grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator
It is possible that a user or role may not be present in /etc/user_attr. In that case just add
a line like the one above and assign the appropriate auth.
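If you would rather script that edit than do it by hand, here is a minimal sketch. The add_auth function is my own hypothetical helper, not a Solaris command. It assumes a POSIX awk (nawk on Solaris 10) and a POSIX shell such as ksh or bash, and it works on a scratch copy rather than the live /etc/user_attr. It also makes no attempt to detect an auth that is already present.

```shell
# add_auth is a hypothetical helper; it reads a user_attr style file and
# prints the updated contents on stdout. It never writes to the input file.
# Usage: add_auth USER AUTH FILE
add_auth() {
    awk -F: -v user="$1" -v auth="$2" '
        BEGIN { OFS = ":" }
        $1 == user {
            found = 1
            if ($5 ~ /(^|;)auths=/) {
                # The entry already has an auths key: append to its value.
                n = split($5, kv, ";")
                $5 = ""
                for (i = 1; i <= n; i++) {
                    if (kv[i] ~ /^auths=/) kv[i] = kv[i] "," auth
                    $5 = $5 (i > 1 ? ";" : "") kv[i]
                }
            } else if ($5 == "") {
                $5 = "type=normal;auths=" auth
            } else {
                $5 = $5 ";auths=" auth
            }
        }
        { print }
        END {
            # No entry for this user: add a line like the joeuser example.
            if (!found) print user "::::type=normal;auths=" auth
        }
    ' "$3"
}

# Scratch file standing in for /etc/user_attr.
scratch=${TMPDIR:-/tmp}/user_attr.$$
printf '%s\n' 'root::::auths=solaris.*;profiles=All' \
              'joeuser::::type=normal;auths=mysql.operator' > "$scratch"

updated=$(add_auth joeuser mysql.administrator "$scratch")
echo "$updated"
rm -f "$scratch"
```

Review the output before copying it over the real file, and keep a backup of the original.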
Let's see all of this in action.
% auths
mysql.operator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable
% svcadm enable -t mysql
% svcs mysql
STATE STIME FMRI
online 15:51:02 svc:/application/mysql:default
So far so good.
% svcadm enable mysql
svcadm: svc:/application/mysql:default: Permission denied.
Why did this fail ?
% svcprop -p general mysql
general/enabled boolean true
general/action_authorization astring mysql.operator
general/entity_stability astring Unstable
general/single_instance boolean true
general/value_authorization astring mysql.administrator
Because enable also tries to set the general/enabled property - and that requires value or modify authorization. Change my user definition in /etc/user_attr
% grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator,mysql.administrator
% auths
mysql.operator,mysql.administrator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable
% svcadm enable mysql
% svcs mysql
STATE STIME FMRI
online 16:10:37 svc:/application/mysql:default
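That auths output is a lot to read by eye. Here is a small sketch that checks whether a particular auth is covered by such a list. The has_auth function is a hypothetical helper of mine, not part of Solaris, and it only understands exact matches and trailing wildcards such as solaris.snmp.*. It assumes a POSIX shell such as ksh or bash.

```shell
# has_auth is a hypothetical helper, not a Solaris command.
# Usage: has_auth WANTED "comma,separated,auth,list"
# Returns 0 if WANTED appears in the list, either exactly or through a
# trailing-wildcard entry such as solaris.snmp.*
has_auth() {
    wanted=$1
    old_ifs=$IFS
    IFS=,
    set -f                          # keep entries like solaris.snmp.* from globbing
    for granted in $2; do
        case $granted in
            *.\*)
                prefix=${granted%\*}    # "solaris.snmp.*" becomes "solaris.snmp."
                case $wanted in
                    "$prefix"*) IFS=$old_ifs; set +f; return 0 ;;
                esac
                ;;
            "$wanted")
                IFS=$old_ifs; set +f; return 0 ;;
        esac
    done
    IFS=$old_ifs
    set +f
    return 1
}

# Sample list trimmed from the first auths output shown above.
list="mysql.operator,solaris.smf.manage.bind,solaris.snmp.*"

for auth in mysql.operator mysql.administrator solaris.snmp.read; do
    if has_auth "$auth" "$list"; then
        echo "$auth: granted"
    else
        echo "$auth: missing"
    fi
done
```

On a live system the second argument would be the output of the auths command, for example has_auth mysql.administrator "$(auths)".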
This is all very cool - but we can still do more.
Removing Root from the Equation
For both simplicity and compatibility with other operating systems, the MySQL service is started by
a script that is run as root. This script is generally linked into /etc/rc3.d, but since we have converted
it to an SMF service we have many more options. We have already looked at delegated administration using
auths, time to turn our attention to privileges.
# /etc/sfw/mysql/mysql.server start
# ps -ef | grep mysqld | grep -v grep
mysql 1975 1955 0 21:43:17 pts/8 0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --user=mysql --pid
root 1955 1 0 21:43:17 pts/8 0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# /etc/sfw/mysql/mysql.server stop
This suggests two immediate questions. Does the parent mysqld_safe really have to run as root, or can it be
started as a lesser privileged user ? If it can run as a non-root user, exactly what privileges are required to run mysql ?
The answer to the first question is simple: it can be run as a regular user. It only runs as root
out of convenience to operating systems that don't have as sophisticated a security framework as Solaris.
# su - mysql
Sun Microsystems Inc. SunOS 5.11 snv_57 October 2007
$ sh /etc/sfw/mysql/mysql.server start
$ /usr/sfw/bin/mysqladmin status
Uptime: 1174 Threads: 1 Questions: 1 Slow queries: 0 Opens: 6 Flush tables: 1 Open tables: 0 Queries per second avg: 0.001
$ sh /etc/sfw/mysql/mysql.server stop
Killing mysqld with pid 1975
Wait for mysqld to exit done
$ exit
#
Now that we have established the fact that a fully privileged user isn't required to run MySQL, what privileges
are really required ? How far can we restrict the mysql user ? Glenn Brunette's privilege debugger privdebug.pl is the perfect tool to help us answer this question.
# privdebug.pl -f -v -e "su - mysql /usr/sfw/sbin/mysqld_safe --user=mysql"
STAT TIMESTAMP PPID PID PRIV CMD
USED 2005619300419 2211 2212 proc_taskid su
USED 2005620883559 2211 2212 proc_setid su
USED 2005621147993 2211 2212 proc_setid su
USED 2005621161490 2211 2212 proc_setid su
USED 2005621165094 2211 2212 proc_setid su
USED 2005630560973 2211 2212 proc_exec su
Starting mysqld daemon with databases from /var/mysql contract_event
USED 2005679230394 2211 2212 proc_fork sh
USED 2005750348321 2211 2212 proc_fork sh
USED 2005751386190 2212 2214 proc_exec sh
USED 2005756249415 2211 2212 proc_fork sh
USED 2005757238096 2212 2215 proc_fork sh
USED 2005758495289 2212 2215 proc_exec sh
USED 2005761778059 2211 2212 proc_fork sh
USED 2005762623018 2212 2217 proc_fork sh
USED 2005763874569 2212 2217 proc_exec sh
USED 2005767441408 2211 2212 proc_fork sh
USED 2005768337263 2212 2219 proc_exec sh
USED 2005772916576 2211 2212 proc_fork sh
USED 2005773996432 2212 2220 proc_fork sh
USED 2005775465400 2212 2220 proc_exec sh
USED 2005778750305 2211 2212 proc_fork sh
USED 2005779846375 2212 2222 proc_exec sh
USED 2005782042348 2211 2212 proc_fork sh
USED 2005783110622 2212 2223 proc_exec sh
USED 2005785636236 2211 2212 proc_fork sh
USED 2005786824801 2212 2224 proc_exec sh
USED 2005788593079 2212 2224 proc_exec nohup
USED 2005790693138 2212 2224 proc_exec nohup
USED 2005792812264 2211 2212 proc_fork sh
USED 2005794010658 2212 2225 proc_exec sh
USED 2005795756145 2212 2225 proc_exec nohup
USED 2005797704273 2212 2225 proc_exec nohup
NEED 2005799674735 2211 2212 file_dac_write sh
USED 2005800708905 2211 2212 proc_fork sh
USED 2005801869396 2212 2226 proc_exec sh
USED 2005804780370 2211 2212 proc_fork sh
USED 2005805854317 2212 2227 proc_exec sh
USED 2005807860051 2211 2212 proc_fork sh
USED 2005808907677 2212 2228 proc_exec sh
USED 2005811293197 2211 2212 proc_fork sh
USED 2005812393916 2212 2229 proc_exec sh
USED 2005814589669 2212 2229 proc_exec nohup
USED 2005816674186 2212 2229 proc_exec nohup
STOPPING server from pid file /var/mysql/pandora.pid contract_event
070325 22 11 mysqld ended 18 contract_event
Ignore the proc_taskid and proc_setid, they are artifacts of using su(1M) to run the database server as user
mysql. We see that mysqld only needs proc_fork and proc_exec. The file_dac_write failure comes from a call to
access(2) and is not needed for proper operation.
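Scanning a long trace like that by eye gets old quickly. Here is a rough sketch that boils it down to one line per status and privilege. The summarize_privs function is a hypothetical helper, and it assumes the column order in the header above (STAT TIMESTAMP PPID PID PRIV CMD) and a POSIX awk (nawk on Solaris 10). The sample input is a handful of lines copied from the trace, so the counts are for the sample only.

```shell
# Summarize privdebug.pl output: one line per (STAT, PRIV) pair with a count.
# Lines that do not start with USED or NEED (the header and any daemon
# messages mixed into the trace) are ignored.
summarize_privs() {
    awk '($1 == "USED" || $1 == "NEED") && NF >= 6 { count[$1 " " $5]++ }
         END { for (k in count) print k, count[k] }' | sort
}

summary=$(summarize_privs <<'EOF'
STAT TIMESTAMP PPID PID PRIV CMD
USED 2005619300419 2211 2212 proc_taskid su
USED 2005620883559 2211 2212 proc_setid su
USED 2005630560973 2211 2212 proc_exec su
Starting mysqld daemon with databases from /var/mysql contract_event
USED 2005679230394 2211 2212 proc_fork sh
USED 2005751386190 2212 2214 proc_exec sh
NEED 2005799674735 2211 2212 file_dac_write sh
USED 2005814589669 2212 2229 proc_exec nohup
EOF
)
echo "$summary"
```

To use it for real, save the privdebug.pl output to a file and feed that file to summarize_privs on standard input.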
What do we do with what we have just learned ?
Referring to the smf_method(5) man page (another excellent read), it seems that all we need to do is add a
method_credential option to the various methods (start, stop, and refresh). The appropriate section of my new and improved MySQL manifest now looks like
<exec_method type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='60'>
<method_context>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='120'>
<method_context>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart' timeout_seconds='120'>
<method_context>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
So we quickly modify our manifest and import it using one of the standard methods (svccfg import, /lib/svc/method/manifest-import, or a reboot) and we should be done, right ? Well...... not exactly - but we're close.
% svcadm enable mysql
% svcs mysql
STATE STIME FMRI
maintenance 21:53:37 svc:/application/mysql:default
$ tail -5 `svcprop -p restarter/logfile mysql`
[ Mar 26 21:51:12 Method "stop" exited with status 0 ]
[ Mar 26 21:53:36 Enabled. ]
[ Mar 26 21:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
svc.startd could not set context for method: chdir: No such file or directory
[ Mar 26 21:53:37 Method "start" exited with status 96 ]
Doh! When we followed the MySQL installation instructions at /etc/sfw/mysql/README.solaris.mysql
we created a user account called mysql. But we didn't specify a home directory, did we ? No - so
the default template value of /home/mysql was used. But there is no /home/mysql, is there ? Well, no.
How do we fix this ?
Set a reasonable home directory for the mysql user. How about /var/mysql ?
Elsewhere in the installation instructions we did set ownership and proper permissions to this
directory - so that would seem like a reasonable home directory.
As root
# usermod -d /var/mysql mysql
That is one solution, but it may not be practical for all cases. Perhaps a better idea would be
to provide a working directory for each of the methods. The benefit is that I could set it differently for each
service instance. This would be done in the method_context tag for the method. So I modify my service
manifest to look like
<exec_method type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='60'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='120'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart' timeout_seconds='120'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
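The three stanzas differ only in the method name, the argument passed to the script, and the timeout, so they are easy to generate. This is just a sketch: emit_method is a hypothetical helper of mine, and the user, group, privileges, and working directory are hard-coded to the values used above. It prints XML fragments only; you still have to paste them into a complete manifest.

```shell
# emit_method is a hypothetical helper that prints one exec_method stanza.
# Usage: emit_method NAME SCRIPT_ARGUMENT TIMEOUT_SECONDS
emit_method() {
    cat <<EOF
<exec_method type='method' name='$1' exec='/etc/sfw/mysql/mysql.server $2' timeout_seconds='$3'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
EOF
}

# Same three methods, arguments, and timeouts as the manifest fragment above.
stanzas=$(
    emit_method start   %m      60
    emit_method stop    %m      120
    emit_method refresh restart 120
)
echo "$stanzas"
```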
Reimport the manifest and let's see how things go.
# svccfg import /var/svc/manifest/application/mysql.xml
# svcadm clear mysql
# svcs mysql
STATE STIME FMRI
maintenance 22:17:49 svc:/application/mysql:default
Argh - now what ?
# tail -5 `svcprop -p restarter/logfile mysql`
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
[ Mar 26 22:17:49 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
Doh! Since Solaris delivers MySQL as a legacy service the start script doesn't have execute
permissions for the mysql user. That's easy to fix.
# ls -l /etc/sfw/mysql/mysql.server
-rwxr--r-- 1 root sys 5655 Mar 22 17:05 /etc/sfw/mysql/mysql.server
# chown mysql /etc/sfw/mysql/mysql.server
# svcadm clear mysql
# svcs mysql
STATE STIME FMRI
online 22:23:08 svc:/application/mysql:default
Now that's more like it. One last item to check.
# ps -ef | grep mysqld | grep -v grep
mysql 12656 12634 0 22:23:11 ? 0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --pid-file=/var/my
mysql 12634 1 0 22:23:09 ? 0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# ppriv 12634
12634: /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var
flags =
E: basic,!file_link_any,!proc_info,!proc_session
I: basic,!file_link_any,!proc_info,!proc_session
P: basic,!file_link_any,!proc_info,!proc_session
L: all
Now that's what I wanted to see. The parent mysqld_safe is now running as user mysql and with
exactly the right privileges. This is very cool indeed. Armed with this information we could
also create a zone and use the limitpriv attribute to restrict the zone privilege - but we'll leave that
for another day.
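Both trips into maintenance above (the missing working directory and the start script that could not be executed) are things we could have checked before enabling the service. Here is a minimal sketch of such a check. The preflight function is a hypothetical helper; it tests as whoever runs it, so on a real system you would run it as the mysql user. The demo uses a scratch directory instead of the real /var/mysql and /etc/sfw/mysql/mysql.server.

```shell
# preflight is a hypothetical helper. It checks the two conditions that sent
# the service into maintenance: the working directory must exist and the
# method script must be executable by the user running the check.
# Usage: preflight WORKING_DIRECTORY METHOD_SCRIPT
preflight() {
    status=0
    if [ ! -d "$1" ]; then
        echo "working directory $1 does not exist"
        status=1
    fi
    if [ ! -x "$2" ]; then
        echo "method script $2 is not executable by this user"
        status=1
    fi
    return $status
}

# Scratch stand-ins for /var/mysql and the start script.
demo=${TMPDIR:-/tmp}/preflight.$$
mkdir -p "$demo/var/mysql"
printf '#!/bin/sh\nexit 0\n' > "$demo/mysql.server"

chmod 644 "$demo/mysql.server"          # no execute bit: should be reported
if preflight "$demo/var/mysql" "$demo/mysql.server"; then before=0; else before=1; fi

chmod 755 "$demo/mysql.server"          # fixed: should pass quietly
if preflight "$demo/var/mysql" "$demo/mysql.server"; then after=0; else after=1; fi

echo "before=$before after=$after"
rm -rf "$demo"
```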
Conclusion
It is quite easy not only to leverage Solaris authorizations but also to run services with restricted
privileges. We have presented a few templates and a general approach that should make this process less cumbersome.
More important though - we now have a compelling reply when asked "what was wrong with the RC
scripts the way they were?"

Tuesday March 20, 2007
Zones in a Flash - Literally
Fantastic improvements have been made in the Solaris installation and upgrade
process - even more in OpenSolaris (available in the various community releases).
As we examined the cloning feature introduced in Solaris 10 11/06, it became apparent that we
had stumbled upon a most intriguing capability. By combining
zone cloning with the attach/detach capability, we have discovered a model for
flashing zones: zoneflash.
In a recent boot camp we took a look at this in more detail. Unfortunately the
slides (which will be posted soon) didn't quite follow the level of depth we were
exploring. Several people asked for notes on how this works - and here they are.
The irony is that it will take longer to read about it than it does to perform
the actual process - but it is so cool.
The Promise
We start with a fresh Solaris system. In this case it was just live upgraded from media, but it
could have been jumpstarted from media or a flash archive. The key point here is
that the system has had very little done to it, other than naming and some
software installation. Since zone attach makes sure that key system components
(specifically packages and patches) are compatible, it makes sense to build our
flashzones on a system that will look similar to those that will be built in the
future.
So how many zones will we build ? That's a good question. If this were
system flasharchives the answer would be as few as possible - one per architecture
in the most efficient case. But these zoneflashes are different - just applications,
some metadata, and perhaps some customizations (naming, security, SMF). It seems
reasonable to create one zoneflash for each type of application server you
would deploy - think of it as a userspace template. In this example I have chosen four: a blank uncustomized flash
(for building a new zoneflash in a flash), database server (MySQL),
web server (apache2), and the community edition of webmin (just another application).
Our procedure will be to build a minimal default zoneflash, run it through first boot
to populate the SMF repository, and then clone it for the remaining zoneflashes.
Each of these will be booted, customized for the particular application, and tested to
make sure everything is operating properly.
We will then detach the zones and move the detached zoneroots onto some media that can be
transported. Of course, keeping with the theme of zones and flash, the transport
could be the flasharchive itself. How cool would it be to jumpstart a server
using flasharchives and have all the application zones already present in a known
location, such as /zoneflash ? Unfortunately, I'm sitting in seat 18A on an
American Airlines flight to Los Angeles and don't quite have the required infrastructure
to do that sort of test. But I do have a USB stick and multiple boot environments.
That will do nicely.
Once attached, we will clone the zoneflashes as necessary, adding resources (network, local filesystems)
and attributes (resource controls) required for the proper operation of the application. When finished we
will detach the zoneflashes so they may be used elsewhere.
The Turn
The first step is to build and boot a simple generic sparse root zone. Since this zone isn't really meant for
operation, most zonecfg attributes (network configuration, resource limits, et al) will be
skipped. We will add them later when we build the real zones - remember, these are just
user space application templates.
# zonecfg -z default
default: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:default> create
zonecfg:default> set zonepath=/local/default
zonecfg:default> add inherit-pkg-dir; set dir=/opt; end
zonecfg:default> commit
zonecfg:default> exit
#
# zoneadm -z default install
A few minutes later we have an installed zone, ready for first boot. Since I've attended my Solaris Zones
Best Practices class, or at least read the materials, I know how to build a sysidcfg file that will satisfy
the sysidtool first boot service. This will allow the zone to boot up all the way without any additional
console interaction. Let's do that for our new zone.
# cat > /local/default/root/etc/sysidcfg <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxx
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=default}
EOF
(Replace xxxxxxxxxx with your own encrypted string from /etc/shadow - I'm not going to post mine!)
# zoneadm -z default boot
# zlogin -C default
We need to let first boot processing complete. Since we supplied a valid sysidcfg, it is just
a matter of waiting for manifest-import and sysidtool to complete their magic. When complete,
log in and take a look around to make sure all is well. Once satisfied, shut down the zone
(either from inside the zone or from the global zone) - we are through with it for now.
(from the global zone)
# zoneadm -z default halt
Now we are done with this first zone. Time to clone it for our remaining application zones.
Please pardon a bit of inline shell scripting - I hate to type the same thing over and over and over.
Sort of makes for a nice script template, doesn't it ? Not quite the sophistication of Brad Diggs'
zonemanager, but it will do nicely for our example.
# for zone in webmin mysql web
do
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
zoneadm -z ${zone} clone default
echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
zoneadm -z ${zone} boot
done
#
What in the heck was that all about ? OK, one more time - line by line with annotation.
# for zone in webmin mysql web
do
A quick interactive loop for the creation of three application zones. The variable ${zone} will be set to
the name of the zone we are trying to construct.
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
A one liner that creates a new zone configuration based on the already existing
default. At this point the only thing we need to change is the zonepath, and it should be
set to /local/${zone}.
zoneadm -z ${zone} clone default
We recognize this as a zone cloning operation. The zone root is copied and a /reconfigure is
created in the new zone root so that sysidtool performs a complete configuration on first boot.
If you happen to be running on a recent release of OpenSolaris, you can put your zoneroot on
ZFS and the cloning operation will only take a few seconds and very little additional disk space will
be required. Those of us on Solaris 10 11/06 will have to wait for the 160MB or so to be copied.
Still better than the 9 minutes to go through a complete zone installation.
echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
This step creates a custom sysidcfg file for each zone. Remember to supply your own root
password from /etc/shadow in the global zone. This answers all of the sysidtool
questions, including the NFSv4 question.
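Eight echo commands per zone is a lot of repetition. The same file can be written in one shot with a here-document. This is a sketch only: write_sysidcfg is a hypothetical helper, the settings are the ones used above, and the demo writes into a scratch directory that stands in for /local/${zone}/root.

```shell
# write_sysidcfg is a hypothetical helper that writes etc/sysidcfg under a
# zone root using the same answers as the echo commands above.
# Usage: write_sysidcfg ZONEROOT HOSTNAME ENCRYPTED_ROOT_PASSWORD
write_sysidcfg() {
    cat > "$1/etc/sysidcfg" <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=$3
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=$2}
EOF
}

# Scratch directory standing in for /local; the password is a placeholder.
scratch=${TMPDIR:-/tmp}/sysidcfg.$$
for zone in webmin mysql web; do
    mkdir -p "$scratch/$zone/root/etc"
    write_sysidcfg "$scratch/$zone/root" "$zone" "xxxxxxxxxxx"
done

mysql_cfg=$(cat "$scratch/mysql/root/etc/sysidcfg")
echo "$mysql_cfg"
rm -rf "$scratch"
```

Inside the cloning loop the call would become write_sysidcfg /local/${zone}/root ${zone} followed by your own encrypted password string.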
zoneadm -z ${zone} boot
Boot the zone. If we have done everything correctly, the next interaction will be
with console login.
done
Close the for loop in the interactive script. This process will take a few minutes
on Solaris 10 11/06, or if we are being clever with OpenSolaris and ZFS - a few seconds.
Now for the hard part - customizing the individual application zones. Well, it's not
all that difficult. And if you do this regularly, you probably have scripts to do
most of the work. It's just individual application installation and customization.
Here is what I did for my example zones.
MySQL
The installation instructions for the Solaris 10 MySQL can be found in /etc/sfw/mysql/README.solaris.mysql.
There is a typo in the Solaris 10 version of the README. It will cause a lot of grief if you cut and
paste without looking at the results. Fortunately it has been corrected in nevada (aka OpenSolaris
Community Edition).
Boot the mysql zone and log in as root.
# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql (this line is incorrect in the Solaris 10 README - chmod works better with two arguments)
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf
The installation instructions continue by linking the start script into /etc/rc3.d. Since we are big SMF fans in these parts, let's do that instead. Feel free to use
my MySQL manifest as it contains a couple of
cool features (value and action authorizations - more on that later).
Since the mysql zone doesn't have any networking configured, perform this next step from the global zone.
If you already have a suitable manifest, or have stashed mine away somewhere in the global zone you can
use that instead.
# cd /local/mysql/root/var/svc/manifest/application
# wget http://blogs.sun.com/bobn/resource/mysql.xml
It's probably a good idea to make sure that all of this is working properly. Either reboot the
mysql zone, run the manifest-import service manually, or run svccfg import on the new manifest.
Your choice. What you should see upon completion is
# svcs mysql
STATE STIME FMRI
online 14:41:19 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 459 Threads: 1 Questions: 2 Slow queries: 0 Opens: 6 Flush tables: 1 Open tables: 0 Queries per second avg: 0.004
We're done for now. Unless of course you want to go for some extra credit. In that case
- Set up a web server with PHP support. Apache 1 plus the SFWmphp package from the Solaris Companion
will do just fine.
- Download and unpack phpMyAdmin in the webserver htdocs directory.
- Create a user with the mysql.operator authorization
- Create a user with the mysql.administrator authorization
Shut down the mysql zone.
Web
This is about as easy as it gets. Boot the web zone and perform the following steps.
# cp /etc/apache2/httpd.conf-example /etc/apache2/httpd.conf
# svcadm enable apache2
A quick check to make sure all is well.
# svcs apache2
STATE STIME FMRI
online 17:17:41 svc:/network/http:apache2
# telnet localhost 80
Trying ::1...
telnet: connect to address ::1: Network is unreachable
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>
Connection to localhost closed by foreign host.
We're done for now. Shut down the web zone.
Webmin
This one is a little more complicated. We did this one last time when we looked at zone cloning, but it is worth a second look.
Our task here is to replace the Solaris webmin with the latest download from
http://webmin.com. The technique we are using will allow us to install a custom version of
an application into a sparse root zone. Specifically, webmin.com's package installs into /opt/webmin, but
/opt is a read-only inherit-pkg-dir. The easiest solution for this would be the creation of a symbolic
link in the global zone /opt to point to a location that can be safely written by each non-global zone.
In my example that would be /local-pkgs.
In the global zone, create the link in /opt, create the local package directory in the webmin zoneroot, and
download the latest webmin package.
# ln -s ../local-pkgs/webmin /opt/webmin
# mkdir -p /local/webmin/root/local-pkgs/webmin
# cd /local/webmin/root/var/tmp
# wget http://prdownloads.sourceforge.net/webadmin/webmin-1.330.pkg.gz
# gunzip webmin-1.330.pkg.gz
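Why does a relative link do the trick ? Because ../local-pkgs/webmin is resolved from wherever /opt happens to live, so the same link lands in each zone's own /local-pkgs. Here is a toy model of that using two scratch directories as stand-ins for zone roots. It is only an illustration under that assumption: nothing here touches the real /opt or any zone, and the names zoneA, zoneB, and marker.txt are made up.

```shell
# Scratch directories stand in for two zone roots.
scratch=${TMPDIR:-/tmp}/optlink.$$
for zone in zoneA zoneB; do
    mkdir -p "$scratch/$zone/opt" "$scratch/$zone/local-pkgs/webmin"
    # Every zone sees the identical relative link under its /opt.
    ln -s ../local-pkgs/webmin "$scratch/$zone/opt/webmin"
done

# Writing through the link in one "zone" lands in that zone's own directory.
echo "written through /opt/webmin" > "$scratch/zoneA/opt/webmin/marker.txt"

in_a=$(ls "$scratch/zoneA/local-pkgs/webmin")
in_b=$(ls "$scratch/zoneB/local-pkgs/webmin")
echo "zoneA: ${in_a:-nothing}"
echo "zoneB: ${in_b:-nothing}"
rm -rf "$scratch"
```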
Now boot the webmin zone and log in as root.
# zoneadm -z webmin boot
# zlogin webmin
Remove the Solaris webmin packages (SUNWwebminu SUNWwebminr). The usr package needs to be
removed twice - the first pkgrm will leave it as a partially installed package, the second
will completely remove it - at least as far as our zone (and future patching) is concerned.
Once removed, install the webmin.com version, which should be conveniently located in /var/tmp.
# pkgrm SUNWwebminu SUNWwebminr SUNWwebminu
# pkgadd -d /var/tmp/webmin-1.330.pkg
We are done with this zone. Shut it down.
Detach
We have just built four zones: an empty zone suitable for future customizations, one with the Solaris webmin
replaced by the community edition, one with a working MySQL database, and one with a webserver. The last
task to be performed on these zones in their current state is to be detached, another new feature in Solaris 10 11/06.
Zone detach will copy the zone configuration into the zoneroot (to be used with a subsequent zone attach)
and set the current zone state to configured. You can even delete the zone configurations as a final
cleanup prior to building a flash archive.
# zoneadm -z default detach
# zoneadm -z webmin detach
# zoneadm -z mysql detach
# zoneadm -z web detach
# zonecfg -z default delete -F
# zonecfg -z webmin delete -F
# zonecfg -z mysql delete -F
# zonecfg -z web delete -F
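Deleting zone configurations is not something to get wrong, so it can be nice to see the plan before running it. Here is a sketch of the same eight commands behind a dry-run switch. Both run and detach_all are hypothetical helpers of mine, and DRY_RUN is just a variable name I picked. With DRY_RUN set nothing is executed, the commands are only printed.

```shell
# run is a hypothetical wrapper: with DRY_RUN set it prints the command
# instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Detach each zone, then remove its configuration from this system.
detach_all() {
    for zone in "$@"; do
        run zoneadm -z "$zone" detach
        run zonecfg -z "$zone" delete -F
    done
}

DRY_RUN=1
plan=$(detach_all default webmin mysql web)
echo "$plan"
```

Unset DRY_RUN (as root, in the global zone) to actually run the commands.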
And flash
Unless the person in 18B wants to be a jumpstart server, we will have to simulate the jumpstart/flasharchive
process. We can do this by booting into an alternate boot environment and then
delivering the detached zoneroots by some sort of shared or removable storage - something like a USB memory stick.
When we are done with this exercise, our zoneflashes will still be on the memory device, ready for their next use. Since the zones will never be booted, just cloned, the speed of the memory device really isn't important.
We need to prepare the USB memory stick (currently formatted as FAT16). We will use rmformat -l to
locate the device, fdisk to put a proper label on it, and finally newfs to install a proper file system.
ZFS would be interesting, but it would just get in our way later.
# rmformat -l
Looking for devices...
1. Logical Node: /dev/rdsk/c2t0d0p0
Physical Node: /pci@0,0/pci1179,1@1d,7/storage@4/disk@0,0
Connected Device: USB DISK 2.0 PMAP
Device Type: Removable
Bus: USB
Size: 984.0 MB
Label:
Access permissions:
2. Logical Node: /dev/rdsk/c1t0d0p0
Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
Connected Device: TEAC DW-224E-A 7.2A
Device Type: CD Reader
Bus: IDE
Size:
Label:
Access permissions:
# fdisk /dev/rdsk/c2t0d0p0
3 (to delete the existing partition)
1 (to create a new Solaris partition)
5 (to exit and write the new label)
# newfs /dev/rdsk/c2t0d0s2
newfs: construct a new file system /dev/rdsk/c2t0d0s2: (y/n)? y
/dev/rdsk/c2t0d0s2: 2009088 sectors in 981 cylinders of 64 tracks, 32 sectors
981.0MB in 62 cyl groups (16 c/g, 16.00MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
1705632, 1738432, 1771232, 1804032, 1836832, 1869632, 1902432, 1935232,
1968032, 2000832
# mkdir /tmp/flash
# mount /dev/dsk/c2t0d0s2 /tmp/flash
# cd /local
# find default webmin web mysql -print | cpio -pdum /tmp/flash
# umount /tmp/flash
We are now done with the original system. At this point we would create a flasharchive (with the detached
zoneroots in a convenient place in the archive).
The Prestige
The final act in our magic trick is the delivery. Specifically the transport, reattachment, and subsequent cloning of the zoneflashes on a new system. 18B is now asleep and I really don't want to disturb him, so I'll do this part myself. I'll boot my
laptop into another boot environment - built from the same media using the same Live Upgrade method as the boot environment
that created the zones.
We begin by mounting the removable media (USB memory stick) that contains the zoneflash. Do take a look around, it is quite likely that our friend volfs has already done this for us. Remember - if we were using a flasharchive to deliver the zoneflash this step would be unnecessary.
# mkdir /flash
# mount /dev/dsk/c2t0d0s2 /flash (we used rmformat -l to derive the device name)
Now that our zoneflashes have arrived, time to reattach them. The first step is to create zone configurations. If you recall, these were stored in the zoneroot when they were detached. The zonecfg command create -a is used to
retrieve the stored configuration information and adapt it to the new system - specifically the new location
of the zoneroot. Once configured we use zoneadm attach to reconnect them.
The sequence to reattach our default zone, now called flashdefault, would look something like this.
# zonecfg -z flashdefault
flashdefault: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:flashdefault> create -a /flash/default
zonecfg:flashdefault> commit
zonecfg:flashdefault> exit
# zoneadm -z flashdefault attach
We'll be a little more clever attaching the other three zones.
# for zone in webmin web mysql
do
echo "create -a /flash/${zone}" | zonecfg -z flash${zone}
zoneadm -z flash${zone} attach
done
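Before running that loop it is worth confirming that each zoneroot really did arrive with its stored configuration. The sketch below assumes that zoneadm detach leaves a file named SUNWdetached.xml at the top of the zonepath; check your own media if in doubt. The ready_to_attach function is a hypothetical helper, and the demo uses a scratch directory in place of /flash with the mysql zone deliberately left incomplete.

```shell
# ready_to_attach is a hypothetical helper. ASSUMPTION: a detached zone has a
# SUNWdetached.xml file at the top of its zonepath.
# Usage: ready_to_attach BASE_DIRECTORY ZONE...
ready_to_attach() {
    base=$1
    shift
    missing=0
    for zone in "$@"; do
        if [ -f "$base/$zone/SUNWdetached.xml" ]; then
            echo "$zone: ok"
        else
            echo "$zone: no detached configuration under $base/$zone"
            missing=1
        fi
    done
    return $missing
}

# Scratch directory standing in for /flash.
scratch=${TMPDIR:-/tmp}/flash.$$
mkdir -p "$scratch/webmin" "$scratch/web" "$scratch/mysql"
touch "$scratch/webmin/SUNWdetached.xml" "$scratch/web/SUNWdetached.xml"

if report=$(ready_to_attach "$scratch" webmin web mysql); then rc=0; else rc=1; fi
echo "$report"
echo "exit status: $rc"
rm -rf "$scratch"
```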
At this point our zoneroots are still on the USB memory device - but don't worry, these zones will never be booted. Their only purpose is to deliver preconfigured zones. We will use zone cloning to create our real application zones.
Which we will now do. It is very convenient to use the flashzone as a template for our new zone in case there were some special attributes like limitpriv that we might want to preserve. We will also need to add items that were not present in the zoneflashes - specifically networking and local file systems. Once we are satisfied with the zone configurations we
will clone the zoneflash. If we are only building one of each type of zone we can detach the zoneflash so that other
administrators can use it on their systems.
Let's do this for the mysql zone.
# zonecfg -z mysql
mysql: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:mysql> create -t flashmysql
zonecfg:mysql> set zonepath=/zones/mysql
zonecfg:mysql> add net; set physical=e1000g0; set address=192.168.100.102/24; end
zonecfg:mysql> add fs; set dir=/export; set special=/export; set options=[rw,nosuid,nodevices]; set type=lofs; end
zonecfg:mysql> commit
zonecfg:mysql> exit
# zoneadm -z mysql clone flashmysql
Copying /flash/mysql...
# zoneadm -z flashmysql detach
# echo "name_service=NONE" > /zones/mysql/root/etc/sysidcfg
# echo "nfs4_domain=dynamic" >> /zones/mysql/root/etc/sysidcfg
# echo "security_policy=NONE" >> /zones/mysql/root/etc/sysidcfg
# echo "root_password=xxxxxxxxxxx" >> /zones/mysql/root/etc/sysidcfg
# echo "system_locale=C" >> /zones/mysql/root/etc/sysidcfg
# echo "network_interface=NONE {hostname=mysql}" >> /zones/mysql/root/etc/sysidcfg
# echo "terminal=ansi" >> /zones/mysql/root/etc/sysidcfg
# echo "timezone=US/Central" >> /zones/mysql/root/etc/sysidcfg
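The echo sequence above can be folded into a small function so that the same file is written for each cloned zone. This is only a sketch: write_sysidcfg is a made-up helper name, the keywords and values are copied from the lines above, and the root_password value is the same placeholder used there.

```shell
#!/bin/sh
# Sketch: write the sysidcfg keywords shown above into a zone root.
# write_sysidcfg is a hypothetical helper, not a Solaris command.
write_sysidcfg() {
    zoneroot=$1      # for example /zones/mysql/root
    hostname=$2      # for example mysql
    mkdir -p "$zoneroot/etc" || return 1
    # root_password below is a placeholder, exactly as in the post.
    cat > "$zoneroot/etc/sysidcfg" <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxxx
system_locale=C
network_interface=NONE {hostname=$hostname}
terminal=ansi
timezone=US/Central
EOF
}
```

Calling write_sysidcfg /zones/mysql/root mysql would produce the same file as the eight echo commands.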
And for the finale - boot the newly flashed mysql zone and you should see an enabled and operating mysql service.
# zoneadm -z mysql boot
# zlogin -C mysql
[Connected to zone 'mysql' console]
Hostname: mysql
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Mar 20 06:15:44 mysql sendmail[1719]: My unqualified host name (mysql) unknown; sleeping for retry
Mar 20 06:15:44 mysql sendmail[1722]: My unqualified host name (mysql) unknown; sleeping for retry
mysql console login: root
Password:
Last login: Mon Mar 19 17:10:10 on console
Mar 20 06:15:49 mysql login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.11 snv_57 October 2007
#
# svcs mysql
STATE STIME FMRI
online 6:31:28 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 8 Threads: 1 Questions: 1 Slow queries: 0 Opens: 6 Flush tables: 1 Open tables: 0 Queries per second avg: 0.125
How cool is that ? Not only did we clone the zone, but since the database is in /var, it was cloned as well. Perhaps not practical for every situation, but still pretty cool.
I will leave the flashing of default, web, and webmin as an exercise for the reader. Follow the sequence we used for the mysql zone and you should have four working zones, built from a flash-like mechanism that can be delivered via removable media, flasharchive, or shared storage.
Next time we'll take a closer look at MySQL and explore running it as a less privileged user. We'll also look at the action and value authorizations.

Friday February 16, 2007
Cloning Isn't Just for Sheep Any More
While it may not have the social implications nor headline appeal of the
now famous Dolly the Sheep, the zone cloning feature introduced with
Solaris 10 11/06
is worth further investigation. Before we do that, it is probably a good
idea to review basic zone creation and installation prior to the new
cloning capability.
Building Zones the Old Fashioned Way
The first step in the creation of a zone is establishing its configuration.
This is done by conversing with our friend, zonecfg(1M), who handles all
the details of writing the configuration XML file in /etc/zones and updating
the zones index file /etc/zones/index.
Such a conversation might go something like this.
# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
zonecfg:zone1> set zonepath=/zones/zone1
zonecfg:zone1> add inherit-pkg-dir; set dir=/opt; end
zonecfg:zone1> add net; set physical=iprb0; set address=192.168.100.101/24; end
zonecfg:zone1> add fs; set dir=/export; set special=/export; set options=[rw,nosuid]; set type=lofs; end
zonecfg:zone1> verify
zonecfg:zone1> commit
zonecfg:zone1> exit
#
If you grok zones then you recognize this as a typical sparse root zone. If you have attended one
of my zones best practices workshops then you will also notice that I'm following my own advice
and making /opt an inherited package directory.
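The same conversation can be captured as a command file and replayed with zonecfg -f, which helps when more than one zone follows the same pattern. The sketch below only prints the subcommands; zone_config is a hypothetical helper name, and the zonepath layout under /zones is assumed from the example above.

```shell
#!/bin/sh
# Sketch: emit the zonecfg subcommands from the session above so they can be
# replayed non-interactively. zone_config is a hypothetical helper name.
zone_config() {
    name=$1       # zone name, e.g. zone1
    nic=$2        # physical interface, e.g. iprb0
    address=$3    # address with prefix, e.g. 192.168.100.101/24
    cat <<EOF
create
set zonepath=/zones/$name
add inherit-pkg-dir
set dir=/opt
end
add net
set physical=$nic
set address=$address
end
add fs
set dir=/export
set special=/export
set options=[rw,nosuid]
set type=lofs
end
verify
commit
EOF
}
# Typical use on a Solaris host (not run here):
#   zone_config zone1 iprb0 192.168.100.101/24 > /tmp/zone1.cfg
#   zonecfg -z zone1 -f /tmp/zone1.cfg
```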
A quick check to make sure all is well.
# zoneadm list -cv
ID NAME STATUS PATH BRAND
0 global running / native
- zone1 configured /zones/zone1 native
All is as it should be (which is always the case for a how-to example).
The next step is a rather magical affair where the zoneroot is populated. This process
is initiated by uttering the following sequence
# zoneadm -z zone1 install
Once spoken, fantastic things start happening behind the scenes - all of them performed by our good friend
Live Upgrade. The actual sequence of events is something like this.
1. Create the new zoneroot if it doesn't already exist. If it does exist, make sure the permissions are set to 700 and it is owned by [0,0].
2. Mount all of the inherit-pkg-dir directories and file systems listed in the zone configuration file.
3. Create a candidate list of files for the new zoneroot by looking at the global zone contents file /var/sadm/install/contents.
   On my laptop daily driver, this totals approximately 2 million files.
4. Pick from this list all files that should be delivered to the new zoneroot by removing all files from packages that are marked as global zone only (SUNW_PKG_THISZONE is set to true).
   We're still over 2 million files, folks!
5. From the remaining list of files, remove all of those that will be delivered via inherit-pkg-dir directories.
   This is why I like inherit-pkg-dir. We are now down to about 2,300 files. If not for inherit-pkg-dir I would be hitting my boss up for a lot more storage.
6. Copy all of these files from the global zone into the new zoneroot, replacing commonly edited configuration files with those that were originally delivered with the package (e.g. /etc/passwd).
7. Once the files are in place there is one more step to perform. Some of the packages have preinstall and postinstall scripts that might do something important. These need to be run, even if all of the files are delivered via inherited directories. So in package dependency order, all of the packages identified as applicable to the new zone (SUNW_PKG_THISZONE=false) are installed sequentially.
8. Update the zones index file /etc/zones/index, marking the new zone as installed.
9. Unmount all of the file systems mounted in step 2.
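To get a feel for how much the inherited directories trim the copy list (the third through fifth steps above), something like the following can be run against a contents file. It is an illustration only, not how zoneadm implements the filter: count_copied is a hypothetical helper, it treats the first field of each contents line as the path, and it ignores the SUNW_PKG_THISZONE check entirely.

```shell
#!/bin/sh
# Sketch: count contents entries that would still need copying once paths
# under the inherited directories are excluded. Illustration only.
# On Solaris 10, use nawk or /usr/xpg4/bin/awk, since old awk lacks -v.
count_copied() {
    contents=$1; shift          # remaining arguments: inherit-pkg-dir paths
    dirs=$*
    awk -v dirs="$dirs" '
        BEGIN { n = split(dirs, d, " ") }
        /^#/ { next }                      # skip comment lines
        {
            path = $1
            sub(/=.*/, "", path)           # strip "=target" on link entries
            for (i = 1; i <= n; i++)
                if (path == d[i] || index(path, d[i] "/") == 1) next
            kept++
        }
        END { print kept + 0 }
    ' "$contents"
}
# Example with the default sparse root directories plus /opt:
#   count_copied /var/sadm/install/contents /lib /platform /sbin /usr /opt
```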
And we are done with the first part. The amount of time this takes can be estimated as
O(sparseness, number of packages, disk speed).
To speed up this process I would have to increase the degree of sparseness, which is pretty
hard to do once /opt has been added. I could also decrease the number of packages in
the global zone - this has some interesting possibilities. I could also get faster disks, but
that isn't always practical, especially with a small server configuration or a home system. I may be
talking myself into a minimal global zone installation with full root zones - but that sounds like a topic for another
day.
Enough of the theory, how long did this really take ?
On a relatively clean Nevada (aka
OpenSolaris Community Edition) install it was almost 10 minutes. The output is below
and I have annotated it with the installation steps outlined above.
# time zoneadm -z zone1 install
[1] [2] Preparing to install zone .
[3] [4] [5] Creating list of files to copy from the global zone.
[6] Copying <1934> files to the zone.
[7] Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1290> packages on the zone.
Initialized <1290> packages on zone.
[8] [9] Zone is initialized.
Installation of <1> packages was skipped.
Installation of these packages generated warnings:
The file contains a log of the zone installation.
real 9m38.951s
user 1m26.582s
sys 2m51.252s
But we're still not done, are we ? We still have first boot processing, which includes
initial population of the SMF repository (which is
O(number of services, speed of disks))
and system identification (which is either constant if a sysidcfg file is supplied or
O(Bob's increasingly bad typing rate) if we choose an interactive dialog).
For this example the first boot process took about 3 minutes to complete.
We now have a pristine zone, ready for work. But there is more to do,
isn't there ? We have to install some software, or at least configure software that is
already present. In fact, these customizations might be more complicated than the
zone installation process. If I had invested in developing automation scripts or was
using some form of advanced provisioning technology this might not be a big deal. If I'm
doing this manually then it may be quite a bit of work - and work that I don't want to
repeat with regularity. In other words: I'm not likely to use lots of zones and I don't
particularly look forward to OS updates.
Let's look at this a bit more and see if we can make this any easier.
This example comes from my (about to be posted) Zones workshop. In our new non-global
zone we will replace the Solaris version of Webmin with the community release from webmin.com.
A quick pkgchk(1M) of SUNWwebminu shows that its contents are in /usr/sfw, while SUNWwebminr
deposits its payload in /etc/webmin and an SMF manifest in /var/svc/manifest/application/management.
Performing the same task on the community edition of Webmin shows that it will install in
/etc/webmin and /opt/webmin. The clashing of /etc/webmin indicates that these cannot easily
co-exist, but complete replacement is possible (all inherit-pkg-dir destinations are
contained in a single directory).
So begin by removing the Solaris version of webmin. This is all done in our new zone.
# zonename
zone1
# pkgrm SUNWwebminr SUNWwebminu
pkgrm: ERROR: unable to remove
/usr/sfw/bin
/usr/sfw
/usr
## Updating system information.
Removal of partially failed.
At this point the root package SUNWwebminr is completely gone and SUNWwebminu is marked
as partially installed. One more pkgrm(1M) and it will be gone, at least as
far as our package contents are concerned. The bits in /usr/sfw are still there, but without
the configuration files in /etc/webmin, they are just that, bits in a directory.
# pkgrm SUNWwebminu
The following package is currently installed:
SUNWwebminu Webmin - Web-Based System Administration (usr)
(i386) 11.11.0,REV=2007.01.23.02.15
Do you want to remove this package? [y,n,?,q] y
## Removing installed package instance
(A previous attempt may have been unsuccessful.)
## Verifying package dependencies in global zone
## Processing package information.
## Removing pathnames in class
## Updating system information.
Removal of was successful.
Now to install the new webmin package. While you weren't looking I put the package in /var/tmp.
But there are some things to do before we can proceed. Remember, the package wants to
write into /opt/webmin, but /opt is read-only. We can do a couple of things: mount a
writable file system (LOFS, local real disk or NFS) onto /opt/webmin in our new zoneroot or
we could create a symbolic link for /opt/webmin that would point somewhere writable.
The link is much less confusing, so let's go that route this time.
In the global zone do something like
# ln -s /local/webmin /opt/webmin
# mkdir -p /zones/zone1/root/local/webmin
Now we are ready to proceed. In zone zone1, do the following
# pkgadd -d /var/tmp/webmin-1.320.pkg
The following packages are available:
1 WSwebmin Webmin - Web-based system administration
(all) 1.320
Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]:
Webmin has been installed and started successfully. Use your web
browser to go to
http://zone1:10000/
and login with the name and password you entered previously.
Installation of was successful.
Now we have a nicely customized non-global zone with one application ready to go.
It wasn't all that bad, but there were a few manual steps. Multiply this by
20 or so for all of the other applications and configuration steps that you
need to do for your system standards and then by 20 or so for the numbers of
zones you want to provision and it is suddenly looking like a tremendous
amount of work.
Until Solaris 10 11/06.
Send in the Clones: Solaris Zone Cloning
Zone cloning is a new feature that bypasses all of the steps in the zone
installation process and replaces them by copying the source zoneroot and
performing a sys-unconfig(1M). Of course this makes perfect sense - if
you duplicate the installation process you should get the exact same results
(a wise science teacher taught me that a long time ago). So why not
shortcut the process: just copy the zoneroot, run sys-unconfig(1M), fix up
the zones index file, and you are done.
But it gets better than that. If we are copying the zone root then any customization
performed on that zoneroot will be preserved. This includes the SMF repository. Not only
do we skip the initial import, we also preserve any customizations, such as service related
security hardening. Our new cloned zone would also have the community edition of Webmin
instead of the one in Solaris. And it's configured, enabled,
and will start automatically when the new zone boots - without requiring me to do anything
else. Now that's cool.
Let's see how all this works.
Step 1 - create a new zone configuration using our clone source as a template. We
need to change the zoneroot and IP address. In more complex configurations, other
attributes might need to be changed, but for this simple example this is all that is
required.
# zonecfg -z zone2
zone2: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone2> create -t zone1
zonecfg:zone2> set zonepath=/zones/zone2
zonecfg:zone2> select net address=192.168.100.101; set address=192.168.100.102/24; end
zonecfg:zone2> verify
zonecfg:zone2> commit
zonecfg:zone2> exit
Instead of installing a new zone, let's clone from zone1.
# time zoneadm -z zone2 clone zone1
WARNING: read-write lofs file system on '/export' is configured in both zones.
Copying /zones/zone1...
real 0m31.135s
user 0m0.431s
sys 0m3.818s
# zoneadm -z zone2 boot
# zlogin -C zone2 (or supply a sysidcfg file)
Now we're getting somewhere. Zone creation, including application configuration
and setup is reduced from about 15 minutes down to 31 seconds. This is getting really
cool.
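When more than one clone is needed, the zonecfg and zoneadm steps are easy to generate. The sketch below only prints the commands so they can be reviewed before anything is run. print_clone_commands is a hypothetical helper, and it assumes the naming and addressing pattern of this example: zoneN lives in /zones/zoneN and gets address 192.168.100.(100+N).

```shell
#!/bin/sh
# Sketch: print the commands needed to clone a source zone several times.
# Hypothetical helper; it prints only and runs nothing.
# Assumes the source zone's net address is 192.168.100.101, as above.
print_clone_commands() {
    src=$1; first=$2; last=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        zone="zone$i"
        host=$((100 + i))
        echo "zonecfg -z $zone 'create -t $src; set zonepath=/zones/$zone; select net address=192.168.100.101; set address=192.168.100.$host/24; end; verify; commit'"
        echo "zoneadm -z $zone clone $src"
        i=$((i + 1))
    done
}
```

For example, print_clone_commands zone1 2 5 prints eight lines, two for each of zone2 through zone5.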
Clones to the left of me, zpools to the right
But wait, there's more! There's an opportunity to make this even more efficient by
taking advantage of ZFS clones. Note that this is only available in OpenSolaris at
present, but consider the implications of the following example.
Note the use of zone relocation (move) - also a nifty new feature in Solaris 10 11/06.
# zpool create zfs_zones c4d0t0s2
# zoneadm -z zone1 move /zfs_zones/zone1
A ZFS file system has been created for this zone.
Moving across file systems; copying zonepath /zones/zone1...
Cleaning up zonepath /zones/zone1...
# zonecfg -z zone3
zone3: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone3> create -t zone1
zonecfg:zone3> set zonepath=/zfs_zones/zone3
zonecfg:zone3> select net address=192.168.100.101; set address=192.168.100.103/24; end
zonecfg:zone3> verify
zonecfg:zone3> commit
zonecfg:zone3> exit
# time zoneadm -z zone3 clone zone1
WARNING: read-write lofs file system on '/export' is configured in both zones.
Cloning snapshot local/zone1@SUNWzone3
Instead of copying, a ZFS clone has been created for this zone.
real 0m11.402s
user 0m0.380s
sys 0m0.412s
Wow! Under 12 seconds and a completely configured zone is
built. Throw in a sysidcfg file and we're ready to run. And by using a ZFS
clone, almost no additional disk space was required for this new zone.
# df -k | grep zfs_zones
zfs_zones 1007616 27 808469 1% /zfs_zones
zfs_zones/zone1 1007616 198590 808469 20% /zfs_zones/zone1
zfs_zones/zone3 1007616 198592 808469 20% /zfs_zones/zone3
1GB - 200MB - 200MB should be 600MB, but it's not. Since the zones are nearly
identical at this point, only 200MB is consumed from the zpool.
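The arithmetic can be checked directly from the df -k figures above (all values in kilobytes, copied from that output):

```shell
#!/bin/sh
# Figures copied from the df -k output above (kilobytes).
total=1007616
avail=808469
zone1=198590
zone3=198592

consumed=$((total - avail))          # space actually taken from the pool
naive=$((zone1 + zone3))             # what two full copies would cost
echo "consumed from pool: ${consumed} KB"     # 199147 KB
echo "two full copies:    ${naive} KB"        # 397182 KB
echo "saved by cloning:   $((naive - consumed)) KB"   # 198035 KB
```

The pool has given up roughly the size of one zone, not two, which is what you would expect when the clone shares nearly all of its blocks with the snapshot.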
Practical applications of Zone Cloning
Development environments and testbeds seem a very good fit. Build one standard
configuration of a zone and clone it as necessary for each developer or test
scenario. If things go wrong, which can happen while testing, just delete the
zone and re-clone it. 30 seconds later you are back in business.
Shhhh - don't tell anyone, but I like the privilege restrictions of zones. I'm very
likely to give a developer the root password to the zone and let them do what they
need to do. The worst they can do is destroy their environment. The impact to
me is two
zoneadm(1M) invocations and about 30 seconds of clock time.
The better use case comes when you combine this with another new feature in
Solaris 10 11/06: zone migration. Imagine the following scenario.
- Mount a file system containing a company standard non-global zoneroot
- Attach the zone to the system (zonecfg create -a and zoneadm attach)
- Clone this new zone as many times as needed
- Detach the original zone from the server (zoneadm detach)
- Unmount the detached zoneroot filesystem
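As a rough sketch, the scenario above maps onto a handful of zonecfg and zoneadm commands. The function below only prints them for review. flashzone_plan is a hypothetical helper, the /zones/<name> zonepath layout is an assumption, and the mount and unmount steps are left out because they depend on how the flashzone is delivered.

```shell
#!/bin/sh
# Sketch: print the attach / clone / detach sequence for review.
# Hypothetical helper and paths; nothing is executed.
flashzone_plan() {
    flash=$1         # name to give the attached template zone
    flashpath=$2     # where its zoneroot is mounted
    shift 2          # remaining arguments: names of zones to clone
    echo "zonecfg -z $flash 'create -a $flashpath'"
    echo "zoneadm -z $flash attach"
    for zone in "$@"; do
        echo "zonecfg -z $zone 'create -t $flash; set zonepath=/zones/$zone'"
        echo "zoneadm -z $zone clone $flash"
    done
    echo "zoneadm -z $flash detach"
}
```

Each clone would still need its own network and file system settings added to its configuration before it is cloned.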
This sounds a lot like jumpstart and flasharchives, doesn't it ? You bet it does,
and it has many of the same benefits. The flashzone (I'm making up this phrase) can
be delivered via USB stick, NFS file services, network file copy (scp), or embedded in
a server flasharchive. The possibilities are very intriguing.
I hope that this has helped introduce you to a few new zones features with
Solaris 10 11/06 (and one in OpenSolaris). As I ponder the combination of these new
features I find myself beginning to think that a minimal global zone and cloned full
root zones may in fact be a superior practice. We'll explore that in more detail soon.

Thursday May 25, 2006
What's in a name? that which we call a zone
What's in a name? that which we call a zone
By any other name would virtualize as complete;
One of the most common questions raised during boot camps and other Solaris briefings deals
with the subject of containers and zones. There seems to be some confusion as the terms
appear to be used interchangeably. Yes, they are related - specifically a zone is a new
type of container introduced in Solaris 10, but containers have their origins much earlier.
The 1913 Webster dictionary defines a Container as
Container \Con*tain"er\, n.
1. One who, or that which, contains.
which provides the foundation of the Solaris container. Quite simply, a Solaris container
is any method by which the resources of an application can be controlled (contained). I
suppose the origins of the container could date back to the earliest days of Solaris 2 with
the introduction of the processor_bind(2) system call and the pbind(1M) administrative command.
These controls were somewhat cumbersome for all but specific workloads and a bit primitive to
be called a container.
The container became a recognizable entity with the introduction of the Fair Share Scheduler (FSS)
in the Solaris 2.6 timeframe. We had a new scheduler class and a relatively easy to use
framework to label and control resource usage for complex applications. So we had a container
(project), but it was an unbundled product - so not quite a Solaris container.
When did Solaris get a container ? When the Solaris Resource Manager (SRM) became bundled in
Solaris 9. Every instance of Solaris had the capability to control resource usage
of nearly every application. Why didn't we call it a container in Solaris 9 ? We
only had one type of container (a project), so it wasn't really necessary to give it two different labels.
With the introduction of Solaris 10, we have a new type of container, the
Solaris zone.
Solaris zones are a virtualization technology that adds a security barrier around each user space instance.
We now have two orthogonal application controls: security and resource limits. The name containers was
introduced to describe both of these technologies.
So is a zone a container ? Absolutely. As are Solaris Resource Management projects and resource pools. And
container technologies can be combined to provide several dimensions of application controls (virtualized
user space object, resource caps, resource guarantees). Perhaps there will be other types of containers in the future, but for the
moment we have three very interesting technologies that can all wear the name container.

Monday May 22, 2006
To zone, or not to zone
To zone, or not to zone: that is the question:
Whether 'tis nobler in the mind of the administrator to suffer
The slings and arrows of outrageous utilization,
Or to take arms against a sea of application consolidations
One of the most interesting (and often hotly debated) questions raised while planning the adoption of Solaris 10 is
when to deploy applications in zones. You can almost hear Howie Mandel asking: zone, or no zone? Some early adopters of
Solaris 10 didn't include zones in their Standard Operating Environment (SOE) certifications,
preferring to consider their use later after the new OS environments have been deployed and
their comfort level with Solaris 10 improved. There is wisdom in this approach, but perhaps
the time is right to reconsider this question.
As with any new technology there are trade-offs that should be considered before committing
to a course of action. In the case of Solaris Zones, the considerations aren't quite as
complicated as they may seem - in fact they can be reduced to the following questions.
- Am I upgrading on existing hardware or installing on new hardware ?
This is the most important question, for several reasons.
If you are going to upgrade to Solaris 10 from a previous release and not change the hardware
then the most efficient method is to use Live Upgrade. Create a new boot environment, install Solaris
10 in the new set of disk slices, and let Live Upgrade manage all of the details of the upgrade
(users, file systems, network settings, etc). The upgrade can occur while the applications are
running in the current environment, so there is little impact. The previous Solaris environment
can be quickly restored if problems are discovered in the new Solaris 10 installation, so the level
of risk is minimized.
At present, Live Upgrade is not supported on a system with local zones, but if you are coming from
Solaris 8 or 9 you won't have local zones, so this restriction is rather moot. Conversely, if you
are installing on new hardware then you won't be using Live Upgrade, at least not initially.
So if you are upgrading on existing hardware then don't deploy zones initially. Perform
the upgrade (using Live Upgrade) and once the new environment has settled down, start planning
the migration of the existing applications into a zone, at a time that is convenient.
- Can the application run correctly in a local zone ?
The first question considered the most efficient approach, but we still must consider the
feasibility of running applications in zones. And there are a few considerations.
Nonglobal zones have a reduced set of privileges that may cause some applications to fail. An example
would be something like a DHCP server that requires raw IP access to communicate with systems that
don't have IP addresses. Since this privilege doesn't exist in a local zone (at least until
we get configurable privileges and per-zone IP stacks) then this type of application will not work in a local zone.
Some applications that don't appear to work with nonglobal zones may work with a little bit of creativity. An
example would be the NFS server - it does not work in a nonglobal zone. But that doesn't mean that you can't share
data from a nonglobal zone, you just have to use the NFS server in the global zone.
Use a writable loopback filesystem between the global and nonglobal
zone and share the directory using an NFS server in the global zone. Users in the nonglobal zone can modify and share data, just as if
the NFS server were running locally. Another example would be a backup client. It may be unnecessary to run a backup client in a nonglobal zone since all files are visible from the global zone. This can also be true for performance
data collectors, and is actually an interesting design goal for intrusion detection.
And that's really about it. If the application can run in a nonglobal zone and it's convenient to do so,
why not ? Let's hear the arguments in the case of the single nonglobal zone.
- Prosecution: You can't JumpStart a server with nonglobal zones.
Defense: It is true that the JumpStart installation environment doesn't have the services required to build
and manage zones. But that doesn't prevent you from developing a simple first boot service to
create an initial set of nonglobal zones. Leveraging SMF dependencies and service properties, you
can leave a nice log file behind to record what was done during zone creation (artifacts showing that
the scripts indeed ran as expected).
- Prosecution: It takes longer to patch a system with nonglobal zones.
Defense: That is true, but isn't it really a question of degree (ie, this is a civil case, not criminal)? With one nonglobal zone, the additional time required to patch the system is very small - and you can easily make the argument that if your maintenance window isn't large enough to support one nonglobal zone then it is really too small for even a global zone only installation.
But wait, there's more. With a nonglobal zone, it may be possible to have different patch levels or versions of applications. Zones are a user space abstraction so there's only one kernel (and devices), but it is possible to have different versions of nearly all of the user space components. Most applications are insulated from the kernel by libraries (such as libc), so this capability extends to applications as well as the basic OS components. The Branded Zones project in OpenSolaris extends this abstraction so that the user space components don't even have to be Solaris.
- Prosecution: Zones are more complicated to configure and administer.
Defense: This simply isn't the case. Spend some time with the Zones in a Day workshop and you will see how to script the creation of a nonglobal zone. You will also notice that the zone configuration contains a small subset of all of the platform configuration elements - the nonglobal zone doesn't even contain its own IP address. Details such as IP multi-pathing (IPMP) or IP Quality of Service (IPQoS) are inherited from the global zone and only need to be configured once on the system. It is certainly less effort to administer two nonglobal zones than two separate servers, and even in the one nonglobal zone case, it's about break even.
From the viewpoint of an application, provisioning managers, such as Sun N1 Service Provisioning System, handle most of the platform details. Even for this form of automation, nonglobal zones represent the most efficient framework for provisioning applications. Reading the design documents for zone cloning and migration shows that this will become even more efficient.
It's time for the defense to present their case.
- Defense: Once you have one nonglobal zone, it's easy to add a second zone.
The prosecution doesn't have much of a rebuttal. To consolidate future applications, existing applications deployed in the global zone would have to be migrated to a nonglobal zone, which requires a significant additional effort. If the applications are in production, this migration would be challenging and quite disruptive.
- Defense: All resource usage in a nonglobal zone can be measured.
Again, the prosecution stays silent. A privileged (root) user can circumvent project level accounting, making it difficult to guarantee that all workload is identified in project level reporting. Nonglobal zones do not see their projects, nor do they have the administrative rights to modify the associated projects, even if they could be observed.
- Defense: Nonglobal zones can be covertly audited.
As with the preceding argument, a local zone would have no visibility into intrusion detection and auditing being run in the global zone, specifically security logs. This makes it impossible to cover your tracks if you compromise a zone. In fact, the lack of visible intrusion detection might influence a hacker to stay around a bit longer and leave more evidence that will assist future forensic analysis.
- Defense: Zones and CPU pools may allow lower costs for software licensing.
Many software partners, such as Oracle, consider the combination of processor pools and nonglobal zones as hard partitioning that may allow for the licensing of a subset of the available resources. Since the nonglobal zone lacks the privileges to change the processor pool configuration, even a rogue administrator or developer cannot invalidate the licensing that is being enforced from the global zone. Regular configuration audits are easy to run to ensure future compliance.
We haven't heard from the prosecution lately - are they still here ? Bueller ? Bueller ?
- Defense: A compromise in a nonglobal zone doesn't compromise another zone (global or nonglobal).
The reduction of privileges in a nonglobal zone will prevent a compromise in one zone from affecting another zone. The only user space components that are shared between zones are file systems, and those can be protected somewhat by mount options (nodevices, nosuid, noexec). If a nonglobal zone is compromised, there is a limit on the promotion of privileges that isolates other zones from further compromises.
The prosecution was last seen downloading the latest Software Express and working on a first boot service to create nonglobal zones after a JumpStart installation.

Thursday February 09, 2006
SMF manifest examples for Apache1 and MySQL
In the Service Management in a Day workshop (and the earlier Migrating a Legacy RC Service module from the Solaris Deep Dives) we examine the migration of MySQL from an RC script to a fully managed SMF service.
Why MySQL ? Well, it's a convenient way to point out that MySQL is included in Solaris 10. But the real reason is that it is rather simple and makes a great platform to show what SMF can really do for us - and it's certainly more than a one trick pony.
So let's set up MySQL and see where this goes. You will find the instructions in /etc/sfw/mysql/README.solaris.mysql, but be careful as there is a small error. The last time I looked, chmod -R requires two arguments, not one.
# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf
Let's start the database manually and make sure that all is well.
# /etc/sfw/mysql/mysql.server start
Starting mysqld daemon with databases from /var/mysql
# /usr/sfw/bin/mysqladmin status
Uptime: 32 Threads: 1 Questions: 1 Slow queries: 0 Opens: 6
Flush tables: 1 Open tables: 0 Queries per second avg: 0.031
Time for the first SMF value - resilient services. Let's terminate mysqld and see what happens.
# pkill mysql
# mysqladmin status
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)'
Check that mysqld is running and that the socket: '/tmp/mysql.sock' exists!
This is what we expect. When mysqld terminates, nobody is watching and it remains down until the next
reboot (or transition back to run level 3).
So what can SMF do for me here ? Paying attention to a non-transient service is a good start.
What we need now is a manifest for MySQL. You can take a look at mine, or if you follow the RC Service Migration howto then you will come up with something very close. Put
mysql.xml somewhere in /var/svc/manifest (application or local seem a good place, local probably being the best choice). Reboot or run the
manifest-import service method to make SMF aware of the new service definition.
# svcs mysql
svcs: Pattern 'mysql' doesn't match any instances
STATE STIME FMRI
# /lib/svc/method/manifest-import
Loaded 1 smf(5) service descriptions
# svcadm enable mysql
# svcs mysql
STATE STIME FMRI
online 22:39:54 svc:/application/mysql:default
# mysqladmin status
Uptime: 4 Threads: 1 Questions: 1 Slow queries: 0 Opens: 6
Flush tables: 1 Open tables: 0 Queries per second avg: 0.250
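For readers who want a starting point, here is a minimal manifest wrapped in a shell function that writes it to a file. This is not the manifest referenced above, just a sketch consistent with what the transcripts show: the service name application/mysql and the mysql.server start and stop methods. The file system dependency, the 60 second timeouts, and the write_mysql_manifest helper name are my assumptions, so run the result through svccfg validate before importing it.

```shell
#!/bin/sh
# Sketch: write a minimal SMF manifest for MySQL to the given path.
# Not the manifest referenced in the post. The dependency and the timeout
# values are assumptions. write_mysql_manifest is a hypothetical helper.
write_mysql_manifest() {
    out=$1
    cat > "$out" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='mysql'>
  <service name='application/mysql' type='service' version='1'>
    <create_default_instance enabled='false' />
    <single_instance />
    <dependency name='filesystem' grouping='require_all'
        restart_on='none' type='service'>
      <service_fmri value='svc:/system/filesystem/local' />
    </dependency>
    <exec_method type='method' name='start'
        exec='/etc/sfw/mysql/mysql.server start' timeout_seconds='60' />
    <exec_method type='method' name='stop'
        exec='/etc/sfw/mysql/mysql.server stop' timeout_seconds='60' />
  </service>
</service_bundle>
EOF
}
# Typical use on a Solaris host (not run here):
#   write_mysql_manifest /var/svc/manifest/local/mysql.xml
#   svccfg validate /var/svc/manifest/local/mysql.xml
```

Because no duration property is set, svc.startd treats this as an ordinary contract service, which is what gives the restart behavior described in this post.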
Now, let's try that pkill thing again.
# pkill mysqld
# svcs mysql
STATE STIME FMRI
online 22:45:45 svc:/application/mysql:default
Now, if we watch the service log file, which is conveniently located at /var/svc/log/application-mysql:default.log,
we will see svc.startd notice that all of the processes have terminated, yet it isn't a transient service. So there
is a problem and the service should be restarted.
[ Feb 8 16:53:36 Stopping because all processes in service exited. ]
[ Feb 8 16:53:36 Executing stop method ("/etc/sfw/mysql/mysql.server stop") ]
No mysqld pid file found. Looked for /var/mysql/pandora.pid.
[ Feb 8 16:53:36 Method "stop" exited with status 0 ]
[ Feb 8 16:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
[ Feb 8 16:53:36 Method "start" exited with status 0 ]
This is pretty cool. We've made MySQL somewhat more available than it would have been straight out of the box. Does this eliminate the requirement for High Availability Clusters ? No, but it does open an interesting discussion.
In this example my observation of MySQL's availability is rather naive - if it's running it must be OK. For something like a database server you may want to connect and manipulate some tables to see if the service is really running. We should also note that SMF doesn't really handle the platform availability issues - so HA Clusters are still needed. But it's also interesting to note that many HA scripts only provide coverage for a subset of critical services, usually a database, but ignore the dozens of other services that are also required for proper operation of the service being clustered. Lacking a sophisticated dependency framework, a node failover occurs when one of these other services fails.
SMF provides such a framework, including the watchdog monitor (svc.startd) - and it does so with very little effort on the part of the administrator or application packager.
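A less naive check would actually talk to the database. Here is a minimal sketch of such a probe, assuming the mysql client lives next to mysqladmin in /usr/sfw/bin and that the calling user can connect without a password (both are assumptions). The function name mysql_probe is made up for this example.

```shell
# Sketch: a probe that goes one step beyond "is the process there".
# MYSQL is the client binary and may be overridden.
MYSQL=${MYSQL:-/usr/sfw/bin/mysql}

mysql_probe() {
    # Run a trivial query; a wedged server fails this even though
    # a mysqld process still exists.
    if "$MYSQL" -e 'SELECT 1' >/dev/null 2>&1; then
        echo "mysql: ok"
        return 0
    fi
    echo "mysql: query failed" >&2
    return 1
}
```

svc.startd will not call this on its own. Something else, cron for instance, would have to run it and decide whether a svcadm restart mysql is warranted.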
But wait, there's more.
In a recent discussion over service minimization (the idea that you don't install software that you have no intention of running) a more subtle value of SMF can be observed. Solaris 10 allows us to separate the question of installation from activation. It's quite easy to install software and then verify that it is disabled. In fact a routine scan of service properties and a comparison against a baseline is a good idea.
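The baseline comparison can be as simple as a sorted svcs listing and diff. This is a minimal sketch; the function names and the file locations in the example are hypothetical.

```shell
# Sketch: snapshot service states and compare them with a saved baseline.

# Print "fmri state" for every service instance, sorted by FMRI so that
# a state change shows up as a one-line difference.
snapshot_services() {
    svcs -H -a -o fmri,state | sort
}

# Compare a baseline file ($1) with a current snapshot file ($2).
# Prints the differences and returns non-zero when the files differ.
compare_to_baseline() {
    baseline=$1
    current=$2
    diff "$baseline" "$current"
}

# Example:
#   snapshot_services > /var/tmp/smf.baseline    # once, on a known-good system
#   snapshot_services > /var/tmp/smf.now         # later, from cron
#   compare_to_baseline /var/tmp/smf.baseline /var/tmp/smf.now
```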
Here is where a bit of creativity can give us additional safeguards. A well developed SMF manifest will allow us to make an additional distinction. We can now observe the installation of a service, the configuration of a service, and whether or not that service has actually been activated.
How is this done ? A dependency on a configuration file is a good start. Let's look at the MySQL manifest and see how this was done.
<dependency
name='config_file'
type='path'
grouping='require_all'
restart_on='none'>
<service_fmri value='file://localhost/var/mysql/my.cnf' />
</dependency>
This is a dependency on a particular configuration file, in this case /var/mysql/my.cnf. If this file is missing then MySQL will not transition to online. If enabled, it will immediately transition to the offline state, and a check of svcs -l mysql will show the missing configuration file.
Now this is very cool indeed. For this service to be activated it must be installed, configured
and enabled. Failing to configure the service (consider the case of sshd which you probably don't want to run without a configuration file) will provide an obvious and easily observed error condition. This may change the way you look at service minimization.
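To see at a glance which configuration files a service is waiting on, the file dependencies can be pulled out of svcs -l and checked against the file system. This is a minimal sketch. It assumes that svcs -l prints one line per dependency beginning with the word dependency and containing the file://localhost URI. The function names are made up for this example.

```shell
# Sketch: list the file dependencies of a service and flag missing ones.

# Read "svcs -l <service>" output on stdin and print the path of every
# file://localhost dependency (assumes one URI per dependency line).
config_dependencies() {
    sed -n 's|^dependency.*file://localhost\(/[^ ]*\).*$|\1|p'
}

# Read paths on stdin; print the ones that do not exist as files and
# return non-zero if any were missing.
report_missing_config() {
    status=0
    while read path; do
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return $status
}

# Example:
#   svcs -l mysql | config_dependencies | report_missing_config
```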
The takeaway from this exercise is that as you plan your RC service migrations to SMF, add a dependency on an easily observed indication that the service has been properly configured, such as a configuration file.
This brings me to my next example, an
Apache 1 service manifest. We start by copying the Apache 2 service manifest at /var/svc/manifest/network/http-apache2.xml - as that seemed like a good place to start. I changed the service name, the documentation block, and the start/stop methods as before.
There's one new wrinkle - take a look at the following property group
<property_group name='httpd' type='application'>
<stability value='Evolving' />
<propval name='ssl' type='boolean' value='false' />
</property_group>
If you look at the Apache 2 start method /lib/svc/method/http-apache2, you will see a query for this service property:
ssl=`svcprop -p httpd/ssl svc:/network/http:apache2`
if [ "$ssl" = false ]; then
cmd="start"
else
cmd="startssl"
fi
;;
So this is how we enable SSL support for Apache 2. If we want to do something similar for Apache 1 then we will have to modify the start script /etc/init.d/apache. The other solution would be to remove the property group from the manifest and modify the start definition to call either /etc/init.d/apache start or /etc/init.d/apache startssl.
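If you would rather leave /etc/init.d/apache untouched, a small wrapper can serve as the SMF start method and make the same decision. Here is a minimal sketch, assuming the new service is named svc:/network/http:apache (a hypothetical FMRI, use whatever your manifest declares). The function names are made up for this example.

```shell
# Sketch of a start method for an Apache 1 instance that honours httpd/ssl.
# The FMRI is hypothetical and may be overridden.
APACHE_FMRI=${APACHE_FMRI:-svc:/network/http:apache}

# Map the boolean property value to an /etc/init.d/apache argument.
# Anything other than "true" (including an empty value when the
# property cannot be read) falls back to a plain start.
apache_start_arg() {
    case "$1" in
        true) echo startssl ;;
        *)    echo start ;;
    esac
}

start_apache() {
    ssl=`svcprop -p httpd/ssl "$APACHE_FMRI"`
    arg=`apache_start_arg "$ssl"`
    /etc/init.d/apache "$arg"
}
```

Note that the Apache 2 method shown above treats anything other than false as a request for SSL. The sketch takes the more conservative default.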
After you import this new manifest, please remember to unlink the start and stop links from all run level directories (there's one start in rc3.d and one kill in each run level).
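Finding those links by hand is tedious, so here is a minimal sketch that lists them for review. It assumes the links follow the usual S or K plus two digit naming. The function name list_rc_links is made up for this example.

```shell
# Sketch: list the legacy rc links for a service.
# $1 = init script name (e.g. apache)
# $2 = directory that holds the rc?.d directories (default /etc)
list_rc_links() {
    name=$1
    root=${2:-/etc}
    for link in "$root"/rc?.d/[SK][0-9][0-9]*"$name"*; do
        # An unmatched pattern is left as-is, so test before printing.
        [ -f "$link" ] && echo "$link"
    done
    return 0
}

# Example: review the list first, then remove the links by hand.
#   list_rc_links apache
```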
This brings me to my last recommendation - using a configuration file dependency to help keep service instances separated. This is particularly important for the http service as all the executables are named httpd. By adding a dependency to the configuration file you have added an important documentation item that will come in handy when diagnosing service failures. If the instance fails and ends up in a maintenance state, a quick look at svcs -l will tell you which instance you need to investigate.
Where can I learn more about this ? The
OpenSolaris SMF Community would be a good place to look. In addition to the excellent articles on Solaris Service Management, there is a
repository of contributed manifests that might help you get started. And you are invited to contribute manifests for your converted services - you might even receive a nice OpenSolaris trinket for your efforts.

Wednesday November 09, 2005
Common First Time Mistakes - Containers
Containers in Solaris 10 feature an interesting virtualization technology called zones. Local zones
are amazingly easy to configure and install, but there are a few things that can trip you up the first
few times.
- Local zones require system identification
Since each local zone has its own /etc, it can have a different identity from that of the global zone (timezone, locale,
root password). So we need to supply some basic configuration information about the local zone. If you are experienced with
Solaris you will recognize the system identification process that runs at first boot. Solaris 10 adds the NFS V4 domain
mapping question which must be answered in addition to supplying an /etc/sysidcfg file. We'll deal with that later.
The complication presented by local zones is that you can use zlogin(1) to enter a local zone before it has completed
its system identification. This is not possible for the global zone, nor in a prior Solaris release - so you may not even
consider this a possibility when diagnosing your first few zone configuration problems.
The symptoms are that you can enter the zone using zlogin(1), but nothing else works. You cannot get in via ssh,
rlogin, or telnet even though they have been configured properly (or so you believe at this point). Your first
step in diagnosis should be an svcs -a, where you will see service states of uninitialized. This is the clue!
If you look all the way back in the service states you will see there is a service called sysidtool (that calls
the service script /lib/svc/method/sysidtool-system). This is where system identification is done (and if you look at the
method you will discover how to answer the NFS V4 question).
The resolution is simple - connect to the local zone console using zlogin -C and answer the identification questions.
If you are using the Java Desktop System then terminal type 12 (xterms) will provide the best results.
You will also experience this problem if your sysidcfg file contains an error. The most common errors are incorrect specifications of the timezone and root password.
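A quick way to spot this condition from the global zone is to count the services still stuck in the uninitialized state. Here is a minimal sketch. The function name count_uninitialized and the zone name in the example are placeholders.

```shell
# Sketch: count services that are still uninitialized.
# Reads "svcs -a" style output (STATE STIME FMRI) on standard input.
count_uninitialized() {
    awk '$1 == "uninitialized" { n++ } END { print n + 0 }'
}

# Example, run from the global zone:
#   zlogin zone1 svcs -a | count_uninitialized
```

A large count that does not shrink over time suggests that system identification is still waiting for answers on the zone console.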
- Failure to answer the NFS V4 question
You can script the creation of a local zone and supply default identification through the use of an /etc/sysidcfg file. Experienced Solaris administrators will recognize this method from unattended JumpStart installs. Solaris 10 requires one additional configuration item that isn't satisfied by /etc/sysidcfg: the NFS V4 domain mapping question.
Automating the NFS V4 configuration requires 2 steps. First, specify the value of NFSMAPID_DOMAIN in $zonepath/root/etc/default/nfs. Second, create a file called $zonepath/root/etc/.NFS4inst_state.domain
to let sysidtool know that you have answered the question.
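The two steps can be scripted from the global zone once the zone is installed but before its first boot. This is a minimal sketch. The function name preseed_nfs4_domain is made up, and the zonepath and domain in the example are placeholders.

```shell
# Sketch: pre-answer the NFS V4 domain question for an installed zone.
# $1 = zonepath, $2 = NFS V4 domain
preseed_nfs4_domain() {
    zoneroot=$1/root
    domain=$2
    nfsconf=$zoneroot/etc/default/nfs

    if [ ! -d "$zoneroot/etc/default" ]; then
        echo "not an installed zone root: $zoneroot" >&2
        return 1
    fi

    # Step 1: drop any existing (possibly commented-out) setting,
    # then append the new one.
    if [ -f "$nfsconf" ]; then
        grep -v '^#*NFSMAPID_DOMAIN=' "$nfsconf" > "$nfsconf.new" || true
        cat "$nfsconf.new" > "$nfsconf"
        rm -f "$nfsconf.new"
    fi
    echo "NFSMAPID_DOMAIN=$domain" >> "$nfsconf"

    # Step 2: the marker file that tells sysidtool the question is answered.
    touch "$zoneroot/etc/.NFS4inst_state.domain"
}

# Example:
#   preseed_nfs4_domain /zones/zone1 example.com
```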
- The local zone root directory is $zonepath/root
If the lab equipment is sufficiently fast then we have a little competition in the Containers workshop. The challenge is to completely automate the installation of a local container and provision an application (typically Apache or MySQL) as well as set up root access via telnet, rlogin, or ssh - but do it in a single script with no intervention. Run the script and the next step is to connect to the provisioned service.
After 10 minutes of scripting work, and another 10 minutes for a local zone to install, there are always a few exclamations of "Doh" as students realize that they dropped a sysidcfg into $zonepath/etc rather than $zonepath/root/etc.
- Make sure the mount points exist for all file systems being supplied by zoneadmd
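A small helper that refuses to copy anywhere but the zone root avoids this mistake. Here is a minimal sketch, assuming the identification file is named sysidcfg as described in sysidcfg(4). The function name install_sysidcfg and the paths in the example are placeholders.

```shell
# Sketch: copy a prepared sysidcfg into the zone root, not the zonepath.
# $1 = prepared sysidcfg file, $2 = zonepath
install_sysidcfg() {
    src=$1
    zonepath=$2
    dest=$zonepath/root/etc
    if [ ! -d "$dest" ]; then
        echo "no such directory: $dest (is the zone installed?)" >&2
        return 1
    fi
    cp "$src" "$dest/sysidcfg"
}

# Example:
#   install_sysidcfg /var/tmp/zone1.sysidcfg /zones/zone1
```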
Supplying lofs file systems via the zone configuration file (see zonecfg man page) is a convenient way to share files between zones (including the global zone). The advantage of this method is that zoneadmd performs the loopback mount from the privileged global zone as it readies the local zone, thus the local zone isn't permitted to undo this mount. If the mount point (in the local zone) does not exist then the zone will fail to boot.
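The mount points can be checked before the first boot by comparing the zone configuration with the zone root. This is a minimal sketch. It assumes that zonecfg info fs prints each mount point on a line of the form dir: /path. The function name and the paths in the example are placeholders.

```shell
# Sketch: report configured fs mount points that do not exist in the zone.
# Reads "zonecfg -z <zone> info fs" output on stdin.
# $1 = zone root directory ($zonepath/root)
missing_mountpoints() {
    zoneroot=$1
    status=0
    for dir in `awk '$1 == "dir:" { print $2 }'`; do
        if [ ! -d "$zoneroot$dir" ]; then
            echo "missing mount point: $zoneroot$dir"
            status=1
        fi
    done
    return $status
}

# Example:
#   zonecfg -z zone1 info fs | missing_mountpoints /zones/zone1/root
```

Anything it reports can be created with mkdir -p from the global zone before booting the zone.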
- Network not being plumbed will cause a local zone to fail to boot
This one is rather unique to the mobile user. For convenience you may have your global zone boot without networking configured. Once you log in, you can run a simple script to plumb up your network interfaces based on how you need to connect (fixed IP address at home or in a lab, DHCP in a hotel, etc). If your local zones have network resources, which is typically the case, the network interfaces must be plumbed up in the global zone prior to booting the local zone. This one has gotten me more than once in a customer demonstration.
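A pre-boot check in the global zone can catch this before the demonstration starts. This is a minimal sketch. It assumes that zonecfg info net prints each interface on a line of the form physical: name. The function name and the zone name in the example are placeholders.

```shell
# Sketch: report zone network interfaces that are not plumbed.
# Reads "zonecfg -z <zone> info net" output on stdin; run it in the
# global zone.
unplumbed_interfaces() {
    status=0
    for nic in `awk '$1 == "physical:" { print $2 }'`; do
        # ifconfig exits non-zero for an interface that is not plumbed.
        if ifconfig "$nic" >/dev/null 2>&1; then
            :
        else
            echo "not plumbed: $nic"
            status=1
        fi
    done
    return $status
}

# Example:
#   zonecfg -z zone1 info net | unplumbed_interfaces && zoneadm -z zone1 boot
```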