How Cool is That ?
Bob Netherton's Weblog
20091009 Friday October 09, 2009
What's New in Solaris 10 10/09

Solaris 10 10/09 (u8) is now available for download at http://sun.com/solaris/get.jsp. DVD ISO images (full and segments that can be reassembled after download) are available for both SPARC and x86.

Here are a few of the new features in this release that caught my attention.

Packaging and Patching

Improved performance of SVR4 package commands: Improvements have been made in the SVR4 package commands (pkgadd, pkgrm, pkginfo et al). The impact of these can be seen in drastically reduced zone installation time. How much of an improvement you ask (and you know I have to answer with some data, right) ?
# cat /etc/release; uname -a

                        Solaris 10 5/09 s10x_u7wos_08 X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                             Assembled 30 March 2009
SunOS chapterhouse 5.10 Generic_141415-09 i86pc i386 i86pc

# time zoneadm -z zone1 install
Preparing to install zone <zone1>.
Creating list of files to copy from the global zone.
Copying <2905> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1453> packages on the zone.
Initialized <1453> packages on zone.
Zone <zone1> is initialized.
Installation of these packages generated errors: 
The file  contains a log of the zone installation.

real    5m48.476s
user    0m45.538s
sys     2m9.222s
#  cat /etc/release; uname -a

                       Solaris 10 10/09 s10x_u8wos_08a X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 16 September 2009
SunOS corrin 5.10 Generic_141445-09 i86pc i386 i86pc

# time zoneadm -z zone1 install
Preparing to install zone <zone1>.
Creating list of files to copy from the global zone.
Copying <2915> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1432> packages on the zone.
Initialized <1432> packages on zone.
Zone <zone1> is initialized.
Installation of these packages generated errors: 
The file  contains a log of the zone installation.

real    3m4.677s
user    0m44.593s
sys     0m48.003s
OK, that's pretty impressive. A zone installation on Solaris 10 10/09 takes about half the time it does on Solaris 10 5/09 (roughly 3 minutes versus almost 6). It is also worth noting the rather large reduction in system time (48 seconds vs 129 seconds).
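If you want to repeat the comparison on your own systems, the arithmetic can be scripted. This is only a sketch: to_seconds and percent_of are helper names made up for this example, and the durations are the real times copied from the two runs above.

```shell
#!/bin/sh
# Convert a `time`-style duration such as 5m48.476s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Express the new duration as a percentage of the old one.
percent_of() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.0f\n", 100 * new / old }'
}

old_real=$(to_seconds 5m48.476s)   # Solaris 10 5/09
new_real=$(to_seconds 3m4.677s)    # Solaris 10 10/09
echo "real: ${new_real}s vs ${old_real}s ($(percent_of "$new_real" "$old_real")% of the old time)"
```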

Zones parallel patching: Before Solaris 10 10/09 the patching process was single-threaded, which could lead to prolonged patching times on a system with several non-global zones. Starting with this update you can specify the number of threads to be used to patch a system with zones. Enable this feature by assigning a value to num_proc in /etc/patch/pdo.conf. The effective value is capped at 1.5 times the number of on-line CPUs; a lower value of num_proc limits it further.

This feature is also available by applying Solaris patches 119254-66 (SPARC) or 119255-66 (x86).

For more information on the effects of zone parallel patching, see Container Guru Jeff Victor's excellent Patching Zones Goes Zoom.
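To make the num_proc cap concrete, here is a small sketch of the rule as described above. The function name effective_threads is made up for this example, and rounding the 1.5x cap down for odd CPU counts is my assumption, not something taken from the patch utilities.

```shell
#!/bin/sh
# Effective number of patching threads: the requested num_proc,
# capped at 1.5 times the number of on-line CPUs.
# Assumption: the cap is rounded down when the CPU count is odd.
effective_threads() {
    num_proc=$1
    online_cpus=$2
    cap=$(( online_cpus * 3 / 2 ))
    if [ "$num_proc" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$num_proc"
    fi
}

effective_threads 8 4     # request is above the cap for 4 CPUs
effective_threads 4 16    # request is below the cap for 16 CPUs
```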

ZFS Enhancements

Flash archive install into a ZFS root filesystem: ZFS support for the root file system was introduced in Solaris 10 10/08 but the install tools did not work with flash archives. Solaris 10 10/09 provides the ability to install a flash archive created from an existing ZFS root system. This capability is also provided by patches 119534-15 + 124630-26 (SPARC) or 119535-15 + 124631-27 (x86) that can be applied to a Solaris 10 10/08 or later system. There are still a few limitations: the flash source must be from a ZFS root system, and you cannot use differential archives. More information can be found in Installing a ZFS Root File System (Flash Archive Installation).

Set ZFS properties on the initial zpool file system: Prior to Solaris 10 10/09, ZFS file system properties could only be set after the initial file system was created. This made it impossible to create a pool with the same name as an existing mounted file system, or to have replication or compression in effect from the moment the pool was created. In Solaris 10 10/09 you can specify any ZFS file system property using zpool create -O, repeating the option once for each property.
# zpool create -O mountpoint=/data -O copies=3 -O compression=on datapool c1t1d0 c1t2d0
ZFS Read Cache (L2ARC): You now have the ability to add persistent read ahead caches to a ZFS zpool. This can improve the read performance of ZFS as well as reduce the ZFS memory footprint.

L2ARC devices are added as cache vdevs to a pool. In the following example we will create a pool of 2 mirrored devices, 2 cache devices and a spare.
 
# zpool create datapool mirror c1t1d0 c1t2d0 cache c1t3d0 c1t4d0 spare c1t5d0

# zpool status datapool
  pool: datapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        cache
          c1t3d0    ONLINE       0     0     0
          c1t4d0    ONLINE       0     0     0
        spares
          c1t5d0    AVAIL

errors: No known data errors
So what do ZFS cache devices do ? Rather than go into a lengthy explanation of the L2ARC, I would rather refer you to Fishworks developer Brendan Gregg's excellent treatment of the subject.

Unlike the intent log (ZIL), L2ARC cache devices can be added and removed dynamically.
# zpool remove datapool c1t3d0
# zpool remove datapool c1t4d0

# zpool status datapool
  pool: datapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        spares
          c1t5d0    AVAIL

errors: No known data errors


# zpool add datapool cache c1t3d0

# zpool status datapool
  pool: datapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datapool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        cache
          c1t3d0    ONLINE       0     0     0
        spares
          c1t5d0    AVAIL

errors: No known data errors
New cache control properties: Two new ZFS properties are introduced with Solaris 10 10/09. These control what is stored (nothing, data + metadata, or metadata only) in the ARC (memory) and L2ARC (external) caches. The new properties are primarycache and secondarycache, and they can take the values none, all, or metadata.
# zpool create -O primarycache=metadata -O secondarycache=all datapool c1t1d0 c1t2d0 cache c1t3d0 
There are workloads such as databases that perform better or make more efficient use of memory if the system is not competing with the caches that the applications are maintaining themselves.

User and group quotas: ZFS has always had quotas and reservations but they were applied at the file system level. Achieving user or group quotas required creating additional file systems, which could make administration more complex. Starting with Solaris 10 10/09 you can apply both user and group quotas to a file system much like you would with UFS. The zpool must be at version 15 or later and the ZFS file system must be at version 4 or later.
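The version requirement can be checked with a few lines of shell. This is a sketch only: version_of and quotas_supported are names made up for this example, and the minimums come from the upgrade listings in this post (pool version 15 is "user/group space accounting", file system version 4 is "userquota, groupquota properties"). The sample lines are in the format printed by zpool get version and zfs get version.

```shell
#!/bin/sh
# Pull the VALUE column out of `zpool get version` or `zfs get version` output.
version_of() {
    awk '$2 == "version" { print $3 }'
}

# Succeeds when both versions meet the minimums for user and group quotas.
quotas_supported() {
    pool_ver=$1
    fs_ver=$2
    [ "$pool_ver" -ge 15 ] && [ "$fs_ver" -ge 4 ]
}

# Sample lines in the format of the command output; replace with live output.
pool_ver=$(echo "rpool  version   18       default" | version_of)
fs_ver=$(echo "rpool/newdata  version   4" | version_of)

if quotas_supported "$pool_ver" "$fs_ver"; then
    echo "user and group quotas can be set"
else
    echo "upgrade the pool or file system first"
fi
```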

Let's create a file system and see if we are at the proper versions to set quotas.
# zfs create rpool/newdata
# chown bobn:local /rpool/newdata

# zpool get version rpool
NAME   PROPERTY  VALUE    SOURCE
rpool  version   18       default


# zpool upgrade -v
This system is currently running ZFS pool version 18.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  snapshot user holds
For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.


# zfs get version rpool/newdata
NAME           PROPERTY  VALUE    SOURCE
rpool/newdata  version   4 

# zfs upgrade -v
The following filesystem versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS filesystem version
 2   Enhanced directory entries
 3   Case insensitive and File system unique identifier (FUID)
 4   userquota, groupquota properties

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/zpl/N

Where 'N' is the version number.
Excellent. Now let's set a user and group quota and see what happens. We'll set a group quota of 1GB and a user quota at 2GB.
# zfs set groupquota@local=1g rpool/newdata
# zfs set userquota@bobn=2g rpool/newdata

# su - bobn

% mkfile 500M /rpool/newdata/file1
% mkfile 500M /rpool/newdata/file2
% mkfile 500M /rpool/newdata/file3
file3: initialized 40370176 of 524288000 bytes: Disc quota exceeded

As expected, we have exceeded our group quota. Let's change the group of the existing files and see if we can proceed to our user quota.
% rm /rpool/newdata/file3
% chgrp sales /rpool/newdata/file1 /rpool/newdata/file2
% mkfile 500m /rpool/newdata/file3
Could not open /rpool/newdata/file3: Disc quota exceeded

Whoa! What's going on here ? Relax - ZFS does things asynchronously unless told otherwise. And we should have noticed this when the mkfile for file3 actually started. ZFS wasn't quite caught up with the current usage. A good sync should do the trick.
% sync
% mkfile 500M /rpool/newdata/file3
% mkfile 500M /rpool/newdata/file4
% mkfile 500M /rpool/newdata/file5
/rpool/newdata/file5: initialized 140247040 of 524288000 bytes: Disc quota exceeded

Great. We now have user and group quotas. How can I find out what I have used against my quota ? There are two new ZFS properties, userused and groupused, that show what the user or group is currently consuming.
% zfs get userquota@bobn,userused@bobn rpool/newdata
NAME           PROPERTY        VALUE           SOURCE
rpool/newdata  userquota@bobn  2G              local
rpool/newdata  userused@bobn   1.95G           local

% zfs get groupquota@local,groupused@local rpool/newdata
NAME           PROPERTY          VALUE             SOURCE
rpool/newdata  groupquota@local  1G                local
rpool/newdata  groupused@local   1000M             local

% zfs get groupquota@sales,groupused@sales rpool/newdata
NAME           PROPERTY          VALUE             SOURCE
rpool/newdata  groupquota@sales  none              local
rpool/newdata  groupused@sales   1000M             local

% zfs get groupquota@scooby,groupused@scooby rpool/newdata
NAME           PROPERTY           VALUE              SOURCE
rpool/newdata  groupquota@scooby  -                  -
rpool/newdata  groupused@scooby   -   
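The mkfile results in the quota test come down to simple arithmetic: each 500M file is 524288000 bytes, so two of them fit under the 1g group quota and four under the 2g user quota. A sketch of that calculation follows; files_that_fit is a name made up for this example, and it ignores metadata overhead and the asynchronous accounting seen earlier.

```shell
#!/bin/sh
# How many whole files of a given size fit under a quota (both in bytes).
files_that_fit() {
    quota=$1
    file_size=$2
    echo $(( quota / file_size ))
}

MB=$(( 1024 * 1024 ))
GB=$(( 1024 * MB ))

files_that_fit $(( 1 * GB )) $(( 500 * MB ))   # group quota of 1g
files_that_fit $(( 2 * GB )) $(( 500 * MB ))   # user quota of 2g
```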
New space usage properties: Four new usage properties have been added to ZFS file systems.
# zfs get all datapool | grep used
datapool  used                  5.39G                  -
datapool  usedbysnapshots       19K                    -
datapool  usedbydataset         26K                    -
datapool  usedbychildren        5.39G                  -
datapool  usedbyrefreservation  0                      -


These new properties can also be viewed in a nice tabular form using zfs list -o space.
# zfs list -r -o space datapool
NAME           AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
datapool        480M  5.39G       19K     26K              0      5.39G
datapool@now       -    19K         -       -              -          -
datapool/fs1    480M   400M         0    400M              0          0
datapool/fs2   1.47G  1.00G         0   1.00G              0          0
datapool/fs3    480M    21K         0     21K              0          0
datapool/fs4   2.47G      0         0       0              0          0
datapool/vol1  1.47G     1G         0     16K          1024M          0
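The USED column should be the sum of the four usedby values. Here is a rough way to check that for one row of the table; sum_sizes is a name made up for this example, and since zfs prints rounded figures the total will only approximately match USED.

```shell
#!/bin/sh
# Convert zfs-style sizes (0, 19K, 400M, 1.00G) to bytes and add them up.
sum_sizes() {
    for size in "$@"; do echo "$size"; done | awk '
        {
            value = $1 + 0                   # numeric prefix, e.g. 5.39 from 5.39G
            unit = substr($1, length($1))    # last character, e.g. G
            if (unit == "K") value *= 1024
            else if (unit == "M") value *= 1024 * 1024
            else if (unit == "G") value *= 1024 * 1024 * 1024
            else if (unit == "T") value *= 1024 * 1024 * 1024 * 1024
            total += value
        }
        END { printf "%.0f\n", total }'
}

# datapool/vol1 row: USEDSNAP USEDDS USEDREFRESERV USEDCHILD
sum_sizes 0 16K 1024M 0
```

For datapool/vol1 this comes to 1G plus 16K, which zfs list displays as 1G.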

Miscellaneous

Support for 2TB boot disks: Solaris 10 10/09 supports a disk Volume Table of Contents (VTOC) of up to 2TB in size. The previous maximum VTOC size was 1TB. On x86 systems you must be running Solaris with a 64-bit kernel and have at least 1GB of memory to use a VTOC larger than 1TB.

pcitool: A new command for Solaris that can assign interrupts to specific threads or display the current interrupt routing. This command is available for both SPARC and x86.

New iSCSI initiator SMF service: svc:/network/iscsi/initiator:default is a new Service Management Facility (SMF) service to control discovery and enumeration of iSCSI devices early in the boot process. Other boot services that may require iSCSI services can add dependencies to ensure that the devices are available before they are needed.

Device Drivers

The following device drivers are either new to Solaris or have had some new features or chipsets added.

Open Source Software Updates

The following open source packages have been updated for Solaris 10 10/09.

For more information

A complete list of new features and changes can be found in the Solaris 10 10/09 Release Notes and the What's New in Solaris 10 10/09 documentation at docs.sun.com.


Oct 09 2009, 11:13:18 PM CDT

20090521 Thursday May 21, 2009
Getting Rid of Pesky Live Upgrade Boot Environments
As we discussed earlier, Live Upgrade can solve most of the problems associated with patching and upgrading your Solaris system. I'm not quite ready to post the next installment in the LU series, but from some of the comments and email I have received, there are two problems that I would like to help you work around.

Oh where oh where did that file system go ?

One thing you can do to stop Live Upgrade in its tracks is to remove a file system that it thinks another boot environment needs. This does fall into the category of user error, but you are more likely to run into this in a ZFS world where file systems can be created and destroyed with great ease. You will also run into a variant of this if you change your zone configurations without recreating your boot environment, but I'll save that for a later day.

Here is our simple test case:
  1. Create a ZFS file system.
  2. Create a new boot environment.
  3. Delete the ZFS file system.
  4. Watch Live Upgrade fail.

# zfs create arrakis/temp

# lucreate -n test
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment <s10u7-baseline> file systems with the
file system(s) you specified for the new boot environment. Determining
which file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <test>.
Source boot environment is <s10u7-baseline>.
Creating boot environment <test>.
Cloning file systems from boot environment <s10u7-baseline> to create boot environment <test>.
Creating snapshot for <rpool/ROOT/s10u7-baseline> on <rpool/ROOT/s10u7-baseline@test>.
Creating clone for <rpool/ROOT/s10u7-baseline@test> on <rpool/ROOT/test>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/test>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <s10u6_baseline> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <test> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <nv114> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <route66> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <nv95> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <test> in GRUB menu
Population of boot environment <test> successful.
Creation of boot environment <test> successful.

# zfs destroy arrakis/temp

# luupgrade -t -s /export/patches/10_x86_Recommended-2009-05-14  -O "-d" -n test
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
Validating the contents of the media </export/patches/10_x86_Recommended-2009-05-14>.
The media contains 143 software patches that can be added.
All 143 patches will be added because you did not specify any specific patches to add.
Mounting the BE <test>.
ERROR: Read-only file system: cannot create mount point </.alt.tmp.b-59c.mnt/arrakis/temp>
ERROR: failed to create mount point </.alt.tmp.b-59c.mnt/arrakis/temp> for file system </arrakis/temp>
ERROR: unmounting partially mounted boot environment file systems
ERROR: cannot mount boot environment by icf file </etc/lu/ICF.5>
ERROR: Unable to mount ABE <test>: cannot complete lumk_iconf
Adding patches to the BE <test>.
Validating patches...

Loading patches installed on the system...

Cannot check name /a/var/sadm/pkg.
Unmounting the BE <test>.
The patch add to the BE <test> failed (with result code <1>).
The proper Live Upgrade solution to this problem would be to destroy and recreate the boot environment, or just recreate the missing file system (I'm sure that most of you have figured the latter part out on your own). The rationale is that the alternate boot environment no longer matches the storage configuration of its source. This was fine in a UFS world, but perhaps a bit constraining when ZFS rules the landscape. What if you really wanted the file system to be gone forever ?

With a little more understanding of the internals of Live Upgrade, we can fix this rather easily.

Important note: We are about to modify undocumented Live Upgrade configuration files. The formats, names, and contents are subject to change without notice and any errors made while doing this can render your Live Upgrade configuration unusable.

The file system configurations for each boot environment are kept in a set of Internal Configuration Files (ICF) in /etc/lu named ICF.n, where n is the boot environment number. From the error message above we see that /etc/lu/ICF.5 is the one that is causing the problem. Let's take a look.
# cat /etc/lu/ICF.5
test:-:/dev/dsk/c5d0s1:swap:4225095
test:-:/dev/zvol/dsk/rpool/swap:swap:8435712
test:/:rpool/ROOT/test:zfs:0
test:/archives:/dev/dsk/c1t0d0s2:ufs:327645675
test:/arrakis:arrakis:zfs:0
test:/arrakis/misc:arrakis/misc:zfs:0
test:/arrakis/misc2:arrakis/misc2:zfs:0
test:/arrakis/stuff:arrakis/stuff:zfs:0

test:/arrakis/temp:arrakis/temp:zfs:0

test:/audio:arrakis/audio:zfs:0
test:/backups:arrakis/backups:zfs:0
test:/export:arrakis/export:zfs:0
test:/export/home:arrakis/home:zfs:0
test:/export/iso:arrakis/iso:zfs:0
test:/export/linux:arrakis/linux:zfs:0
test:/rpool:rpool:zfs:0
test:/rpool/ROOT:rpool/ROOT:zfs:0
test:/usr/local:arrakis/local:zfs:0
test:/vbox:arrakis/vbox:zfs:0
test:/vbox/fedora8:arrakis/vbox/fedora8:zfs:0
test:/video:arrakis/video:zfs:0
test:/workshop:arrakis/workshop:zfs:0
test:/xp:/dev/dsk/c2d0s7:ufs:70396830
test:/xvm:arrakis/xvm:zfs:0
test:/xvm/fedora8:arrakis/xvm/fedora8:zfs:0
test:/xvm/newfs:arrakis/xvm/newfs:zfs:0
test:/xvm/nv113:arrakis/xvm/nv113:zfs:0
test:/xvm/opensolaris:arrakis/xvm/opensolaris:zfs:0
test:/xvm/s10u5:arrakis/xvm/s10u5:zfs:0
test:/xvm/ub710:arrakis/xvm/ub710:zfs:0
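Judging from the listing, each ICF line appears to hold five colon-separated fields: boot environment name, mount point, device or dataset, file system type, and size. That layout is my reading of the output, not a documented format. Under that assumption, a sketch like the following pulls out the ZFS datasets a boot environment expects; icf_zfs_datasets is a name made up for this example, and it works on a scratch copy rather than the live file.

```shell
#!/bin/sh
# List the ZFS datasets named in an ICF-style file, one per line.
# Assumed field order: BE name, mount point, device or dataset, fs type, size.
icf_zfs_datasets() {
    awk -F: 'NF >= 5 && $4 == "zfs" { print $3 }' "$1"
}

# A few lines copied from the listing above, written to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
test:-:/dev/dsk/c5d0s1:swap:4225095
test:/:rpool/ROOT/test:zfs:0
test:/archives:/dev/dsk/c1t0d0s2:ufs:327645675
test:/arrakis/temp:arrakis/temp:zfs:0
EOF

icf_zfs_datasets "$sample"
rm -f "$sample"
```

Comparing this list against the output of zfs list -H -o name would show which expected datasets no longer exist.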
The first step is to clean up the mess left by the failing luupgrade attempt. At the very least we will need to unmount the alternate boot environment root. It is also very likely that we will have to unmount a few temporary directories, such as /tmp and /var/run. Since this is ZFS we will also have to remove the directories created when these file systems were mounted.
# df -k | tail -3
rpool/ROOT/test      49545216 6879597 7546183    48%    /.alt.tmp.b-Fx.mnt
swap                 4695136       0 4695136     0%    /a/var/run
swap                 4695136       0 4695136     0%    /a/tmp

# luumount test
# umount /a/var/run
# umount /a/tmp
# rmdir /a/var/run /a/var /a/tmp
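If there are more than a couple of leftover mounts, the umount list can be generated from the df output instead of typed by hand. This is a sketch that only prints the commands; leftover_umounts is a name made up for this example, and it assumes the mount point is the last column and contains no spaces.

```shell
#!/bin/sh
# Print umount commands for leftover mounts under a given prefix,
# deepest path first so children are unmounted before parents.
leftover_umounts() {
    prefix=$1
    awk -v p="$prefix" 'index($NF, p) == 1 { print $NF }' |
        awk '{ print gsub("/", "/") " " $0 }' |
        sort -rn |
        awk '{ print "umount " $2 }'
}

# Sample input: the df -k lines shown above.
leftover_umounts /a/ <<'EOF'
rpool/ROOT/test      49545216 6879597 7546183    48%    /.alt.tmp.b-Fx.mnt
swap                 4695136       0 4695136     0%    /a/var/run
swap                 4695136       0 4695136     0%    /a/tmp
EOF
```

Review the printed commands before running any of them; the boot environment root itself is still best unmounted with luumount.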

Next we need to remove the missing file system entry from the current copy of the ICF file. Use whatever method you prefer (vi, perl, grep). Once we have corrected our local copy of the ICF file we must propagate it to the alternate boot environment we are about to patch. You can skip the propagation if you are going to delete the boot environment without doing any other maintenance activities. The normal Live Upgrade operations will take care of propagating the ICF files to the other boot environments, so we should not have to worry about them at this time.
# mv /etc/lu/ICF.5 /tmp/ICF.5
# grep -v arrakis/temp /tmp/ICF.5 > /etc/lu/ICF.5 
# cp /etc/lu/ICF.5 `lumount test`/etc/lu/ICF.5
# luumount test
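One caution about the grep -v used above: it drops every line containing the string, so a dataset named arrakis/temp2 would be removed as well. A stricter filter compares the dataset field exactly. This is a sketch under the assumption that the dataset is the third colon-separated field; icf_without_dataset is a name made up for this example, and it is shown against a scratch file.

```shell
#!/bin/sh
# Copy an ICF-style file to stdout, dropping only the entry whose dataset
# field (assumed to be the third colon-separated field) equals the name given.
icf_without_dataset() {
    awk -F: -v ds="$2" '$3 != ds' "$1"
}

# Scratch file with one entry to remove and a similarly named one to keep.
# arrakis/temp2 is a hypothetical dataset used only for this illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
test:/arrakis/stuff:arrakis/stuff:zfs:0
test:/arrakis/temp:arrakis/temp:zfs:0
test:/arrakis/temp2:arrakis/temp2:zfs:0
EOF

icf_without_dataset "$sample" arrakis/temp
rm -f "$sample"
```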
At this point we should be good to go. Let's try the luupgrade again.
# luupgrade -t -n test -O "-d" -s /export/patches/10_x86_Recommended-2009-05-14
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
Validating the contents of the media </export/patches/10_x86_Recommended-2009-05-14>.
The media contains 143 software patches that can be added.
All 143 patches will be added because you did not specify any specific patches to add.
Mounting the BE <test>.
Adding patches to the BE <test>.
Validating patches...

Loading patches installed on the system...

Done!

Loading patches requested to install.

Approved patches will be installed in this order:

118668-19 118669-19 119214-19 123591-10 123896-10 125556-03 139100-02


Checking installed patches...
Verifying sufficient filesystem capacity (dry run method)...
Installing patch packages...

Patch 118668-19 has been successfully installed.
Patch 118669-19 has been successfully installed.
Patch 119214-19 has been successfully installed.
Patch 123591-10 has been successfully installed.
Patch 123896-10 has been successfully installed.
Patch 125556-03 has been successfully installed.
Patch 139100-02 has been successfully installed.

Unmounting the BE <test>.
The patch add to the BE <test> completed.
Now that the alternate boot environment has been patched, we can activate it at our convenience.

I keep deleting and deleting and still can't get rid of those pesky boot environments

This is an interesting corner case where the Live Upgrade configuration files get so scrambled that even simple tasks like deleting a boot environment are not possible. Every time I have gotten myself into this situation I can trace it back to some ill-advised shortcut that seemed harmless at the time, but I won't rule out bugs and environment as possible causes.

Here is our simple test case: turn our boot environment from the previous example into a zombie - something that is neither alive nor dead but just takes up space and causes a mild annoyance.

Important note: Don't try this on a production system. This is for demonstration purposes only.
# dd if=/dev/random of=/etc/lu/ICF.5 bs=2048 count=2
0+2 records in
0+2 records out

# ludelete -f test
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
ERROR: The mount point </.alt.tmp.b-fxc.mnt> is not a valid ABE mount point (no /etc directory found).
ERROR: The mount point </.alt.tmp.b-fxc.mnt> provided by the <-m> option is not a valid ABE mount point.
Usage: lurootspec [-l error_log] [-o outfile] [-m mntpt]
ERROR: Cannot determine root specification for BE <test>.
ERROR: boot environment <test> is not mounted
Unable to delete boot environment.
Our first task is to make sure that any partially mounted boot environment is cleaned up. A df should help us here.
# df -k | tail -5
arrakis/xvm/opensolaris 350945280      19 17448377     1%    /xvm/opensolaris
arrakis/xvm/s10u5    350945280      19 17448377     1%    /xvm/s10u5
arrakis/xvm/ub710    350945280      19 17448377     1%    /xvm/ub710
swap                 4549680       0 4549680     0%    /.alt.tmp.b-fxc.mnt/var/run
swap                 4549680       0 4549680     0%    /.alt.tmp.b-fxc.mnt/tmp


# umount /.alt.tmp.b-fxc.mnt/tmp
# umount /.alt.tmp.b-fxc.mnt/var/run
Ordinarily you would use lufslist(1M) to try to determine which file systems are in use by the boot environment you are trying to delete. In this worst case scenario that is not possible. A bit of forensic investigation and a bit more courage will help us figure this out.

The first place we will look is /etc/lutab. This is the configuration file that lists all boot environments known to Live Upgrade. There is a man page for this in section 4, so it is somewhat of a public interface but please take note of the warning
 
        The lutab file must not be edited by hand. Any user  modifi-
        cation  to  this file will result in the incorrect operation
        of the Live Upgrade feature.
This is very good advice and failing to follow it has led to some of my most spectacular Live Upgrade meltdowns. But in this case Live Upgrade is already broken and it may be possible to undo the damage and restore proper operation. So let's see what we can find out.
# cat /etc/lutab
# DO NOT EDIT THIS FILE BY HAND. This file is not a public interface.
# The format and contents of this file are subject to change.
# Any user modification to this file may result in the incorrect
# operation of Live Upgrade.
3:s10u5_baseline:C:0
3:/:/dev/dsk/c2d0s0:1
3:boot-device:/dev/dsk/c2d0s0:2
1:s10u5_lu:C:0
1:/:/dev/dsk/c5d0s0:1
1:boot-device:/dev/dsk/c5d0s0:2
2:s10u6_ufs:C:0
2:/:/dev/dsk/c4d0s0:1
2:boot-device:/dev/dsk/c4d0s0:2
4:s10u6_baseline:C:0
4:/:rpool/ROOT/s10u6_baseline:1
4:boot-device:/dev/dsk/c4d0s3:2
10:route66:C:0
10:/:rpool/ROOT/route66:1
10:boot-device:/dev/dsk/c4d0s3:2
11:nv95:C:0
11:/:rpool/ROOT/nv95:1
11:boot-device:/dev/dsk/c4d0s3:2
6:s10u7-baseline:C:0
6:/:rpool/ROOT/s10u7-baseline:1
6:boot-device:/dev/dsk/c4d0s3:2
7:nv114:C:0
7:/:rpool/ROOT/nv114:1
7:boot-device:/dev/dsk/c4d0s3:2
5:test:C:0
5:/:rpool/ROOT/test:1
5:boot-device:/dev/dsk/c4d0s3:2
We can see that the boot environment named test is (still) BE #5 and has its root file system at rpool/ROOT/test. This is the default dataset name and indicates that the boot environment has not been renamed. Consider the following example for a more complicated configuration.
# lucreate -n scooby
# lufslist scooby | grep ROOT
rpool/ROOT/scooby       zfs            241152 /                   -
rpool/ROOT              zfs       39284664832 /rpool/ROOT         -

# lurename -e scooby -n doo
# lufslist doo | grep ROOT
rpool/ROOT/scooby       zfs            241152 /                   -
rpool/ROOT              zfs       39284664832 /rpool/ROOT         -
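Because a renamed boot environment keeps its original dataset, it is worth looking up the ID and root dataset by name rather than guessing. The sketch below does that against a scratch copy of a few lutab lines. The helper names be_id and be_root are made up for this example, and treating the lines with C in the third field as the per-BE header lines is my inference from the listing, not a documented format.

```shell
#!/bin/sh
# Print the numeric ID of a boot environment from a lutab-style file.
# Assumption: the line carrying the BE name has "C" in its third field.
be_id() {
    awk -F: -v name="$2" '$2 == name && $3 == "C" { print $1 }' "$1"
}

# Print the root dataset or device recorded for a given BE ID.
be_root() {
    awk -F: -v id="$2" '$1 == id && $2 == "/" { print $3 }' "$1"
}

# A few lines copied from the lutab listing above, in a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
11:nv95:C:0
11:/:rpool/ROOT/nv95:1
11:boot-device:/dev/dsk/c4d0s3:2
5:test:C:0
5:/:rpool/ROOT/test:1
5:boot-device:/dev/dsk/c4d0s3:2
EOF

id=$(be_id "$sample" test)
echo "BE test has ID $id and root $(be_root "$sample" "$id")"
rm -f "$sample"
```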
The point is that we have to trust the contents of /etc/lutab but it does not hurt to do a bit of sanity checking before we start deleting ZFS datasets. To remove boot environment test from the view of Live Upgrade, delete the three lines in /etc/lutab starting with 5 (in this example). We should also remove its Internal Configuration File (ICF) /etc/lu/ICF.5
# mv -f /etc/lutab /etc/lutab.old
# grep -v ^5: /etc/lutab.old > /etc/lutab
# rm -f /etc/lu/ICF.5

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
s10u5_baseline             yes      no     no        yes    -         
s10u5_lu                   yes      no     no        yes    -         
s10u6_ufs                  yes      no     no        yes    -         
s10u6_baseline             yes      no     no        yes    -         
route66                    yes      no     no        yes    -         
nv95                       yes      yes    yes       no     -         
s10u7-baseline             yes      no     no        yes    -         
nv114                      yes      no     no        yes    -         
If the boot environment being deleted is in UFS then we are done. Well, not exactly - but pretty close. We still need to propagate the updated configuration files to the remaining boot environments. This will be done during the next Live Upgrade operation (lucreate, lumake, ludelete, luactivate) and I would recommend that you let Live Upgrade handle this part. The exception to this will be if you boot directly into another boot environment without activating it first. This isn't a recommended practice and has been the source of some of my most frustrating mistakes.

If the exorcised boot environment is in ZFS then we still have a little bit of work to do. We need to delete the old root datasets and any snapshots that they may have been cloned from. In our example the root dataset was rpool/ROOT/test. We need to look for any children as well as the originating snapshot, if present.
# zfs list -r rpool/ROOT/test
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/test       234K  6.47G  8.79G  /.alt.test
rpool/ROOT/test/var    18K  6.47G    18K  /.alt.test/var

# zfs get -r origin rpool/ROOT/test
NAME                 PROPERTY  VALUE                     SOURCE
rpool/ROOT/test      origin    rpool/ROOT/nv95@test      -
rpool/ROOT/test/var  origin    rpool/ROOT/nv95/var@test  -

# zfs destroy rpool/ROOT/test/var
# zfs destroy rpool/ROOT/nv95/var@test
# zfs destroy rpool/ROOT/test
# zfs destroy rpool/ROOT/nv95@test
Important note: luactivate will promote the newly activated root dataset so that snapshots used to create alternate boot environments should be easy to delete. If you are switching between boot environments without activating them first (which I have already warned you about doing), you may have to manually promote a different dataset so that the snapshots can be deleted.
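The order of the zfs destroy commands matters: child clones go before their parents, and each clone goes before the snapshot it came from. Here is a sketch that derives that order from dataset and origin pairs and only prints the commands. destroy_plan is a name made up for this example, and it assumes none of the snapshots have other clones depending on them.

```shell
#!/bin/sh
# Read "dataset origin" pairs on stdin and print zfs destroy commands:
# deepest clone first, each followed by the snapshot it was cloned from.
destroy_plan() {
    awk '{ print gsub("/", "/", $1) " " $1 " " $2 }' |
        sort -rn |
        awk '{
            print "zfs destroy " $2
            if ($3 != "-" && $3 != "") print "zfs destroy " $3
        }'
}

# Pairs taken from the zfs get -r origin output above.
destroy_plan <<'EOF'
rpool/ROOT/test rpool/ROOT/nv95@test
rpool/ROOT/test/var rpool/ROOT/nv95/var@test
EOF
```

On a live system the pairs could come from zfs get -H -r -o name,value origin rpool/ROOT/test, but check the printed plan by eye before running anything.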

To BE or not to BE - how about no BE ?

You may find yourself in a situation where you have things so scrambled up that you want to start all over again. We can use what we have just learned to unwind Live Upgrade and start from a clean configuration. Specifically we want to delete /etc/lutab, the ICF and related files, all of the temporary files in /etc/lu/tmp and a few files that hold environment variables for some of the lu scripts. And if using ZFS we will also have to delete any datasets and snapshots that are no longer needed.
 
# rm -f /etc/lutab 
# rm -f /etc/lu/ICF.* /etc/lu/INODE.* /etc/lu/vtoc.*
# rm -f /etc/lu/.??*
# rm -f /etc/lu/tmp/* 

# lustatus
ERROR: No boot environments are configured on this system
ERROR: cannot determine list of all boot environment names

# lucreate -c scooby -n doo
Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <scooby>.
Creating initial configuration for primary boot environment <scooby>.
The device </dev/dsk/c4d0s3> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <scooby> PBE Boot Device </dev/dsk/c4d0s3>.
Comparing source boot environment <scooby> file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <doo>.
Source boot environment is <scooby>.
Creating boot environment <doo>.
Cloning file systems from boot environment <scooby> to create boot environment <doo>.
Creating snapshot for <rpool/ROOT/scooby> on <rpool/ROOT/scooby@doo>.
Creating clone for <rpool/ROOT/scooby@doo> on <rpool/ROOT/doo>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/doo>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <doo> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <doo> in GRUB menu
Population of boot environment <doo> successful.
Creation of boot environment <doo> successful.

# luactivate doo
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE 

File  deletion successful
File  deletion successful
File  deletion successful
Activation of boot environment  successful.

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
scooby                     yes      yes    no        no     -         
doo                        yes      no     yes       no     -        
Pretty cool, eh ?
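If you want to check the result from a script rather than by eye, the "Active On Reboot" column is the fourth field of the lustatus listing. A small sketch, assuming the column layout shown above. The function name be_active_on_reboot is hypothetical.

```shell
# Print the boot environment that will be active after the next reboot.
# stdin: lustatus output in the column layout shown above.
# Header lines are skipped because their fourth field is never "yes".
be_active_on_reboot() {
    awk 'NF >= 5 && $4 == "yes" { print $1 }'
}

# Example (not run here):
#   lustatus | be_active_on_reboot
```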

There are still a few more interesting corner cases, but we will deal with those in one of the next articles. In the meantime, please remember to

May 21 2009, 11:58:42 AM CDT

20090324 Tuesday March 24, 2009
Adobe releases an x86 version of Acroread 9.1 for Solaris
Great Googly Moogly!!! Our friends at Adobe have finally released a new x86 version of Acroread for Solaris. Download Acroread 9.1 from Adobe.com and say goodbye to evince, xpdf, and the especially interesting Acroread out of the Linux branded zone trick.

Mar 24 2009, 04:16:48 PM CDT

20090323 Monday March 23, 2009
Dr. Live Upgrade - Or How I Learned to Stop Worrying and Love Solaris Patching
Who loves to patch or upgrade a system ?

That's right, nobody. Or if you do perhaps we should start a local support group to help you come to terms with this unusual fascination. Patching, and to a lesser extent upgrades (which can be thought of as patches delivered more efficiently through package replacement), is the most common complaint that I hear when meeting with system administrators and their management.

Most of the difficulties seem to fit into one of the following categories.
  1. Analysis: What patches need to be applied to my system ?
  2. Effort: What does it take to perform the required maintenance ?
  3. Outage: How long will my system be down for the maintenance?
  4. Recovery: What happens when something goes wrong?
And if a single system gives you a headache, adding a few containers into the mix will bring on a full migraine. Without some relief you may be left with the impression that containers aren't worth the effort. That's unfortunate because containers don't have to be troublesome and patching doesn't have to be hard. But it does take getting to know one of the most important and sadly least used features in Solaris: Live Upgrade.

Before we look at Live Upgrade, let's start with a definition. A boot environment is the set of all file systems and devices that are unique to an instance of Solaris on a system. If you have several boot environments then some data will be shared (non-SVR4 package installed applications, data, local home directories) and some will be exclusive to one boot environment. Not making this more complicated than it needs to be, a boot environment is generally your root (including /usr and /etc), /var (frequently split out on a separate file system), and /opt. Swap may or may not be a part of a boot environment - it is your choice. I prefer to share swap, but there are some operational situations where this may not be feasible. There may be additional items, but generally everything else is shared. Network mounted file systems and removable media are assumed to be shared.

With this definition behind us, let's proceed.

Analysis: What patches need to be applied to my system ?

For all of the assistance that Live Upgrade offers, it doesn't do anything to help with the analysis phase. Fortunately there are plenty of tools that can help with this phase. Some of them work nicely with Live Upgrade, others take a bit more effort.

smpatch(1M) has an analyze capability that can determine which patches need to be applied to your system. It will get a list of patches from an update server, most likely one at Sun, and match up the dependencies and requirements with your system. smpatch can be used to download these patches for future application or it can apply them for you. smpatch works nicely with Live Upgrade, so from a single command you can upgrade an alternate boot environment. With containers!

The Sun Update Manager is a simple to use graphical front end for smpatch. It gives you a little more flexibility during the inspection phase by allowing you to look at individual patch README files. It is also much easier to see what collection a patch belongs to (recommended, security, none) and if the application of that patch will require a reboot. For all of that additional flexibility you lose the integration with Live Upgrade. Not for lack of trying, but I have not found a good way to make Update Manager and Live Upgrade play together.

Sun xVM Ops Center has a much more sophisticated patch analysis system that uses additional knowledge engines beyond those used by smpatch and Update Manager. The result is a higher quality patch bundle tailored for each individual system, automated deployment of the patch bundle, detailed auditing of what was done and simple backout should problems occur. And it basically does the same for Windows and Linux. It is this last feature that makes things interesting. Neither Windows nor Linux have anything like Live Upgrade and the least common denominator approach of Ops Center in its current state means that it doesn't work with Live Upgrade. Fortunately this will change in the not too distant future, and when it does I will be shouting about this feature from rooftops (OK, what I really mean is I'll post a blog and a tweet about it). If I can coax Ops Center into doing the analysis and download pieces then I can manually bolt it onto Live Upgrade for a best of both worlds solution.

These are our offerings and there are others. Some of them are quite good and in use in many places. Patch Check Advanced (PCA) is one of the more common tools in use. It operates on a patch dependency cross reference file and does a good job with the dependency analysis (this is obsoleted by that, etc). It can be used to maintain an alternate boot environment and in simple cases that would be fine. If the alternate boot environment contains any containers then I would use Live Upgrade's luupgrade instead of PCA's patchadd -R approach. If I was familiar with PCA then I would still use it for the analysis and download feature. Just let luupgrade apply the patches. You might have to uncompress the patches downloaded by PCA before handing them over to luupgrade, but that is a minor implementation detail.
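To make the hand-off from PCA to luupgrade a little more concrete, here is a sketch that prints (but does not run) the commands involved. It assumes the downloads are zip files named after their patch IDs and sitting in one directory. The function name plan_lu_patch is made up, and the boot environment name and directory in the example are placeholders.

```shell
# Print the commands to unpack downloaded patches and apply them to an
# alternate boot environment with luupgrade -t.
# $1 = alternate boot environment name, $2 = directory of <patchid>.zip files.
# plan_lu_patch is a hypothetical helper; nothing is executed.
plan_lu_patch() {
    be="$1"; dir="$2"; ids=""
    for z in "$dir"/*.zip; do
        [ -e "$z" ] || continue
        id=$(basename "$z" .zip)
        echo "unzip -q -o $z -d $dir"
        ids="$ids $id"
    done
    [ -n "$ids" ] && echo "luupgrade -t -n $be -s $dir$ids"
    return 0
}

# Example (not run here):
#   plan_lu_patch patched /var/tmp/patches
```

Review the printed commands before running them. The sketch lists patches in file name order and does not attempt any dependency ordering of its own.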

In summary, use an analysis tool appropriate to the task (based on familiarity, budget and complexity) to figure out what patches are needed. Then use Live Upgrade (luupgrade) to deploy the desired patches.

Effort: What does it take to perform the required maintenance ?

This is a big topic and I could write pages on the subject. Even if I use an analysis tool like smpatch or pca to save me hours of trolling through READMEs drawing dependency graphs, there is still a lot of work to do in order to survive the ordeal of applying patches. Some of the more common techniques include ....
Backing up your boot environment.
I should not have to mention this, but there are some operational considerations unique to system maintenance. Even though tiny, there is a greater chance that you will render your system non-bootable during system maintenance than any other operational task. Even with mature processes, human factors can come into play and bad things can happen (oops - that was my fallback boot environment that I just ran newfs(1M) on).

This is why automation and time tested scripting become so important. Should you do the unthinkable and render a system nonfunctional, rapid restoration of the boot environment is important. And getting it back to the last known good state is just as important. A fresh backup that can be restored by utilities from install media or jumpstart miniroot is a very good idea. Flash archives (see flarcreate(1M)) are even better, although complications with containers make this less interesting now than in previous releases of Solaris. How many of you take a backup before applying patches ? Probably about the same number as replace batteries in your RAID controllers or change out your UPS systems after their expiration date.

Split Mirrors
One interesting technique is to split mirrors instead of backups. Of course this only works if you mirror your boot environment (a recommended practice for those systems with adequate disk space). Break your mirror, apply patches to the non-running half, cut over the updated boot environment during the next maintenance window and see how this goes. At first glance this seems like a good idea, but there are two catches.
  1. Do you synchronize dynamic boot environment elements ? Things like /etc/passwd, /etc/shadow, /var/adm/messages, print and mail queues are constantly changing. It is possible that these have changed between the mirror split and subsequent activation.
  2. How long are you willing to run without your boot environment being mirrored ? This may cause you to certify the new boot environment too quickly. You want to reestablish your mirror, but if that is your fallback in case of trouble you have a conundrum. And if you are the sort that seems to have a black cloud following you through life, you will discover a problem shortly after you started the mirror resync.
Pez disks ?
OK, the mirror split thing can be solved by swinging in another disk. Operationally a bit more complex and you have at least one disk that you can't use for other purposes (like hosting a few containers), but it can be done. I wouldn't do it (mainly because I know where this story is heading) but many of you do.
Better living through Live Upgrade
Everything we do to try to make it better adds complexity, or another hundred lines of scripting. It doesn't need to be this way, and if you become one with the LU commands it won't for you either. Live Upgrade will take care of building and updating multiple boot environments. It will check to make sure the disks being used are bootable and not part of another boot environment. It works with the Solaris Volume Manager, Veritas encapsulated root devices, and starting with Solaris 10 10/08 (update 6) ZFS. It also takes care of the synchronization problem. Starting with Solaris 10 8/07 (update 4), Live Upgrade also works with containers, both native and branded (and with Solaris 10 10/08 your zoneroots can be in a ZFS pool).

Outage: How long will my system be down for the maintenance?

Or perhaps more to the point, how long will my applications be unavailable ? The proper reply is it depends on how big the patch bundle is and how many containers you have. And if a kernel patch is involved, double or triple your estimate. This can be a big problem and cause you to take short cuts like only install some patches now and others later when it is more convenient. Our good friend Bart Smaalders has a nice discussion on the implications of this approach and what we are doing in OpenSolaris to solve this. That solution will eventually work its way into the Next Solaris, but in the mean time we have a problem to solve.

There is a large set (not really large, but more than one) of patches that require a quiescent system to be properly applied. An example would be a kernel patch that causes a change to libc. It is sort of hard to rip out libc on a running system (new processes get the new libc but may have issues with the running kernel, old processes get the old libc and tend to be fine, until they do a fork(2) and exec(2)). So we developed a brilliant solution to this problem - deferred activation patching. If you apply one of these troublesome patches then we will throw it in a queue to be applied the next time the system is quiesced (a fancy term for the next time we're in single user mode). This solves the current system stability concerns but may make the next reboot take a bit longer. And if you forgot you have deferred patches in your queue, don't get anxious and interrupt the shutdown or next boot. Grab a noncaffeinated beverage and put some Bobby McFerrin on your iPod. Don't Worry, Be Happy.

So deferred activation patching seems like a good way to deal with the situation where everything goes well. And some brilliant engineers are working on applying patches in parallel (where applicable) which will make this even better. But what happens when things go wrong ? This is when you realize that patchrm(1M) is not your friend. It has never been your friend, nor will it ever be. I have an almost paralyzing fear of dentists, but would rather visit one than start down a path where patchrm is involved. Well tested tools and some automation can reduce this to simple anxiety, but if I could eliminate patchrm altogether I would be much happier.

For all that Live Upgrade can do to ease system maintenance, it is the areas of outage and recovery that make it special. And when speaking about Solaris, either in training or evangelism events, this is why I urge attendees to drop whatever they are doing and adopt Live Upgrade immediately.

Since Live Upgrade (lucreate, lumake, luupgrade) operates on an alternate boot environment, the currently running set of applications are not affected. The system stays up, applications stay running and nothing is changing underneath them so there is no cause for concern. The only impact is some additional load by the live upgrade operations. If that is a concern then run live upgrade in a project and cap resource consumption to that project.
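As a rough illustration of the project idea, the sketch below prints the two commands involved: one to create a project with a CPU cap and one to start a Live Upgrade command inside it. It assumes a release that supports the project.cpu-cap resource control. The function name plan_capped_lu, the project name lupatch and the 50 percent figure are all placeholders of my choosing.

```shell
# Print (do not run) the commands to create a CPU-capped project and
# run a command in it with newtask.
# $1 = project name, $2 = cap as a percentage of one CPU,
# remaining arguments = the Live Upgrade command line to run.
# plan_capped_lu is a hypothetical helper name.
plan_capped_lu() {
    project="$1"; cap="$2"; shift 2
    echo "projadd -K 'project.cpu-cap=(privileged,$cap,deny)' $project"
    echo "newtask -p $project $*"
}

# Example (not run here):
#   plan_capped_lu lupatch 50 luupgrade -t -n patched -s /var/tmp/patches
```

Treat the printed lines as a starting point and check them against projadd(1M) and newtask(1) on your release before using them.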

An interesting implication of Live Upgrade is that the operational sanity of each step is no longer required. All that matters is the end state. This gives us more freedom to apply patches in a more efficient fashion than would be possible on a running boot environment. This is especially noticeable on a system with containers. The time that the upgrade runs is significantly reduced, and all the while applications are running. No more deferred activation patches, no more single user mode patching. And if all goes poorly after activating the new boot environment you still have your old one to fall back on. Queue Bobby McFerrin for another round of "Don't Worry, Be Happy".

This brings up another feature of Live Upgrade - the synchronization of system files in flight between boot environments. After a boot environment is activated, a synchronization process is queued as a K0 script to be run during shutdown. Live Upgrade will catch a lot of private files that we know about and the obvious public ones (/etc/passwd, /etc/shadow, /var/adm/messages, mail queues). It also provides a place (/etc/lu/synclist) for you to include things we might not have thought about or are unique to your applications.
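Adding your own files to the list is a one line edit per file. Here is a small sketch that appends an entry only if the path is not already listed, assuming the usual synclist format of a path followed by an action keyword (OVERWRITE, APPEND or PREPEND). The function name add_sync_entry and the application path in the example are hypothetical.

```shell
# Append "<path> <action>" to a synclist file unless the path is
# already present. add_sync_entry is a hypothetical helper name.
# $1 = synclist file (for example /etc/lu/synclist)
# $2 = path to synchronize
# $3 = action, defaults to OVERWRITE
add_sync_entry() {
    list="$1"; path="$2"; action="${3:-OVERWRITE}"
    case "$action" in
        OVERWRITE|APPEND|PREPEND) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    # Skip the append when the first field of any line equals the path.
    if [ -f "$list" ] && \
       awk -v p="$path" '$1 == p { found = 1 } END { exit !found }' "$list"; then
        return 0
    fi
    printf '%s\t%s\n' "$path" "$action" >> "$list"
}

# Example (not run here):
#   add_sync_entry /etc/lu/synclist /opt/myapp/etc/app.conf OVERWRITE
```

Keeping the edit idempotent means the same setup script can be run on every boot environment without piling up duplicate entries.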

When using Live Upgrade, applications are only unavailable for the amount of time it takes to shut down the system (the synchronization process) and boot the new boot environment. This may include some minor SMF manifest importing but that should not add much to the new boot time. You only have to complete the restart during a maintenance window, not the entire upgrade. While vampires are all the rage for teenagers these days, system administrators can now come out into the light and work regular hours.

Recovery: What happens when something goes wrong?

This is when you will fully appreciate Live Upgrade. After activation of a new boot environment, now called the Primary Boot Environment (PBE), your old boot environment, now called an Alternate Boot Environment (ABE) can still be called upon in case of trouble. Just activate it and shut down the system. Applications will be down for a short period (the K0 sync and subsequent start up), but there will be no more wringing of the hands, reaching for beverages with too much caffeine and vitamin B12, trying to remember where you kept your bottle of Tums. Queue Bobby McFerrin one more time and "Don't Worry, Be Happy". You will be back to your previous operational state in a matter of a few minutes (longer if you have a large server with many disks). Then you can mount up your ABE and troll through the logs trying to determine what went wrong. If you have a service contract then we will troll through the logs with you.
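The fallback itself is only two commands. The sketch below prints them after checking, from lustatus output, that the boot environment you want to fall back to exists and is complete. It prints init 6 rather than reboot because the synchronization runs as part of an orderly shutdown. The function name plan_fallback is made up for this example.

```shell
# Print the commands to fall back to a previous boot environment.
# $1    = name of the boot environment to fall back to
# stdin = lustatus output (column 2 is "Is Complete")
# plan_fallback is a hypothetical helper; nothing is executed.
plan_fallback() {
    target="$1"
    if awk -v be="$target" \
        '$1 == be && $2 == "yes" { ok = 1 } END { exit !ok }'; then
        echo "luactivate $target"
        echo "init 6"
    else
        echo "boot environment $target not found or incomplete" >&2
        return 1
    fi
}

# Example (not run here):
#   lustatus | plan_fallback scooby
```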

I neglected to mention earlier, disks that comprise boot environments can be mirrored, so there is no rush to certification. Everything can be mirrored, at all times. Which is a very good thing. You still need to back up your boot environments, but you will find yourself reaching for the backup media much less often when using Live Upgrade.

All that is left are a few simple examples of how to use Live Upgrade. I'll save that for next time.

Mar 23 2009, 12:52:09 AM CDT

20081111 Tuesday November 11, 2008
OpenSolaris 2008.11 Release Candidate 1B (nv101a) is now available for testing
The initial release candidate (rc1b) for OpenSolaris 2008.11 (based on nv101a) is now available for download and testing. Additional (larger) images are available for non-English locales as well as USB images for faster installs. If you have not played with a USB image you will be dazzled at the speed of the installation. Amazing what happens when you eliminate all those slow seeks.

The new release candidate has quite a few interesting features and updates. The items that caught my attention were

Our own Dan Roberts has more to say on the subject in this video podcast.

Using the graphical package manager it only took a few minutes to set up the installation plan for a nice web based development system including Netbeans, a web stack (including Glassfish), and a Xen based virtualization system.

OpenSolaris 2008.11 is shaping up to be quite a nice release. Now that I have figured out how to make it play nicely in a root zpool with other Solaris releases, I will be spending a lot more time with it as the daily driver.

Download it, play with it, and please remember to file bugs when you run into things that don't work.

Nov 11 2008, 09:14:30 AM CST

20081104 Tuesday November 04, 2008
Solaris and OpenSolaris coexistence in the same root zpool
Some time ago, my buddy Jeff Victor gave us FrankenZone. An idea that is disturbingly brilliant. It has taken me a while, but I offer for your consideration VirtualBox as a V2P platform for OpenSolaris. Nowhere near as brilliant, but at least as unusual. And you know that you have to try this out at home.

Note: This is totally a science experiment. I fully expect to see the two guys from Myth Busters showing up at any moment. It also requires at least build 100 of OpenSolaris on both the host and guest operating system to work around the hostid difficulties.

With the caveats out of the way, let me set the back story to explain how I got here.

Until virtualization technologies become ubiquitous and nothing more than BIOS extensions, multi-boot configurations will continue to be an important capability. And for those working with [Open]Solaris there are several limitations that complicate this unnecessarily. Rather than lamenting these, the possibility of leveraging ZFS root pools, now in Solaris 10 10/08, should offer up some interesting solutions.

What I want to do is simple - have a single Solaris fdisk partition that can have multiple versions of Solaris all bootable with access to all of my data. This doesn't seem like much of a request, but as of yet this has been nearly impossible to accomplish in anything close to a supportable configuration. As it turns out the essential limitation is in the installer - all other issues can be handled if we can figure out how to install OpenSolaris into an existing pool.

What we will do is use our friend VirtualBox to work around the installer issues. After installing OpenSolaris in a virtual machine we take a ZFS snapshot, send it to the bare metal Solaris host and restore it in the root pool. Finally we fix up a few configuration files to make everything work and we will be left with a single root pool that can boot Solaris 10, Solaris Express Community Edition (nevada), and OpenSolaris.

How cool is that :-) Yeah, it is that cool. Let's proceed.

Prepare the host system

The host system is running a fresh install of Solaris 10 10/08 with a single large root zpool. In this example the root zpool is named panroot. There is also a separate zpool that contains data that needs to be preserved in case a re-installation of Solaris is required. That zpool is named pandora, but it doesn't matter - it will be automatically imported in our new OpenSolaris installation if all goes well.
# lustatus 
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
s10u6_baseline             yes      no     no        yes    -         
s10u6                      yes      no     no        yes    -         
nv95                       yes      yes    yes       no     -         
nv101a                     yes      no     no        yes    -    

     
# zpool list
NAME      SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
pandora  64.5G  56.9G  7.61G    88%  ONLINE  -
panroot    40G  26.7G  13.3G    66%  ONLINE  -
One challenge that came up was the less than stellar performance of ssh over the VirtualBox NAT interface. So rather than fight this I set up a shared NFS file system in the root pool to stage the ZFS backup file. This made the process go much faster.

In the host Solaris system
# zfs create -o sharenfs=rw,anon=0 -o mountpoint=/share panroot/share

Prepare the OpenSolaris virtual machine

If you have not already done so, get a copy of VirtualBox, install it and set up a virtual machine for OpenSolaris.

Important note: Do not install the VirtualBox guest additions. This will install some SMF services that will fail when booted on bare metal.

Send a ZFS snapshot to the host OS root zpool

Let's take a look around the freshly installed OpenSolaris system to see what we want to send.

Inside the OpenSolaris virtual machine
bash-3.2$ zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
rpool                   6.13G  9.50G    46K  /rpool
rpool/ROOT              2.56G  9.50G    18K  legacy
rpool/ROOT/opensolaris  2.56G  9.50G  2.49G  /
rpool/dump               511M  9.50G   511M  -
rpool/export            2.57G  9.50G  2.57G  /export
rpool/export/home        604K  9.50G    19K  /export/home
rpool/export/home/bob    585K  9.50G   585K  /export/home/bob
rpool/swap               512M  9.82G   176M  -
My host system root zpool (panroot) already has swap and dump, so these won't be needed. And it also has an /export hierarchy for home directories. I will recreate my OpenSolaris Primary System Administrator user once on bare metal, so it appears the only thing I need to bring over is the root dataset itself.

Inside the OpenSolaris virtual machine
bash-3.2$ pfexec zfs snapshot rpool/ROOT/opensolaris@scooby
bash-3.2$ pfexec zfs send rpool/ROOT/opensolaris@scooby > /net/10.0.2.2/share/scooby.zfs
We are now done with the virtual machine. It can be shut down and the storage reclaimed for other purposes.

Restore the ZFS dataset in the host system root pool

In addition to restoring the OpenSolaris root pool, the canmount property should be set to noauto. I also destroy the NFS shared directory since it will no longer be needed.
# zfs receive panroot/ROOT/scooby < /share/scooby.zfs
# zfs set canmount=noauto panroot/ROOT/scooby
# zfs destroy panroot/share
Now mount the new OpenSolaris root filesystem and fix up a few configuration files. Specifically, copy the host's zpool.cache and hostid files into it. Then rebuild the OpenSolaris boot archive and we will be done with that filesystem.
# zfs set canmount=noauto panroot/ROOT/scooby
# zfs set mountpoint=/mnt panroot/ROOT/scooby
# zfs mount panroot/ROOT/scooby

# cp /etc/zfs/zpool.cache /mnt/etc/zfs
# cp /etc/hostid /mnt/etc/hostid

# bootadm update-archive -f -R /mnt
Creating boot_archive for /mnt
updating /mnt/platform/i86pc/amd64/boot_archive
updating /mnt/platform/i86pc/boot_archive

# umount /mnt
Make a home directory for your OpenSolaris administrator user (in this example the user is named admin). Also add a GRUB stanza so that OpenSolaris can be booted.
# mkdir -p /export/home/admin
# chown admin:admin /export/home/admin
# cat >> /panroot/boot/grub/menu.lst <<'DOO'
title Scooby
root (hd0,3,a)
bootfs panroot/ROOT/scooby
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive
DOO
At this point we are done. Reboot the system and you should see a new GRUB stanza for our new OpenSolaris installation (scooby). Cue large audience applause track.
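If you plan to add more than one of these, a tiny generator saves some typing and keeps the shell from expanding $ISADIR and $ZFS-BOOTFS, which GRUB needs to see literally. This is a sketch built from the stanza above. The function name grub_stanza is hypothetical, and the default root device is simply the one from my system.

```shell
# Print a GRUB stanza for a ZFS boot environment.
# $1 = menu title, $2 = bootfs dataset, $3 = GRUB root (optional).
# grub_stanza is a hypothetical helper name.
grub_stanza() {
    title="$1"; bootfs="$2"; root="$3"
    [ -n "$root" ] || root='(hd0,3,a)'
    printf 'title %s\nroot %s\nbootfs %s\n' "$title" "$root" "$bootfs"
    # Single quotes keep the $ variables literal for GRUB to expand.
    printf '%s\n' \
        'kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS' \
        'module$ /platform/i86pc/$ISADIR/boot_archive'
}

# Example (not run here):
#   grub_stanza Scooby panroot/ROOT/scooby >> /panroot/boot/grub/menu.lst
```

Note the append (>>) in the example, so the existing menu entries are left alone.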

Live Upgrade and OpenSolaris Boot Environment Administration

One interesting side effect, on the positive side, is the healthy interaction of Live Upgrade and beadm(1M). For your Solaris and nevada based installations you can continue to use lucreate(1M), luupgrade(1M), and luactivate(1M). On the OpenSolaris side you can see all of your Live Upgrade boot environments as well as your OpenSolaris boot environments. Note that we can create and activate new boot environments as needed.

When in OpenSolaris
# beadm list
BE                           Active Mountpoint Space   Policy Created          
--                           ------ ---------- -----   ------ -------          
nv101a                       -      -          18.17G  static 2008-11-04 00:03 
nv95                         -      -          122.07M static 2008-11-03 12:47 
opensolaris                  -      -          2.83G   static 2008-11-03 16:23 
opensolaris-2008.11-baseline R      -          2.49G   static 2008-11-04 11:16 
s10u6                        -      -          97.22M  static 2008-11-03 12:03 
s10x_u6wos_07b               -      -          205.48M static 2008-11-01 20:51 
scooby                       N      /          2.61G   static 2008-11-04 10:29 

# beadm create doo
# beadm activate doo
# beadm list
BE                           Active Mountpoint Space   Policy Created          
--                           ------ ---------- -----   ------ -------          
doo                          R      -          5.37G   static 2008-11-04 16:23 
nv101a                       -      -          18.17G  static 2008-11-04 00:03 
nv95                         -      -          122.07M static 2008-11-03 12:47 
opensolaris                  -      -          25.5K   static 2008-11-03 16:23 
opensolaris-2008.11-baseline -      -          105.0K  static 2008-11-04 11:16 
s10u6                        -      -          97.22M  static 2008-11-03 12:03 
s10x_u6wos_07b               -      -          205.48M static 2008-11-01 20:51 
scooby                       N      /          2.61G   static 2008-11-04 10:29 

For the first time I have a single Solaris disk environment that can boot Solaris 10, Solaris Express Community Edition (nevada) or OpenSolaris and have access to all of my data. I did have to add a mount for my shared FAT32 file system (I have an iPhone and several iPods - so Windows do occasionally get opened), but that is about it. Now off to the repository to start playing with all of the new OpenSolaris goodies like Songbird, Brasero, Bluefish and the Xen bits.

Nov 04 2008, 06:55:44 PM CST

20080218 Monday February 18, 2008
ZFS and FMA - Two great tastes .....
Our good friend Isaac Rozenfeld talks about the Multiplicity of Solaris. When talking about Solaris I will use the phrase "The Vastness of Solaris". If you have attended a Solaris Boot Camp or Tech Day in the last few years you get an idea of what we are talking about - when we go on about Solaris hour after hour after hour.

But the key point in Isaac's multiplicity discussion is how the cornucopia of Solaris features work together to do some pretty spectacular (and competitively differentiating) things. In the past we've looked at combinations such as ZFS and Zones or Service Management, Role Based Access Control (RBAC) and Least Privilege. Based on a conversation last week in St. Louis, let's consider how ZFS and Solaris Fault Management (FMA) play together.

Preparation

Let's begin by creating some fake devices that we can play with. I don't have enough disks on this particular system, but I'm not going to let that slow me down. If you have sufficient real hot swappable disks, feel free to use them instead.
# mkfile 1g /dev/dsk/disk1
# mkfile 1g /dev/dsk/disk2
# mkfile 512m /dev/dsk/disk3
# mkfile 512m /dev/dsk/disk4
# mkfile 1g /dev/dsk/spare1

Now let's create a couple of zpools using the fake devices. pool1 will be a 1GB mirrored pool using disk1 and disk2. pool2 will be a 512MB mirrored pool using disk3 and disk4. Device spare1 will spare both pools in case of a problem - which we are about to inflict upon the pools.
# zpool create pool1 mirror disk1 disk2 spare spare1
# zpool create pool2 mirror disk3 disk4 spare spare1
# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

So far so good. If we were to run a scrub on either pool, it would complete immediately. Remember that unlike hardware RAID disk replacement, ZFS scrubbing and resilvering only touch blocks that contain actual data. Since there is no data in these pools (yet), there is little for the scrubbing process to do.
# zpool scrub pool1
# zpool scrub pool2
# zpool status
  pool: pool1
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 09:24:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 09:24:17 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

Let's populate both pools with some data. I happen to have a directory of scenic images that I use as screen backgrounds - that will work nicely.

# cd /export/pub/pix
# find scenic -print | cpio -pdum /pool1
# find scenic -print | cpio -pdum /pool2

# df -k | grep pool
pool1                1007616  248925  758539    25%    /pool1
pool2                 483328  248921  234204    52%    /pool2

And yes, cp -r would have been just as good.

Problem 1: Simple data corruption

Time to inflict some harm upon the pool. First, some simple corruption. Writing some zeros over half of the mirror should do quite nicely.
# dd if=/dev/zero of=/dev/dsk/disk1 bs=8192 count=10000 conv=notrunc
10000+0 records in
10000+0 records out 
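
If you would rather rehearse that dd before pointing it at a pool, here is a minimal sketch that does the same thing to a scratch file. The helper names (zero_head, file_size, demo_zero_head) and the 64-block file are made up for illustration. The point is conv=notrunc: the first blocks are overwritten in place and the file keeps its length, so the fake disk stays the same size after the damage. At 8192 bytes per block, the 10000 blocks above come to roughly 78 MB of zeros.

```shell
# Hypothetical helper: zero the first COUNT blocks of FILE in place.
# usage: zero_head FILE BLOCKSIZE COUNT
zero_head() {
    dd if=/dev/zero of="$1" bs="$2" count="$3" conv=notrunc 2>/dev/null
}

# Size of FILE in bytes (tr strips any padding wc may add).
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Build a scratch file of 64 x 8192 bytes of 'x', zero the first
# 10 blocks, then print: size-before size-after surviving-x-bytes
demo_zero_head() {
    f=$(mktemp) || return 1
    dd if=/dev/zero bs=8192 count=64 2>/dev/null | tr '\0' 'x' > "$f"
    before=$(file_size "$f")
    zero_head "$f" 8192 10
    after=$(file_size "$f")
    survivors=$(tr -d '\0' < "$f" | wc -c | tr -d ' ')
    rm -f "$f"
    echo "$before $after $survivors"
}
```

Calling demo_zero_head should report the same size before and after, with only the bytes past the tenth block still intact. This assumes a POSIX-style shell such as ksh or bash rather than the legacy /bin/sh.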

At this point we are unaware that anything has happened to our data. So let's try accessing some of the data to see if we can observe ZFS self healing in action. If your system has plenty of memory and is relatively idle, accessing the data may not be sufficient. If you still end up with no errors after the cpio, try a zpool scrub - that will catch all errors in the data.
# cd /pool1
# find . -print | cpio -ov > /dev/null
416027 blocks

Let's ask our friend fmstat(1m) if anything is wrong ?
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.1   0   0     0     0      0      0
disk-transport           0       0  0.0  366.5   0   0     0     0    32b      0
eft                      0       0  0.0    2.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       1       0  0.0    0.2   0   0     0     0      0      0
io-retire                0       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             1       0  0.0   16.0   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  620.3   0   0     0     0      0      0
syslog-msgs              1       0  0.0    9.7   0   0     0     0      0      0
zfs-diagnosis          162     162  0.0    1.5   0   0     1     0   168b   140b
zfs-retire               1       1  0.0  112.3   0   0     0     0      0      0

As the guys in the Guinness commercial say, "Brilliant!" The important thing to note here is that the zfs-diagnosis engine has run several times indicating that there is a problem somewhere in one of my pools. I'm also running this on Nevada so the zfs-retire engine has also run, kicking in a hot spare due to excessive errors.
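
On a busier system it can help to filter fmstat down to just the ZFS engines that have actually received events. This is a small sketch, assuming the column layout shown above (module name in column 1, ev_recv in column 2); zfs_fma_activity is a made-up name, not part of FMA.

```shell
# Hypothetical filter: read fmstat output on stdin and print the name and
# ev_recv count of each zfs-* module that has received at least one event.
zfs_fma_activity() {
    awk '$1 ~ /^zfs-/ && $2 + 0 > 0 { print $1, $2 }'
}

# On a live system:
#   fmstat | zfs_fma_activity
```

No output at all means neither ZFS engine has seen anything since the counters were last reset.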

So which pool is having the problems ? We continue our FMA investigation to find out.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e  ZFS-8000-GH    Major    

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.


# zpool status -x
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress, 44.83% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pool1         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              disk1   DEGRADED     0     0   162  too many errors
              spare1  ONLINE       0     0     0
            disk2     ONLINE       0     0     0
        spares
          spare1      INUSE     currently in use

errors: No known data errors

This tells us all that we need to know. The device disk1 was found to have quite a few checksum errors - so many in fact that it was replaced automatically by a hot spare. The spare was resilvering and a full complement of data replicas would be available soon. The entire process was automatic and completely observable.
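
With more than a couple of disks, picking the offender out of the status output by eye gets tedious. The following sketch prints only the rows whose CKSUM column is non-zero. It assumes the zpool status layout shown above (NAME, STATE, READ, WRITE, CKSUM); cksum_offenders is a hypothetical name.

```shell
# Hypothetical filter: read 'zpool status' output on stdin and print
# "pool device state cksum" for every row whose CKSUM column is non-zero.
cksum_offenders() {
    awk '
        $1 == "pool:" { pool = $2 }
        # config rows have numeric READ, WRITE and CKSUM columns (0, 162, 2.60K ...)
        $3 ~ /^[0-9.]+[KMG]?$/ && $4 ~ /^[0-9.]+[KMG]?$/ &&
        $5 ~ /^[0-9.]+[KMG]?$/ && $5 != "0" {
            print pool, $1, $2, $5
        }
    '
}

# On a live system:
#   zpool status | cksum_offenders
```

Note that when errors roll up the tree, the mirror and pool rows are printed as well as the leaf devices.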

Since we inflicted harm upon the (fake) disk device ourselves, we know that it is in fact quite healthy. So we can restore our pool to its original configuration rather simply - by detaching the spare and clearing the error. We should also clear the FMA counters and repair the ZFS vdev so that we can tell if anything else is misbehaving in either this or another pool.
# zpool detach pool1 spare1
# zpool clear pool1
# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 10:25:26 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors


# fmadm reset zfs-diagnosis
# fmadm reset zfs-retire
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  223.5   0   0     0     0    32b      0
eft                      1       0  0.0    4.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       4       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             4       0  0.0    8.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  372.7   0   0     0     0      0      0
syslog-msgs              4       0  0.0    5.4   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    1.4   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0


# fmdump -v -u d82d1716-c920-6243-e899-b7ddd386902e
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 09:51:49.3025 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
  100%  fault.fs.zfs.vdev.checksum

        Problem in: 
           Affects: zfs://pool=pool1/vdev=449a3328bc444732
               FRU: -
          Location: -

# fmadm repair zfs://pool=pool1/vdev=449a3328bc444732
fmadm: recorded repair to zfs://pool=pool1/vdev=449a3328bc444732

# fmadm faulty
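
The hand-off from fmdump to fmadm repair shown above is easy to script. A minimal sketch, assuming the "Affects:" line format that appears in the fmdump -v output; affected_resources is a hypothetical name.

```shell
# Hypothetical filter: read 'fmdump -v' output on stdin and print the
# resource named on each "Affects:" line.
affected_resources() {
    awk '$1 == "Affects:" { print $2 }'
}

# On a live system, with $uuid taken from 'fmadm faulty':
#   fmdump -v -u "$uuid" | affected_resources | while read -r fmri; do
#       fmadm repair "$fmri"
#   done
```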

Problem 2: Device failure

Time to do a little more harm. In this case I will simulate the failure of a device by removing the fake device. Again we will access the pool and then consult fmstat to see what is happening (are you noticing a pattern here????).
# rm -f /dev/dsk/disk2
# cd /pool1
# find . -print | cpio -oc > /dev/null
416027 blocks

# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  214.2   0   0     0     0    32b      0
eft                      1       0  0.0    4.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       4       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             4       0  0.0    8.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  372.7   0   0     0     0      0      0
syslog-msgs              4       0  0.0    5.4   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    1.4   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0

Rats, the find ran totally out of cache from the last example. As before, should this happen, proceed directly to zpool scrub.
# zpool scrub pool1
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  190.5   0   0     0     0    32b      0
eft                      1       0  0.0    4.1   0   0     0     0   1.4M      0
fmd-self-diagnosis       5       0  0.0    0.5   0   0     0     0      0      0
io-retire                1       0  0.0    1.0   0   0     0     0      0      0
snmp-trapgen             6       0  0.0    7.4   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  329.0   0   0     0     0      0      0
syslog-msgs              6       0  0.0    4.6   0   0     0     0      0      0
zfs-diagnosis           16       1  0.0   70.3   0   0     1     1   168b   140b
zfs-retire               1       0  0.0  509.8   0   0     0     0      0      0

Again, hot sparing has kicked in automatically. The evidence of this is the zfs-retire engine running.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 11:07:29 50ea07a0-2cd9-6bfb-ff9e-e219740052d5  ZFS-8000-D3    Major    
Feb 18 11:16:43 06bfe323-2570-46e8-f1a2-e00d8970ed0d

Fault class : fault.fs.zfs.device

Description : A ZFS device failed.  Refer to http://sun.com/msg/ZFS-8000-D3 for
              more information.

Response    : No automated response will occur.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

# zpool status -x
  pool: pool1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver in progress, 4.94% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pool1         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            disk1     ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              disk2   UNAVAIL      0     0     0  cannot open
              spare1  ONLINE       0     0     0
        spares
          spare1      INUSE     currently in use

errors: No known data errors

As before, this tells us all that we need to know. A device (disk2) has failed and is no longer in operation. Sufficient spares existed and one was automatically attached to the damaged pool. Resilvering completed successfully and the data is once again fully mirrored.

But here's the magic. Let's repair the device - again simulated with our fake device.
# mkfile 1g /dev/dsk/disk2
# zpool replace pool1 disk2
# zpool status pool1 
  pool: pool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 4.86% done, 0h1m to go
config:

        NAME               STATE     READ WRITE CKSUM
        pool1              DEGRADED     0     0     0
          mirror           DEGRADED     0     0     0
            disk1          ONLINE       0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                disk2/old  UNAVAIL      0     0     0  cannot open
                disk2      ONLINE       0     0     0
              spare1       ONLINE       0     0     0
        spares
          spare1           INUSE     currently in use

errors: No known data errors
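
Rather than rerunning zpool status by hand, the scrub line can be boiled down to a one-word state that a loop can wait on. This is a sketch that assumes the "scrub:" line format shown in this article; scrub_state is a made-up name.

```shell
# Hypothetical filter: read 'zpool status POOL' output on stdin and print
#   "running <percent>"  while a scrub or resilver is in progress
#   "completed"          once it has finished
#   "idle"               if none has been requested
scrub_state() {
    awk '$1 == "scrub:" {
        if ($0 ~ /in progress/) {
            for (i = 2; i <= NF; i++)
                if ($i ~ /%$/) print "running", $i
        } else if ($0 ~ /completed/) {
            print "completed"
        } else {
            print "idle"
        }
    }'
}

# On a live system:
#   while [ "$(zpool status pool1 | scrub_state)" != "completed" ]; do
#       sleep 10
#   done
```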

Get a cup of coffee while the resilvering process runs.
# zpool status
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   


# fmadm faulty

Notice the nice integration with FMA. Not only was the new device resilvered, but the hot spare was detached and the FMA fault was cleared. The fmstat counters still show that there was a problem and the fault report still exists in the fault log for later interrogation.
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  171.5   0   0     0     0    32b      0
eft                      1       0  0.0    3.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       6       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    0.9   0   0     0     0      0      0
snmp-trapgen             6       0  0.0    6.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  294.3   0   0     0     0      0      0
syslog-msgs              6       0  0.0    4.2   0   0     0     0      0      0
zfs-diagnosis           36       1  0.0   51.6   0   0     0     1      0      0
zfs-retire               1       0  0.0  170.0   0   0     0     0      0      0

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Feb 16 11:38:16.0976 48935791-ff83-e622-fbe1-d54c20385afc ZFS-8000-GH
Feb 16 11:38:30.8519 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH
Feb 18 09:51:49.3025 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713 ZFS-8000-GH
Feb 18 09:56:24.8029 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
Feb 18 10:23:07.2228 7c04a6f7-d22a-e467-c44d-80810f27b711 ZFS-8000-GH
Feb 18 10:25:14.6429 faca0639-b82b-c8e8-c8d4-fc085bc03caa ZFS-8000-GH
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
Feb 18 11:16:44.2497 06bfe323-2570-46e8-f1a2-e00d8970ed0d ZFS-8000-D3


# fmdump -V -u 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3

  TIME                 CLASS                                 ENA
  Feb 18 11:07:27.8476 ereport.fs.zfs.vdev.open_failed       0xb22406c635500401

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
        code = ZFS-8000-D3
        diag-time = 1203354449 236999
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = Dimension XPS                
                        chassis-id = 7XQPV21
                        server-id = arrakis
                (end authority)

                mod-name = zfs-diagnosis
                mod-version = 1.0
        (end de)

        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x3a2ca6bebd96cfe3
                        vdev = 0xedef914b5d9eae8d
                (end asru)

                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x3a2ca6bebd96cfe3
                        vdev = 0xedef914b5d9eae8d
                (end resource)

        (end fault-list[0])

        fault-status = 0x3
        __ttl = 0x1
        __tod = 0x47b9bb51 0x1ef7b430

# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset

# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset

Problem 3: Unrecoverable corruption

As those of you who have attended one of my Boot Camps or Solaris Best Practices training classes know, House is one of my favorite TV shows - the only one that I watch regularly. And this next example would make a perfect episode. Is it likely to happen ? No, but it is so cool when it does :-)

Remember our second pool, pool2. It has the same contents as pool1. Now, let's do the unthinkable - let's corrupt both halves of the mirror. Surely data loss will follow, but the fact that Solaris stays up and running and can report what happened is pretty spectacular. But it gets so much better than that.
# dd if=/dev/zero of=/dev/dsk/disk3 bs=8192 count=10000 conv=notrunc
# dd if=/dev/zero of=/dev/dsk/disk4 bs=8192 count=10000 conv=notrunc
# zpool scrub pool2

# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  166.0   0   0     0     0    32b      0
eft                      1       0  0.0    3.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       6       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    0.9   0   0     0     0      0      0
snmp-trapgen             8       0  0.0    6.3   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  294.3   0   0     0     0      0      0
syslog-msgs              8       0  0.0    3.9   0   0     0     0      0      0
zfs-diagnosis         1032    1028  0.6   39.7   0   0    93     2    15K    13K
zfs-retire               2       0  0.0  158.5   0   0     0     0      0      0

As before, lots of zfs-diagnosis activity. And two hits to zfs-retire. But we only have one spare - this should be interesting. Let's see what is happening.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e  ZFS-8000-GH    Major    
Feb 18 13:18:42 c3889bf1-8551-6956-acd4-914474093cd7

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 16 11:38:30 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233  ZFS-8000-GH    Major    
Feb 18 09:51:49 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713
Feb 18 10:23:07 7c04a6f7-d22a-e467-c44d-80810f27b711
Feb 18 13:18:42 0a1bf156-6968-4956-d015-cc121a866790

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

# zpool status -x
  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        pool2         DEGRADED     0     0 2.60K
          mirror      DEGRADED     0     0 2.60K
            spare     DEGRADED     0     0 2.43K
              disk3   DEGRADED     0     0 5.19K  too many errors
              spare1  DEGRADED     0     0 2.43K  too many errors
            disk4     DEGRADED     0     0 5.19K  too many errors
        spares
          spare1      INUSE     currently in use

errors: 247 data errors, use '-v' for a list

So ZFS tried to bring in a hot spare, but there were insufficient replicas to be able to reconstruct all of the data. But here is where it gets interesting. Let's see what zpool status -v says about things.
# zpool status -v
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    INUSE     in use by pool 'pool2'

errors: No known data errors

  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        pool2         DEGRADED     0     0 2.60K
          mirror      DEGRADED     0     0 2.60K
            spare     DEGRADED     0     0 2.43K
              disk3   DEGRADED     0     0 5.19K  too many errors
              spare1  DEGRADED     0     0 2.43K  too many errors
            disk4     DEGRADED     0     0 5.19K  too many errors
        spares
          spare1      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        /pool2/scenic/cider mill crowds.jpg
        /pool2/scenic/Cleywindmill.jpg
        /pool2/scenic/csg_Landscapes001_GrandTetonNationalPark,Wyoming.jpg
        /pool2/scenic/csg_Landscapes002_ElowahFalls,Oregon.jpg
        /pool2/scenic/csg_Landscapes003_MonoLake,California.jpg
        /pool2/scenic/csg_Landscapes005_TurretArch,Utah.jpg
        /pool2/scenic/csg_Landscapes004_Wildflowers_MountRainer,Washington.jpg
        /pool2/scenic/csg_Landscapes!idx011.jpg
        /pool2/scenic/csg_Landscapes127_GreatSmokeyMountains-NorthCarolina.jpg
        /pool2/scenic/csg_Landscapes129_AcadiaNationalPark-Maine.jpg
        /pool2/scenic/csg_Landscapes130_GettysburgNationalPark-Pennsylvania.jpg
        /pool2/scenic/csg_Landscapes131_DeadHorseMill,CrystalRiver-Colorado.jpg
        /pool2/scenic/csg_Landscapes132_GladeCreekGristmill,BabcockStatePark-WestVirginia.jpg
        /pool2/scenic/csg_Landscapes133_BlackwaterFallsStatePark-WestVirginia.jpg
        /pool2/scenic/csg_Landscapes134_GrandCanyonNationalPark-Arizona.jpg
        /pool2/scenic/decisions decisions.jpg
        /pool2/scenic/csg_Landscapes135_BigSur-California.jpg
        /pool2/scenic/csg_Landscapes151_WataugaCounty-NorthCarolina.jpg
        /pool2/scenic/csg_Landscapes150_LakeInTheMedicineBowMountains-Wyoming.jpg
        /pool2/scenic/csg_Landscapes152_WinterPassage,PondMountain-Tennessee.jpg
        /pool2/scenic/csg_Landscapes154_StormAftermath,OconeeCounty-Georgia.jpg
        /pool2/scenic/Brig_Of_Dee.gif
        /pool2/scenic/pvnature14.gif
        /pool2/scenic/pvnature22.gif
        /pool2/scenic/pvnature7.gif
        /pool2/scenic/guadalupe.jpg
        /pool2/scenic/ernst-tinaja.jpg
        /pool2/scenic/pipes.gif
        /pool2/scenic/boat.jpg
        /pool2/scenic/pvhawaii.gif
        /pool2/scenic/cribgoch.jpg
        /pool2/scenic/sun1.gif
        /pool2/scenic/sun1.jpg
        /pool2/scenic/sun2.jpg
        /pool2/scenic/andes.jpg
        /pool2/scenic/treesky.gif
        /pool2/scenic/sailboatm.gif
        /pool2/scenic/Arizona1.jpg
        /pool2/scenic/Arizona2.jpg
        /pool2/scenic/Fence.jpg
        /pool2/scenic/Rockwood.jpg
        /pool2/scenic/sawtooth.jpg
        /pool2/scenic/pvaptr04.gif
        /pool2/scenic/pvaptr07.gif
        /pool2/scenic/pvaptr11.gif
        /pool2/scenic/pvntrr01.jpg
        /pool2/scenic/Millport.jpg
        /pool2/scenic/bryce2.jpg
        /pool2/scenic/bryce3.jpg
        /pool2/scenic/monument.jpg
        /pool2/scenic/rainier1.gif
        /pool2/scenic/arch.gif
        /pool2/scenic/pv-anzab.gif
        /pool2/scenic/pvnatr15.gif
        /pool2/scenic/pvocean3.gif
        /pool2/scenic/pvorngwv.gif
        /pool2/scenic/pvrmp001.gif
        /pool2/scenic/pvscen07.gif
        /pool2/scenic/pvsltd04.gif
        /pool2/scenic/banhall28600-04.JPG
        /pool2/scenic/pvwlnd01.gif
        /pool2/scenic/pvnature08.gif
        /pool2/scenic/pvnature13.gif
        /pool2/scenic/nokomis.jpg
        /pool2/scenic/lighthouse1.gif
        /pool2/scenic/lush.gif
        /pool2/scenic/oldmill.gif
        /pool2/scenic/gc1.jpg
        /pool2/scenic/gc2.jpg
        /pool2/scenic/canoe.gif
        /pool2/scenic/Donaldson-River.jpg
        /pool2/scenic/beach.gif
        /pool2/scenic/janloop.jpg
        /pool2/scenic/grobacro.jpg
        /pool2/scenic/fnlgld.jpg
        /pool2/scenic/bells.gif
        /pool2/scenic/Eilean_Donan.gif
        /pool2/scenic/Kilchurn_Castle.gif
        /pool2/scenic/Plockton.gif
        /pool2/scenic/Tantallon_Castle.gif
        /pool2/scenic/SouthStockholm.jpg
        /pool2/scenic/BlackRock_Cottage.jpg
        /pool2/scenic/seward.jpg
        /pool2/scenic/canadian_rockies_csg110_EmeraldBay.jpg
        /pool2/scenic/canadian_rockies_csg111_RedRockCanyon.jpg
        /pool2/scenic/canadian_rockies_csg112_WatertonNationalPark.jpg
        /pool2/scenic/canadian_rockies_csg113_WatertonLakes.jpg
        /pool2/scenic/canadian_rockies_csg114_PrinceOfWalesHotel.jpg
        /pool2/scenic/canadian_rockies_csg116_CameronLake.jpg
        /pool2/scenic/Castilla_Spain.jpg
        /pool2/scenic/Central-Park-Walk.jpg
        /pool2/scenic/CHANNEL.JPG



In my best Hugh Laurie voice trying to sound very Northeastern American, that is so cool! But we're not even done yet. Let's take this list of files and restore them - in this case, from pool1. Operationally this would be from a backup tape or nearline backup cache, but for our purposes, the contents in pool1 will do nicely.

First, let's clear the zpool error counters and return the spare disk. We want to make sure that our restore works as desired. Oh, and clear the FMA stats while we're at it.
# zpool clear pool2
# zpool detach pool2 spare1

# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset

# fmadm reset zfs-retire   
fmadm: zfs-retire module has been reset

Now individually restore the files that have errors in them and check again. You can even export and reimport the pool and you will find a very nice, happy, and thoroughly error free ZFS pool. Some rather unpleasant gnashing of zpool status -v output with awk has been omitted for sanity's sake.
# zpool scrub pool2
# zpool status pool2
  pool: pool2
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 14:04:56 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

# zpool export pool2
# zpool import pool2
# dircmp -s /pool1 /pool2
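
For the curious, the awk step omitted above might look something like the following. This is a sketch, not the script that was actually used: damaged_files and restore_files are hypothetical names, and it assumes the "Permanent errors" listing format shown earlier and a POSIX-style shell such as ksh or bash rather than the legacy /bin/sh. Leading whitespace is stripped with sub() instead of relying on field splitting, because several of the file names contain spaces.

```shell
# Hypothetical filter: read 'zpool status -v' output on stdin and print
# one damaged file path per line.
damaged_files() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }
        /^[ \t]*pool:/              { in_list = 0 }
        in_list && /^[ \t]+\// {
            sub(/^[ \t]+/, "")      # keep embedded spaces in the file name
            print
        }
    '
}

# Hypothetical helper: for each path on stdin under DAMAGED_ROOT, copy the
# matching file from GOOD_ROOT over it.
# usage: restore_files DAMAGED_ROOT GOOD_ROOT
restore_files() {
    damaged_root=$1
    good_root=$2
    while IFS= read -r path; do
        rel=${path#"$damaged_root"/}
        cp -p "$good_root/$rel" "$path" ||
            echo "could not restore: $path" >&2
    done
}

# On a live system:
#   zpool status -v pool2 | damaged_files | restore_files /pool2 /pool1
```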

Conclusions and Review

So what have we learned ? ZFS and FMA are two great tastes that taste great together. No, that's chocolate and peanut butter, but you get the idea. One more great example of Isaac's Multiplicity of Solaris.

That, and I have finally found a good lab exercise for the FMA training materials. Ever since Christine Tran put the FMA workshop together, we have been looking for some good FMA lab exercises. The materials reference a synthetic fault generator that is not available in public (for obvious reasons). I haven't explored the FMA test harness enough to know if there is anything in there that would make a good lab. But this exercise that we have just explored seems to tie a number of key pieces together.

And of course, one more reason why Roxy says, "You should run Solaris."

Feb 18 2008, 02:30:49 PM CST

20071003 Wednesday October 03, 2007
Live Upgrade from Solaris 10 11/06 to 8/07 without nonglobal zones
Live Upgrade is one of the most useful Solaris features, yet in my travels around the US I still don't see it used as much as I would like. I can think of several reasons for this - not all of them totally valid. And I'm sure there are other reasons, but these are the ones I hear most often.

Let's turn our attention to the topic at hand, upgrading a Solaris 10 11/06 system to 8/07, without zones. This example will be on an x64 system, but the SPARC approach is similar.

If you have read my earlier blog on Live Upgrade, you will recall the process is
  1. Read Infodoc 72099 and install any required patches
  2. Install the LU packages SUNWluu SUNWlur and SUNWlucfg (if present) from the installation media
  3. lurename(1m) if you want to change the name of your new boot environment
  4. lumake(1m) or ludelete(1m) + lucreate(1m) to repopulate the target boot environment with the proper software and configuration files
  5. luupgrade(1m) to upgrade the target boot environment
  6. luactivate(1m) to activate the new boot environment
  7. init 0 to perform the file synchronization and conversions, create the new boot archive and update your GRUB menu


So I fire up my web browser and run over to SunSolve to pick up Infodoc 72099 and see a rather large set of patches. And there are two lists, one for systems with non-global zones and one without. Since we're looking at a system without non-global zones we will start with the shorter of the two lists (the next article will cover systems with nonglobal zones).

Apparently we need patches

Release     Arch  Patch                Description
Solaris 10  x86   118816-03 or higher  nawk patch
Solaris 10  x86   120901-03 or higher  libzonecfg patch
Solaris 10  x86   121334-04 or higher  SUNWzoneu required patch
Solaris 10  x86   119255-42 or higher  patchadd/patchrm patches
Solaris 10  x86   119318-01 or higher  SVr4 Packaging Commands (usr) Patch
Solaris 10  x86   117435-02 or higher  biosdev patch for GRUB Boot

Reboot after installation

Release     Arch  Patch                Description
Solaris 10  x86   120236-01 or higher  SUNWluzone required patches
Solaris 10  x86   121429-08 or higher  SUNWluzone required patches
Solaris 10  x86   121003-03 or higher  pax patch
Solaris 10  x86   123122-02 or higher  prodreg patch
Solaris 10  x86   121005-03            sh patch
Solaris 10  x86   119043-10            /usr/sbin/svccfg patch
Solaris 10  x86   121902-02            i.manifest r.manifest class action script patch
Solaris 10  x86   120901-03            libzonecfg patch
Solaris 10  x86   120069-03            telnet security patch
Solaris 10  x86   120070-02            cpio patch
Solaris 10  x86   123333-01            tftp patch


Hmmm, seems like a lot of patches and a required reboot! So I fire up our new friend updatemanager to patch my system. I see that there is a new updatemanager patch available (121119-13), so I installed that one all by itself and restarted updatemanager.

I soon realize that my choice of patching tools is making this a bit challenging. Users of patch tools such as Patch Check Advanced (PCA) may have an easier time, but I was determined to do this with updatemanager, with occasional help from the patch READMEs in SunSolve.

The list of patches required for this upgrade applies to any release of Solaris 10. A fresh install of a Solaris 10 11/06 system needed only the following four patches, which is a lot better than I first thought.

119255-42
121429-08
126539-01, as it replaces the required 121902-02
125419-01, as it replaces the required 120069-03

The difficulty with updatemanager was the set of obsoleted patches. Working out that a required patch such as 121902-02 had been obsoleted by 126539-01, which is what gets installed instead, took a bit of manual trolling through patch READMEs. So I'll save you the research - it came down to only the four patches above.
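
If you would rather not do the "or higher" comparison by eye, here is a minimal sketch of that check in shell. The function name patch_satisfied is mine and is not part of any Solaris tool. It assumes patch IDs in the usual NNNNNN-RR form with two-digit revisions, and a POSIX-style shell such as ksh or bash (the old Bourne /sbin/sh does not understand the ${var%...} expansions). It knows nothing about obsoleted patches, so replacements like the two listed above still have to be checked under the replacement ID.

```shell
# patch_satisfied REQUIRED INSTALLED...
# Succeeds when one of the INSTALLED patches has the same base ID as
# REQUIRED and an equal or higher revision.
patch_satisfied() {
    required=$1
    shift
    req_id=${required%-*}       # 119255-42 -> 119255
    req_rev=${required#*-}      # 119255-42 -> 42
    for p in "$@"; do
        id=${p%-*}
        rev=${p#*-}
        # Strip one leading zero so 08 is compared as 8.
        if [ "$id" = "$req_id" ] && [ "${rev#0}" -ge "${req_rev#0}" ]; then
            return 0
        fi
    done
    return 1
}

# Example with a made-up list of installed patches.
if patch_satisfied 119255-42 119255-50 121429-08; then
    echo "119255-42: ok"
fi
```

Feed it the installed patch list from whatever tool you prefer; the function only compares strings and numbers, so it is easy to try out on any system.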

One important note: the required reboot after patch 117435-02 wasn't needed after all - so I'll try to save all of you Solaris 10 11/06 users one reboot. While I have your attention, it is a good idea, if not a best practice, to install the patch utility and packaging patches separately from the rest.

Feeling a lot better about this process, I proceed and install the four required patches using updatemanager in two steps (119255-42 and then the other three patches) and all succeeded, as expected. All that was left to do was finish the standard procedure
# mount -o ro -F hsfs `lofiadm -a /export/iso/s10u4/solarisdvd.iso` /mnt 
# pkgadd -d /mnt/Solaris_10/Product SUNWlur SUNWluu SUNWlucfg 
# lurename -e nv71 -n s10u4 
# lumake -n s10u4 
# luupgrade -u -s /mnt -n s10u4 
# luactivate s10u4 
# init 0 


And all went as expected. Next time I will tackle the longer list of patches and examine the same upgrade path, but with non-global zones.

Oct 03 2007, 03:08:09 PM CDT

20070326 Monday March 26, 2007
Securing MySQL using SMF - the Ultimate Manifest
The best way to learn the Solaris Service Management Facility (SMF) is to migrate a legacy service. The version of MySQL that comes with Solaris is an ideal application. It is relatively simple, has few dependencies, and can be done in just a few quick edits of an existing manifest (utmp would be a good starting template). We cover the basic process in the SMF Deep Dive and various people have contributed manifests to OpenSolaris and Blastwave. While these are good illustrations of how easy the process is, few show what SMF can really do. The motivation for this how-to came from a recent Solaris Bootcamp attendee who asked "what was wrong with the RC scripts the way they were ?".

Without skipping a beat.....
  1. Easy support of multiple service instances
  2. Deterministic location of service log files
  3. Timeouts on the start and stop methods to prevent system boots from hanging indefinitely
  4. Quickly observable service state
  5. Flexible service dependencies
  6. Automatic restarting of the service upon failure
Upon closer inspection, recognizing when the service terminated and restarting it automatically isn't that special for mysql. The mysqld_safe daemon actually performs that step, restarting the database server if it fails. Yes, this is unique to mysql and may not exist for other services. Certainly, if the mysqld_safe parent actually fails, SMF does provide an additional capability by automatically restarting it. But we need more.

Most of the service migration demonstrations are single instance with no downstream application dependencies - so we still need more.

The mysql service start script runs through a set of configuration files, setting variables and starting a detached daemon, so it's highly unlikely that it will ever get stuck. Sure, it can get hacked and have bad things happen to it, but as delivered it is relatively safe. So we still need more.

The answer to the question lies in security. SMF provides a rich set of security features that demonstrate the power of Solaris Role Based Access Control (RBAC) and least privilege. Contrary to what you might think, these features are quite easy to use - once you learn a few simple concepts. This is how we will answer the question "what was wrong with the RC scripts the way they were?".

Authorizations

One of the most useful applications of RBAC is to create administration and operations roles. While the details of these roles will vary from customer to customer, the common theme is that operator roles should be able to start and stop a service in a safe manner and an administrative role should be able to modify service properties (some of which may themselves control whether the service is started or stopped).

Historically this has been accomplished by third party security software inserting itself all over the kernel (sometimes in a manner that makes upgrades or maintenance difficult) or custom scripts that make use of setuid(2). Solaris 10 can perform many of these functions with just a few entries to some configuration files, and SMF makes this process extremely easy.

You can get lots of valuable information on Solaris Security features (roles, profiles, auths, privileges) at the OpenSolaris Security Community. As you navigate the wealth of white papers, ARC cases, and how-to examples, think of Solaris authorizations as the magic that makes this possible (or more precisely simple).

In a sentence, auths are labels that a privileged application uses to restrict access to its features. In our case the privileged applications are svcadm(1M) and svccfg(1M). If you read the smf_security(5) man page (which is excellent reading) you will see that SMF provides several authorization properties: action_authorization, value_authorization, and modify_authorization. Now this is getting interesting. So it appears that we have a choice of authorizations for the operator role. So which one do we use ?

The action_authorization would only allow running the method but not modifying any of the properties. The implication is that you can do
# svcadm enable -t mysql
but not
# svcadm enable mysql
The difference between the two commands is that enable without -t will try to set the property general/enabled to true in addition to running the start method. This would require the value_authorization. But value_authorization will allow you to change (almost) any property in the property group (in this case the general property group), so let's see what else value_authorization will let you do.
# svcprop -p general ssh
general/enabled boolean true
general/action_authorization astring solaris.smf.manage.ssh
general/entity_stability astring Unstable
general/single_instance boolean true
Hmmm, the only properties that might be abused would be the authorizations, but those require additional authorizations (solaris.smf.modify) to change. So it would seem that value_authorization would be safe for an operator role - unorthodox perhaps, but safe. modify_authorization would allow the creation of other service properties, and if limited to the general property group might be confusing, but relatively harmless - unless of course we add a new general property later. For this reason, modify_authorization would not be a good candidate for an operator role.

So which authorization to use ? Use action_authorization if you want a user (or role) to be able to start and stop the service, but not make the change permanent. This is the most common case. Use value_authorization in the general property group if you want that user or role to be able to permanently turn a service on or off - this is generally an administrative role.

Let's put this all together.

Start with your existing SMF manifest for MySQL. If you don't have one, you can use mine at http://blogs.sun.com/resources/bobn/mysql.xml or Keith Lawson's contributed MySQL manifest at the OpenSolaris SMF Contributed Manifests and Methods page.

Add the following section
<property_group name='general' type='framework'>
        <propval name='action_authorization' type='astring' value='mysql.operator' />
        <propval name='value_authorization' type='astring' value='mysql.administrator' />
</property_group>

Import the new manifest by the method of your choice (svccfg import, /lib/svc/method/manifest-import, or reboot) and your new MySQL can be managed by auths. So how do we get those auths assigned to users (or roles) ?

Authorizations are granted to users and roles by the configuration file /etc/user_attr. You can read the user_attr(4) man page for all of the details, but the process is to add auths=mysql.operator to the user or role entry. For example
# grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator
It is possible that a user or role may not be present in /etc/user_attr. In that case just add a line like the one above and assign the appropriate auth.
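
Editing that file by hand is easy to get wrong, so here is a rough sketch of the same edit as a shell function. The name add_auth is my own invention, and the function only understands the simple colon and semicolon layout shown above. It takes the file as an argument so you can try it on a copy before going anywhere near the real /etc/user_attr. It assumes an awk that accepts -v; on Solaris 10 that most likely means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk.

```shell
# add_auth FILE USER AUTH
# Adds AUTH to USER's auths= attribute in a user_attr style file,
# or appends a new entry when USER has none.
add_auth() {
    file=$1 user=$2 auth=$3
    if grep "^${user}:" "$file" >/dev/null 2>&1; then
        awk -F: -v OFS=: -v u="$user" -v a="$auth" '
            $1 == u {
                if ($5 ~ /(^|;)auths=/) {
                    # Rebuild the attribute field, extending only auths=.
                    n = split($5, kv, ";")
                    $5 = ""
                    for (i = 1; i <= n; i++) {
                        if (kv[i] ~ /^auths=/ &&
                            index("," substr(kv[i], 7) ",", "," a ",") == 0)
                            kv[i] = kv[i] "," a
                        $5 = $5 (i > 1 ? ";" : "") kv[i]
                    }
                } else if ($5 == "") {
                    $5 = "auths=" a
                } else {
                    $5 = $5 ";auths=" a
                }
            }
            { print }
        ' "$file" > "$file.new" && mv "$file.new" "$file"
    else
        echo "${user}::::type=normal;auths=${auth}" >> "$file"
    fi
}

# Example, run against a scratch copy:
#   cp /etc/user_attr /tmp/user_attr.test
#   add_auth /tmp/user_attr.test joeuser mysql.operator
```

Running it twice with the same auth leaves the entry unchanged, which makes it safe to use from a setup script.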

Let's see all of this in action.
% auths
mysql.operator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable

% svcadm enable -t mysql
% svcs mysql
STATE          STIME    FMRI
online         15:51:02 svc:/application/mysql:default

So far so good.
% svcadm enable mysql
svcadm: svc:/application/mysql:default: Permission denied.

Why did this fail ?
% svcprop -p general mysql
general/enabled boolean true
general/action_authorization astring mysql.operator
general/entity_stability astring Unstable
general/single_instance boolean true
general/value_authorization astring mysql.administrator

Because enable also tries to set the general/enabled property - and that requires value or modify authorization. Let's change my user definition in /etc/user_attr and try again.
% grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator,mysql.administrator
% auths
mysql.operator,mysql.administrator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable

% svcadm enable mysql
% svcs mysql
STATE          STIME    FMRI
online         16:10:37 svc:/application/mysql:default
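
That wall of comma separated output from auths is hard to scan, so here is a small sketch that answers the question "does this list grant a particular auth ?". The function name auth_granted is hypothetical. It does plain string matching on the list you give it, treating an entry that ends in .* (such as solaris.snmp.*) as a prefix wildcard, and it assumes a POSIX-style shell such as ksh or bash.

```shell
# auth_granted LIST AUTH
# LIST is a comma separated list as printed by auths(1).
# Succeeds when AUTH is in the list, either exactly or through a
# trailing ".*" wildcard entry.
auth_granted() (
    wanted=$2
    IFS=,
    set -f      # keep entries such as solaris.snmp.* from being globbed
    for a in $1; do
        case $a in
            "$wanted")
                exit 0 ;;
            *.\*)
                prefix=${a%\*}      # solaris.snmp.* -> solaris.snmp.
                case $wanted in
                    "$prefix"*) exit 0 ;;
                esac
                ;;
        esac
    done
    exit 1
)

# Example:
#   auth_granted "$(auths)" mysql.administrator || echo "ask your admin"
```

The body runs in a subshell (note the parentheses instead of braces), so the IFS and set -f changes do not leak into your interactive session.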

This is all very cool - but we can still do more.

Removing Root from the Equation

For both simplicity and compatibility with other operating systems, the MySQL service is started by a script that is run as root. This script is generally linked into /etc/rc3.d, but since we have converted it to an SMF service we have many more options. We have already looked at delegated administration using auths, time to turn our attention to privileges.
# /etc/sfw/mysql/mysql.server start
# ps -ef | grep mysqld | grep -v grep
   mysql  1975  1955   0 21:43:17 pts/8       0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --user=mysql --pid
    root  1955     1   0 21:43:17 pts/8       0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# /etc/sfw/mysql/mysql.server stop
This suggests two immediate questions. Does the parent mysqld_safe really have to run as root, or can it be started as a lesser privileged user ? If it can run as a non-root user, exactly what privileges are required to run mysql ?

The answer to the first question is simple: it can be run as a regular user. It only runs as root out of convenience to operating systems that don't have as sophisticated a security framework as Solaris.
#  su - mysql
Sun Microsystems Inc.   SunOS 5.11      snv_57  October 2007
$ sh /etc/sfw/mysql/mysql.server start
$ /usr/sfw/bin/mysqladmin status
Uptime: 1174  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.001
$ sh /etc/sfw/mysql/mysql.server stop
Killing mysqld with pid 1975
Wait for mysqld to exit done
$ exit
#
Now that we have established the fact that a fully privileged user isn't required to run MySQL, what privileges are really required ? How far can we restrict the mysql user ? Glenn Brunette's privilege debugger privdebug.pl is the perfect tool to help us answer this question.
# privdebug.pl -f -v  -e "su - mysql /usr/sfw/sbin/mysqld_safe --user=mysql"
STAT TIMESTAMP          PPID   PID    PRIV                 CMD
USED 2005619300419      2211   2212   proc_taskid          su
USED 2005620883559      2211   2212   proc_setid           su
USED 2005621147993      2211   2212   proc_setid           su
USED 2005621161490      2211   2212   proc_setid           su
USED 2005621165094      2211   2212   proc_setid           su
USED 2005630560973      2211   2212   proc_exec            su
Starting mysqld daemon with databases from /var/mysql                                  contract_event       
USED 2005679230394      2211   2212   proc_fork            sh
USED 2005750348321      2211   2212   proc_fork            sh
USED 2005751386190      2212   2214   proc_exec            sh
USED 2005756249415      2211   2212   proc_fork            sh
USED 2005757238096      2212   2215   proc_fork            sh
USED 2005758495289      2212   2215   proc_exec            sh
USED 2005761778059      2211   2212   proc_fork            sh
USED 2005762623018      2212   2217   proc_fork            sh
USED 2005763874569      2212   2217   proc_exec            sh
USED 2005767441408      2211   2212   proc_fork            sh
USED 2005768337263      2212   2219   proc_exec            sh
USED 2005772916576      2211   2212   proc_fork            sh
USED 2005773996432      2212   2220   proc_fork            sh
USED 2005775465400      2212   2220   proc_exec            sh
USED 2005778750305      2211   2212   proc_fork            sh
USED 2005779846375      2212   2222   proc_exec            sh
USED 2005782042348      2211   2212   proc_fork            sh
USED 2005783110622      2212   2223   proc_exec            sh
USED 2005785636236      2211   2212   proc_fork            sh
USED 2005786824801      2212   2224   proc_exec            sh
USED 2005788593079      2212   2224   proc_exec            nohup
USED 2005790693138      2212   2224   proc_exec            nohup
USED 2005792812264      2211   2212   proc_fork            sh
USED 2005794010658      2212   2225   proc_exec            sh
USED 2005795756145      2212   2225   proc_exec            nohup
USED 2005797704273      2212   2225   proc_exec            nohup
NEED 2005799674735      2211   2212   file_dac_write       sh
USED 2005800708905      2211   2212   proc_fork            sh
USED 2005801869396      2212   2226   proc_exec            sh
USED 2005804780370      2211   2212   proc_fork            sh
USED 2005805854317      2212   2227   proc_exec            sh
USED 2005807860051      2211   2212   proc_fork            sh
USED 2005808907677      2212   2228   proc_exec            sh
USED 2005811293197      2211   2212   proc_fork            sh
USED 2005812393916      2212   2229   proc_exec            sh
USED 2005814589669      2212   2229   proc_exec            nohup
USED 2005816674186      2212   2229   proc_exec            nohup
STOPPING server from pid file /var/mysql/pandora.pid                                  contract_event       
070325 22                    11  mysqld ended 18     contract_event       


Ignore the proc_taskid and proc_setid, they are artifacts of using su(1M) to run the database server as user mysql. We see that mysqld only needs proc_fork and proc_exec. The file_dac_write failure comes from a call to access(2) and is not needed for proper operation.
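
Picking the privileges out of a long trace by eye gets old quickly. As a sketch, the filter below reads privdebug.pl output on standard input and prints the privileges that were actually used, skipping the su lines, as a comma separated list ready for a privileges= attribute. The function name used_privs is mine, and it assumes the six column layout shown above (STAT, TIMESTAMP, PPID, PID, PRIV, CMD).

```shell
# used_privs
# Reads privdebug.pl output on stdin and prints the distinct privileges
# with STAT "USED", ignoring those used by su itself.
used_privs() {
    awk '$1 == "USED" && $6 != "su" { print $5 }' |
        sort -u |
        paste -s -d, -
}

# Example, with the trace saved to a file first:
#   used_privs < /tmp/privdebug.out
```

NEED lines are left out on purpose, so anything like the file_dac_write failure above still deserves a look before you decide it is safe to ignore.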

What do we do with what we have just learned ?

Referring to the smf_method(5) man page (another excellent read), it seems that all we need to do is add a method_credential option to the various methods (start, stop, and refresh). The appropriate section of my new and improved MySQL manifest now looks like
        <exec_method type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='60'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
                </method_context>
        </exec_method>

        <exec_method type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='120'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
                </method_context>
        </exec_method>

        <exec_method type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart' timeout_seconds='120'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
                </method_context>
        </exec_method>
   

So we quickly modify our manifest and import it using one of the standard methods (svccfg import, /lib/svc/method/manifest-import, or a reboot) and we should be done, right ? Well...... not exactly - but we're close.
% svcadm enable mysql
% svcs mysql
STATE          STIME    FMRI
maintenance    21:53:37 svc:/application/mysql:default

$ tail -5 `svcprop -p restarter/logfile mysql`
[ Mar 26 21:51:12 Method "stop" exited with status 0 ]
[ Mar 26 21:53:36 Enabled. ]
[ Mar 26 21:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
svc.startd could not set context for method: chdir: No such file or directory
[ Mar 26 21:53:37 Method "start" exited with status 96 ]

Doh! When we followed the MySQL installation instructions at /etc/sfw/mysql/README.solaris.mysql we created a user account called mysql. But we didn't specify a home directory, did we ? No - so the default template value of /home/mysql was used. But there is no /home/mysql, is there ? Well, no.

How do we fix this ?

Set a reasonable home directory for the mysql user. How about /var/mysql ? Elsewhere in the installation instructions we did set ownership and proper permissions to this directory - so that would seem like a reasonable home directory.

As root
# usermod -d /var/mysql mysql
That is one solution, but it may not be practical for all cases. Perhaps a better idea would be to provide a working directory for each of the methods. The benefit is that I could set it differently for each service instance. This would be done in the method_context tag for the method. So I modify my service manifest to look like
        <exec_method type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='60'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
                </method_context>
        </exec_method>

        <exec_method type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='120'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
                </method_context>
        </exec_method>

        <exec_method type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart' timeout_seconds='120'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
                </method_context>
        </exec_method>
Reimport the manifest and let's see how things go.
# svccfg import /var/svc/manifest/application/mysql.xml
# svcadm clear mysql
# svcs mysql
STATE          STIME    FMRI
maintenance    22:17:49 svc:/application/mysql:default

Argh - now what ?
# tail -5 `svcprop -p restarter/logfile mysql`
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
[ Mar 26 22:17:49 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]

Doh! Since Solaris delivers MySQL as a legacy service the start script doesn't have execute permissions for the mysql user. That's easy to fix.
# ls -l /etc/sfw/mysql/mysql.server
-rwxr--r--   1 root     sys         5655 Mar 22 17:05 /etc/sfw/mysql/mysql.server
# chown mysql /etc/sfw/mysql/mysql.server
# svcadm clear mysql
# svcs mysql
STATE          STIME    FMRI
online         22:23:08 svc:/application/mysql:default
Now that's more like it. One last item to check.
# ps -ef | grep mysqld | grep -v grep
   mysql 12656 12634   0 22:23:11 ?           0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --pid-file=/var/my
   mysql 12634     1   0 22:23:09 ?           0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
   
# ppriv 12634
12634:  /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var
flags = 
        E: basic,!file_link_any,!proc_info,!proc_session
        I: basic,!file_link_any,!proc_info,!proc_session
        P: basic,!file_link_any,!proc_info,!proc_session
        L: all

Now that's what I wanted to see. The parent mysqld_safe is now running as user mysql and with exactly the right privileges. This is very cool indeed. Armed with this information we could also create a zone and use the limitpriv attribute to restrict the zone privilege - but we'll leave that for another day.
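
Both trips into maintenance along the way (a working directory that did not exist and a method script the mysql user could not execute) can be caught before the service is ever enabled. Here is a minimal sketch of such a check. The function name method_preflight is hypothetical, and it only tells you something useful when it is run as the user named in method_credential, for example from an su - mysql session.

```shell
# method_preflight SCRIPT WORKDIR
# Run as the service user. Reports a missing working directory and a
# method script that the current user cannot execute.
method_preflight() {
    script=$1 workdir=$2 rc=0
    if [ ! -d "$workdir" ]; then
        echo "working directory $workdir does not exist" >&2
        rc=1
    fi
    if [ ! -x "$script" ]; then
        echo "$script is not executable by the current user" >&2
        rc=1
    fi
    return $rc
}

# Example, as the mysql user:
#   method_preflight /etc/sfw/mysql/mysql.server /var/mysql
```

It will not replace reading the service log, but it turns two of the more common surprises into a one line check.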

Conclusion

It is quite easy not only to leverage Solaris authorizations but also to run services with restricted privileges. We have presented a few templates and a general approach that should make this process less cumbersome.

More important though - we now have a compelling reply when asked "what was wrong with the RC scripts the way they were?"

Mar 26 2007, 08:13:24 PM CDT

20070320 Tuesday March 20, 2007
Zones in a Flash - Literally
Fantastic improvements have been made in the Solaris installation and upgrade process - even more in OpenSolaris (available in the various community releases). As we examined the cloning feature introduced in Solaris 10 11/06, it became apparent that we had stumbled upon a most intriguing capability. By combining zone cloning with the attach/detach capability we have discovered a model for flashing zones: zoneflash.

In a recent boot camp we took a look at this in more detail. Unfortunately the slides (which will be posted soon) didn't quite follow the level of depth we were exploring. Several people asked for notes on how this works - and here they are. The irony is that it will take longer to read about it than it does to perform the actual process - but it is so cool.

The Promise

We start with a fresh Solaris system. In this case it was just live upgraded from media, but it could have been jumpstarted from media or a flash archive. The key point here is that the system has had very little done to it, other than naming and some software installation. Since zone attach makes sure that key system components (specifically packages and patches) are compatible, it makes sense to build our zoneflashes on a system that will look similar to those that will be built in the future.

So how many zones will we build ? That's a good question. If this were system flasharchives the answer would be as few as possible - one per architecture in the most efficient case. But these zoneflashes are different - just applications, some metadata, and perhaps some customizations (naming, security, SMF). It seems reasonable to create one zoneflash for each type of application server you would deploy - think of it as a userspace template. In this example I have chosen four: a blank uncustomized flash (for building a new zoneflash in a flash), database server (MySQL), web server (apache2), and the community edition of webmin (just another application).

Our procedure will be to build a minimal default zoneflash, run it through first boot to populate the SMF repository, and then clone it for the remaining zoneflashes. Each of these will be booted, customized for the particular application, and tested to make sure everything is operating properly.

We will then detach the zones and move the detached zoneroots onto some media that can be transported. Of course, keeping with the theme of zones and flash, the transport could be the flasharchive itself. How cool would it be to jumpstart a server using flasharchives and have all the application zones already present in a known location, such as /zoneflash ? Unfortunately, I'm sitting in seat 18A on an American Airlines flight to Los Angeles and don't quite have the required infrastructure to do that sort of test. But I do have a USB stick and multiple boot environments. That will do nicely.

Once attached, we will clone the zoneflashes as necessary, adding resources (network, local filesystems) and attributes (resource controls) required for the proper operation of the application. When finished we will detach the zoneflashes so they may be used elsewhere.

The Turn

The first step is to build and boot a simple generic sparse root zone. Since this zone isn't really meant for operation, most zonecfg attributes (network configuration, resource limits, et al) will be skipped. We will add them later when we build the real zones - remember, these are just user space application templates.

# zonecfg -z default
default: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:default> create
zonecfg:default> set zonepath=/local/default
zonecfg:default> add inherit-pkg-dir; set dir=/opt; end
zonecfg:default> commit
zonecfg:default> exit
#
# zoneadm -z default install
A few minutes later we have an installed zone, ready for first boot. Since I've attended my Solaris Zones Best Practices class, or at least read the materials, I know how to build a sysidcfg file that will satisfy the sysidtool first boot service. This will allow the zone to boot up all the way without any additional console interaction. Let's do that for our new zone. For root_password, supply your own encrypted string from /etc/shadow - I'm not going to post mine!
# cat > /local/default/root/etc/sysidcfg <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxx
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=default}
EOF
# zoneadm -z default boot
# zlogin -C default 
We need to let first boot processing complete. Since we supplied a valid sysidcfg, it is just a matter of waiting for manifest-import and sysidtool to complete their magic. When complete, log in and take a look around to make sure all is well. Once satisfied, shut down the zone (either from inside the zone or from the global zone) - we are through with it for now.
(from the global zone)
# zoneadm -z default halt
Now we are done with this first zone. Time to clone it for our remaining application zones. Please pardon a bit of inline shell scripting - I hate to type the same thing over and over and over. Sort of makes for a nice script template, doesn't it ? Not quite the sophistication of Brad Diggs' zonemanager, but it will do nicely for our example.

# for zone in webmin mysql web
do
        echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
        zoneadm -z ${zone} clone default
        echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
        echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
        echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
        echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
        echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
        echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
        echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
        echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
        zoneadm -z ${zone} boot
done
#
What in the heck was that all about ? OK, one more time - line by line with annotation.

# for zone in webmin mysql web
do

A quick interactive loop for the creation of three application zones. The variable ${zone} will be set to the name of the zone we are trying to construct.
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
A one liner that creates a new zone configuration based on the already existing default. At this point the only thing we need to change is the zonepath, and it should be set to /local/${zone}.
        zoneadm -z ${zone} clone default
We recognize this as a zone cloning operation. The zone root is copied and a /reconfigure is created in the new zone root so that sysidtool performs a complete configuration on first boot. If you happen to be running on a recent release of OpenSolaris, you can put your zoneroot on ZFS and the cloning operation will only take a few seconds and very little additional disk space will be required. Those of us on Solaris 10 11/06 will have to wait for the 160MB or so to be copied. Still better than the 9 minutes to go through a complete zone installation.
        echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
        echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
        echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
        echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
        echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
        echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
        echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
        echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
This step creates a custom sysidcfg file for each zone. Remember to supply your own root password from /etc/shadow in the global zone. This answers all of the sysidtool questions, including the NFSv4 question.
        zoneadm -z ${zone} boot
Boot the zone. If we have done everything correctly, the next interaction will be with console login.

done
Close the for loop in the interactive script. This process will take a few minutes on Solaris 10 11/06, or if we are being clever with OpenSolaris and ZFS - a few seconds.
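
If the eight echo lines in the loop bother you as much as they bother me, they fold nicely into one small function. This is only a sketch: the name write_sysidcfg and its argument order are my own, and the keywords are exactly the ones used above. Pass your own encrypted root password string as the third argument.

```shell
# write_sysidcfg FILE HOSTNAME ROOT_PASSWORD_HASH
# Writes a sysidcfg with the same answers used for the zones above.
write_sysidcfg() {
    cat > "$1" <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=$3
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=$2}
EOF
}

# Example, inside the for loop from above:
#   write_sysidcfg /local/${zone}/root/etc/sysidcfg ${zone} xxxxxxxxxxx
```

The only things that change from zone to zone are the file location and the hostname, which is why those are the arguments.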

Now for the hard part - customizing the individual application zones. Well, it's not all that difficult. And if you do this regularly, you probably have scripts to do most of the work. It's just individual application installation and customization.

Here is what I did for my example zones.
MySQL
The installation instructions for the Solaris 10 MySQL can be found in /etc/sfw/mysql/README.solaris.mysql. There is a typo in the Solaris 10 version of the README. It will cause a lot of grief if you cut and paste without looking at the results. Fortunately it has been corrected in nevada (aka OpenSolaris Community Edition).

Boot the mysql zone and log in as root.
# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql       This line is incorrect in the Solaris 10 README - my chmod works better with two arguments
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf
The installation instructions continue by linking the start script into /etc/rc3.d. Since we are big SMF fans in these parts, let's do that instead. Feel free to use my MySQL manifest as it contains a couple of cool features (value and action authorizations - more on that later).

Since the mysql zone doesn't have any networking configured, perform this next step from the global zone. If you already have a suitable manifest, or have stashed mine away somewhere in the global zone you can use that instead.
# cd /local/mysql/root/var/svc/manifest/application
# wget http://blogs.sun.com/bobn/resource/mysql.xml
It's probably a good idea to make sure that all of this is working properly. Either reboot the mysql zone, run the manifest-import service manually, or run svccfg import on the new manifest. Your choice. What you should see upon completion is
# svcs mysql
STATE          STIME    FMRI
online         14:41:19 svc:/application/mysql:default

# /usr/sfw/bin/mysqladmin status
Uptime: 459  Threads: 1  Questions: 2  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.004

We're done for now. Unless of course you want to go for some extra credit. In that case
  1. Set up a web server with PHP support. Apache 1 plus the SFWmphp package from the Solaris Companion will do just fine.
  2. Download and unpack phpMyAdmin in the webserver htdocs directory.
  3. Create a user with the mysql.operator authorization
  4. Create a user with the mysql.administrator authorization

Shut down the mysql zone.
Web
This is about as easy as it gets. Boot the web zone and perform the following steps.
# cp /etc/apache2/httpd.conf-example /etc/apache2/httpd.conf
# svcadm enable apache2

A quick check to make sure all is well.
# svcs apache2
STATE          STIME    FMRI
online         17:17:41 svc:/network/http:apache2


# telnet localhost 80
Trying ::1...
telnet: connect to address ::1: Network is unreachable
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>

Connection to localhost closed by foreign host.
We're done for now. Shut down the web zone.
Webmin
This one is a little more complicated. We did this one last time in the zone cloning article, but it is worth a second look.

Our task here is to replace the Solaris webmin with the latest download from http://webmin.com. The technique we are using allows us to install a custom version of an application into a sparse root zone. Specifically, webmin.com's package installs into /opt/webmin, but /opt is a read-only inherit-pkg-dir. The easiest solution is to create a symbolic link in the global zone's /opt that points to a location each non-global zone can safely write. In my example that would be /local-pkgs.

In the global zone, create the link in /opt, create the local package directory in the webmin zoneroot, and download the latest webmin package.
# ln -s ../local-pkgs/webmin /opt/webmin
# mkdir -p /local/webmin/root/local-pkgs/webmin
# cd /local/webmin/root/var/tmp
# wget http://prdownloads.sourceforge.net/webadmin/webmin-1.330.pkg.gz
# gunzip webmin-1.330.pkg.gz

Now boot the webmin zone and log in as root.
# zoneadm -z webmin boot
# zlogin webmin
Remove the Solaris webmin packages (SUNWwebminu SUNWwebminr). The usr package needs to be removed twice - the first pkgrm will leave it as a partially installed package, the second will completely remove it - at least as far as our zone (and future patching) is concerned. Once removed, install the webmin.com version, which should be conveniently located in /var/tmp.
# pkgrm SUNWwebminu SUNWwebminr SUNWwebminu
# pkgadd -d /var/tmp/webmin-1.330.pkg
We are done with this zone. Shut it down.
Detach
We have just built four zones: an empty zone suitable for future customizations, one with the Solaris webmin replaced by the community edition, one with a working MySQL database, and one with a webserver. The last task to perform on these zones in their current state is to detach them, another new feature in Solaris 10 11/06. Zone detach copies the zone configuration into the zoneroot (to be used with a subsequent zone attach) and sets the zone state to configured. You can even delete the zone configurations as a final cleanup prior to building a flash archive.
# zoneadm -z default detach
# zoneadm -z webmin detach
# zoneadm -z mysql detach
# zoneadm -z web detach
# zonecfg -z default delete -F
# zonecfg -z webmin delete -F
# zonecfg -z mysql delete -F
# zonecfg -z web delete -F
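
The eight commands above follow one pattern per zone, so they fold naturally into a loop. This is only a minimal sketch: detach_zones and the DRYRUN variable are hypothetical names made up for illustration, and by default the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: detach each zone, then delete its configuration.
# DRYRUN and detach_zones are hypothetical names, not part of Solaris.
# DRYRUN defaults to "echo", so the commands are printed, not executed.
# Set DRYRUN to an empty string on a real system to run them.
DRYRUN=${DRYRUN-echo}

detach_zones() {
    for zone in "$@"
    do
        $DRYRUN zoneadm -z "$zone" detach
        $DRYRUN zonecfg -z "$zone" delete -F
    done
}

detach_zones default webmin mysql web
```

The loop detaches and deletes one zone at a time instead of doing all of the detaches first. The end result is the same, since each delete only depends on its own zone having been detached.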

And flash
Unless the person in 18B wants to be a jumpstart server, we will have to simulate the jumpstart/flasharchive process. We can do this by booting into an alternate boot environment and then delivering the detached zoneroots by some sort of shared or removable storage - something like a USB memory stick. When we are done with this exercise, our zoneflashes will still be on the memory device, ready for their next use. Since the zones will never be booted, just cloned, the speed of the memory device really isn't important.

We need to prepare the USB memory stick (currently formatted as FAT16). We will use rmformat -l to locate the device, fdisk to put a proper label on it, and finally newfs to build a proper file system. ZFS would be interesting, but it would just get in our way later.
# rmformat -l
Looking for devices...
     1. Logical Node: /dev/rdsk/c2t0d0p0
        Physical Node: /pci@0,0/pci1179,1@1d,7/storage@4/disk@0,0
        Connected Device:          USB DISK 2.0     PMAP
        Device Type: Removable
        Bus: USB
        Size: 984.0 MB
        Label: 
        Access permissions: 
     2. Logical Node: /dev/rdsk/c1t0d0p0
        Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
        Connected Device: TEAC     DW-224E-A        7.2A
        Device Type: CD Reader
        Bus: IDE
        Size: 
        Label: 
        Access permissions: 

# fdisk /dev/rdsk/c2t0d0p0
3 (to delete the existing partition)
1 (to create a new Solaris partition)
5 (to exit and write the new label)

# newfs /dev/rdsk/c2t0d0s2
newfs: construct a new file system /dev/rdsk/c2t0d0s2: (y/n)? y
/dev/rdsk/c2t0d0s2:     2009088 sectors in 981 cylinders of 64 tracks, 32 sectors
        981.0MB in 62 cyl groups (16 c/g, 16.00MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
 1705632, 1738432, 1771232, 1804032, 1836832, 1869632, 1902432, 1935232,
 1968032, 2000832
 
# mkdir /tmp/flash
# mount /dev/dsk/c2t0d0s2 /tmp/flash
# cd /local
# find default webmin web mysql -print | cpio -pdum /tmp/flash
# umount /tmp/flash
We are now done with the original system. At this point we would create a flasharchive (with the detached zoneroots in a convenient place in the archive).
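
Before the umount above, it may be worth confirming that cpio really copied everything. What follows is a minimal sketch that assumes comparing entry counts is a good enough check. count_entries and same_tree are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: compare the number of entries under the same subdirectories
# in a source tree and in a copy of it. Both helpers are hypothetical.

# count_entries DIR NAME... : count everything under DIR/NAME for each NAME
count_entries() {
    dir=$1
    shift
    ( cd "$dir" && find "$@" -print | wc -l | tr -d ' ' )
}

# same_tree SRC DST NAME... : succeed when both trees hold the same count
same_tree() {
    src=$1
    dst=$2
    shift 2
    [ "$(count_entries "$src" "$@")" = "$(count_entries "$dst" "$@")" ]
}
```

With the layout used here, the check would look like: same_tree /local /tmp/flash default webmin web mysql && umount /tmp/flash. A matching count does not prove the file contents are identical, only that nothing was skipped.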

The Prestige

The final act in our magic trick is the delivery. Specifically the transport, reattachment, and subsequent cloning of the zoneflashes on a new system. 18B is now asleep and I really don't want to disturb him, so I'll do this part myself. I'll boot my laptop into another boot environment - built from the same media using the same Live Upgrade method as the boot environment that created the zones.

We begin by mounting the removable media (USB memory stick) that contains the zoneflash. Do take a look around, it is quite likely that our friend volfs has already done this for us. Remember - if we were using a flasharchive to deliver the zoneflash this step would be unnecessary.
# mkdir /flash
# mount /dev/dsk/c2t0d0s2 /flash        (we used rmformat -l to derive the device name)
Now that our zoneflashes have arrived, time to reattach them. The first step is to create zone configurations. If you recall, these were stored in the zoneroot when they were detached. The zonecfg command create -a is used to retrieve the stored configuration information and adapt it to the new system - specifically the new location of the zoneroot. Once configured we use zoneadm attach to reconnect them.

The sequence to reattach our default zone, now called flashdefault, would look something like this.
# zonecfg -z flashdefault
flashdefault: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:flashdefault> create -a /flash/default
zonecfg:flashdefault> commit
zonecfg:flashdefault> exit
# zoneadm -z flashdefault attach
We'll be a little more clever attaching the other three zones.
# for zone in webmin web mysql
  do
      echo "create -a /flash/${zone}" | zonecfg -z flash${zone}
      zoneadm -z flash${zone} attach
  done
At this point our zoneroots are still on the USB memory device - but don't worry, these zones will never be booted. Their only purpose is to deliver preconfigured zones. We will use zone cloning to create our real application zones.

Which we will now do. It is very convenient to use the flashzone as a template for our new zone in case there were some special attributes like limitpriv that we might want to preserve. We will also need to add items that were not present in the zoneflashes - specifically networking and local file systems. Once we are satisfied with the zone configurations we will clone the zoneflash. If we are only building one of each type of zone we can detach the zoneflash so that other administrators can use it on their systems.

Let's do this for the mysql zone.
# zonecfg -z mysql
mysql: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:mysql> create -t flashmysql
zonecfg:mysql> set zonepath=/zones/mysql
zonecfg:mysql> add net; set physical=e1000g0; set address=192.168.100.102/24; end
zonecfg:mysql> add fs; set dir=/export; set special=/export; set options=[rw,nosuid,nodevices]; set type=lofs; end
zonecfg:mysql> commit
zonecfg:mysql> exit

# zoneadm -z mysql clone flashmysql
Copying /flash/mysql...

# zoneadm -z flashmysql detach

# echo "name_service=NONE" >    /zones/mysql/root/etc/sysidcfg
# echo "nfs4_domain=dynamic" >> /zones/mysql/root/etc/sysidcfg
# echo "security_policy=NONE" >> /zones/mysql/root/etc/sysidcfg
# echo "root_password=xxxxxxxxxxx" >> /zones/mysql/root/etc/sysidcfg
# echo "system_locale=C" >> /zones/mysql/root/etc/sysidcfg
# echo "network_interface=NONE {hostname=mysql}" >> /zones/mysql/root/etc/sysidcfg
# echo "terminal=ansi" >> /zones/mysql/root/etc/sysidcfg
# echo "timezone=US/Central" >> /zones/mysql/root/etc/sysidcfg
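
The same eight keywords are needed for every zone, with only the hostname changing. As a minimal sketch, a here-document can generate the file for any zone. make_sysidcfg is a hypothetical helper name, and the root_password value remains a placeholder that must be replaced with your own encrypted string from /etc/shadow.

```shell
#!/bin/sh
# Sketch: emit the same sysidcfg keywords as the echo commands above,
# parameterized by hostname. make_sysidcfg is a hypothetical helper.
# Usage: make_sysidcfg HOSTNAME [ENCRYPTED_ROOT_PASSWORD]
# The default password is only a placeholder, exactly as in the text.
make_sysidcfg() {
    cat <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=${2:-xxxxxxxxxxx}
system_locale=C
network_interface=NONE {hostname=$1}
terminal=ansi
timezone=US/Central
EOF
}

make_sysidcfg mysql
```

To write the file for another zone, redirect the output into its zoneroot, for example: make_sysidcfg web > /zones/web/root/etc/sysidcfg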

And for the finale - boot the newly flashed mysql zone and you should see an enabled and operating mysql service.
# zoneadm -z mysql boot
# zlogin -C mysql
[Connected to zone 'mysql' console]
Hostname: mysql
Creating new rsa public/private host key pair                           
Creating new dsa public/private host key pair
Mar 20 06:15:44 mysql sendmail[1719]: My unqualified host name (mysql) unknown; sleeping for retry
Mar 20 06:15:44 mysql sendmail[1722]: My unqualified host name (mysql) unknown; sleeping for retry

mysql console login: root
Password: 
Last login: Mon Mar 19 17:10:10 on console
Mar 20 06:15:49 mysql login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_57  October 2007
# 
# svcs mysql
STATE          STIME    FMRI
online          6:31:28 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 8  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.125

How cool is that ? Not only did we clone the zone, but since the database is in /var, it was cloned as well. Perhaps not practical for every situation, but still pretty cool.

I will leave the flashing of default, web, and webmin as an exercise to the reader. Follow the sequence we used for the mysql zone and you should have four working zones, built from a flash like mechanism that can be delivered via removable media, flasharchive, or shared storage.
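
For anyone attempting the exercise, here is one possible starting point, offered as a sketch rather than a tested procedure. It prints the zonecfg commands used for the mysql zone with the zone name and address swapped in. zone_cfg_commands is a hypothetical helper, and the IP address in the example is an assumption, not something taken from the original configuration.

```shell
#!/bin/sh
# Sketch: print the zonecfg commands for a zone cloned from its
# flash template, following the mysql example in the text.
# zone_cfg_commands is a hypothetical helper.
# Usage: zone_cfg_commands ZONE ADDRESS/PREFIX
zone_cfg_commands() {
    zone=$1
    address=$2
    cat <<EOF
create -t flash${zone}
set zonepath=/zones/${zone}
add net; set physical=e1000g0; set address=${address}; end
add fs; set dir=/export; set special=/export; set options=[rw,nosuid,nodevices]; set type=lofs; end
commit
EOF
}

# The address below is an assumed example value.
zone_cfg_commands web 192.168.100.103/24
```

The output can be saved to a file and handed to zonecfg with its -f option (zonecfg -z web -f /tmp/web.cfg), followed by zoneadm -z web clone flashweb and a sysidcfg file as shown for mysql.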

Next time we'll take a closer look at MySQL and explore running it as a less privileged user. We'll also look at the action and value authorizations.

Mar 20 2007, 04:01:11 PM CDT Permalink Comments [1]

20070216 Friday February 16, 2007
Cloning Isn't Just for Sheep Any More


While it may not have the social implications nor headline appeal of the now famous Dolly the Sheep, the zone cloning feature introduced with Solaris 10 11/06 is worth further investigation. Before we do that, it is probably a good idea to review basic zone creation and installation prior to the new cloning capability.

Building Zones the Old Fashioned Way

The first step in the creation of a zone is establishing its configuration. This is done by conversing with our friend, zonecfg(1M), who handles all the details of writing the configuration xml file in /etc/zones and updating the zones index file /etc/zones/index.

Such a conversation might go something like....
# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
zonecfg:zone1> set zonepath=/zones/zone1
zonecfg:zone1> add inherit-pkg-dir; set dir=/opt; end
zonecfg:zone1> add net; set physical=iprb0; set address=192.168.100.101/24; end
zonecfg:zone1> add fs; set dir=/export; set special=/export; set options=[rw,nosuid]; set type=lofs; end
zonecfg:zone1> verify
zonecfg:zone1> commit
zonecfg:zone1> exit
#


If you grok zones then you recognize this as a typical sparse root zone. If you have attended one of my zones best practices workshops then you will also notice that I'm following my own advice and making /opt an inherited package directory.

A quick check to make sure all is well.
# zoneadm list -cv
  ID NAME             STATUS         PATH                           BRAND     
   0 global           running        /                              native    
   - zone1            configured     /zones/zone1                   native    


All is as it should be (which is always the case for a how-to example).

The next step is a rather magical affair where the zoneroot is populated. This process is initiated by uttering the following sequence
# zoneadm -z zone1 install
Once spoken, fantastic things start happening behind the scenes - all of them by our good friend Live Upgrade. The actual sequence of events is something like
  1. Create the new zoneroot if it doesn't already exist. If it does exist make sure the permissions are set to 700 and it is owned by [0,0].

  2. Mount all of the inherit-pkg-dir and file systems listed in the zone configuration file.

  3. Create a candidate list of files for the new zoneroot by looking at the global zone contents file /var/sadm/install/contents.

    On my laptop daily driver, this totals approximately 2 million files.

  4. Pick from this list all files that should be delivered to the new zone root by removing all files from packages that are marked as global zone only (SUNW_PKG_THISZONE is set to true)

    We're still over 2 million files, folks!

  5. From the remaining list of files, remove all of those that will be delivered via inherit-pkg-dir directories.

    This is why I like inherit-pkg-dir. We are now down to about 2,300 files. If not for inherit-pkg-dir I would be hitting my boss up for a lot more storage.

  6. Copy all of these files from the global zone into the new zoneroot, replacing commonly edited configuration files with those that were originally delivered with the package (ie /etc/passwd).

  7. Once the files are in place there is one more step to perform. Some of the packages have preinstall and postinstall scripts that might do something important. These need to be run, even if all of the files are delivered via inherited directories. So in package dependency order, all of the packages identified as applicable to the new zone (SUNW_PKG_THISZONE=false) are installed sequentially.

  8. Update the zones index file /etc/zones/index marking the new zone as installed.

  9. Unmount all of the file systems mounted in step 2.
And we are done, with the first part. The amount of time this takes can be estimated as O(sparseness, number of packages, disk speed). To speed up this process I would have to increase the degree of sparseness, which is pretty hard to do once /opt has been added. I could also decrease the number of packages in the global zone - this has some interesting possibilities. I could also get faster disks, but that isn't always practical, especially with a small server configuration or a home system. I may be talking myself into a minimal global zone installation with full root zones - but that sounds like a topic for another day.
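
To get a feel for how much step 5 shrinks the list on your own system, the filtering can be approximated with a few lines of awk. This is a rough sketch and not the code zoneadm actually runs: filter_inherited is a hypothetical helper, it assumes the first field of each line in the contents file is the path (with links written as path=target), and it ignores the global-zone-only filtering from step 4.

```shell
#!/bin/sh
# Sketch of step 5: drop every path that lives under an inherited
# package directory. Reads lines in the style of
# /var/sadm/install/contents on standard input.
# filter_inherited is a hypothetical helper.
# The list is the sparse root defaults plus /opt, as in the example zone.
INHERITED="/lib /platform /sbin /usr /opt"

filter_inherited() {
    awk -v dirs="$INHERITED" '
        BEGIN { n = split(dirs, d, " ") }
        /^#/ { next }                     # skip comment lines
        {
            path = $1
            sub(/[=].*/, "", path)        # strip the target of a link entry
            for (i = 1; i <= n; i++)
                if (path == d[i] || index(path, d[i] "/") == 1)
                    next                  # delivered by inherit-pkg-dir
            print path
        }'
}
```

Running filter_inherited < /var/sadm/install/contents | wc -l in the global zone gives a rough count of what would have to be copied.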

Enough of the theory, how long did this really take ?

On a relatively clean Nevada (aka OpenSolaris Community Edition) install it was almost 10 minutes. The output is below and I have annotated it with the installation steps outlined above.
# time zoneadm -z zone1 install
[1] [2] Preparing to install zone <zone1>.
[3] [4] [5] Creating list of files to copy from the global zone.
[6] Copying <1934> files to the zone.
[7] Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1290> packages on the zone.
Initialized <1290> packages on zone.                                 
[8] [9] Zone <zone1> is initialized.
Installation of <1> packages was skipped.
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.

real    9m38.951s
user    1m26.582s
sys     2m51.252s

But we're still not done, are we ? We still have first boot processing which includes initial population of the SMF repository (which is O(number of services, speed of disks)) and system identification (which is either constant if a sysidcfg file is supplied or O(Bob's increasingly bad typing rate) if we choose an interactive dialog).

For this example the first boot process took about 3 minutes to complete.

We now have a pristine zone, ready for work. But there is more to do, isn't there ? We have to install some software, or at least configure software that is already present. In fact, these customizations might be more complicated than the zone installation process. If I had invested in developing automation scripts or was using some form of advanced provisioning technology this might not be a big deal. If I'm doing this manually then it may be quite a bit of work - and work that I don't want to repeat with regularity. In other words: I'm not likely to use lots of zones and I don't particularly look forward to OS updates.

Let's look at this a bit more and see if we can make this any easier.

This example comes from my (about to be posted) Zones workshop. In our new non-global zone we will replace the Solaris version of Webmin with the community release from webmin.com.

A quick pkgchk(1M) of SUNWwebminu shows that its contents are in /usr/sfw, while SUNWwebminr deposits its payload in /etc/webmin and an SMF manifest in /var/svc/manifest/application/management. Performing the same task on the community edition of Webmin shows that it will install in /etc/webmin and /opt/webmin. The clashing of /etc/webmin indicates that these cannot easily co-exist, but complete replacement is possible (all inherit-pkg-dir destinations are contained in a single directory).

So begin by removing the Solaris version of webmin. This is all done in our new zone.
# zonename
zone1

# pkgrm SUNWwebminr SUNWwebminu



pkgrm: ERROR: unable to remove 
/usr/sfw/bin 
/usr/sfw 
/usr 
## Updating system information.

Removal of <SUNWwebminu> partially failed.

At this point the root package SUNWwebminr is completely gone and SUNWwebminu is marked as partially installed. One more pkgrm(1M) and it will be gone, at least as far as our package contents are concerned. The bits in /usr/sfw are still there, but without the configuration files in /etc/webmin, they are just that, bits in a directory.
# pkgrm SUNWwebminu

The following package is currently installed:
   SUNWwebminu  Webmin - Web-Based System Administration (usr)
                (i386) 11.11.0,REV=2007.01.23.02.15

Do you want to remove this package? [y,n,?,q] y

## Removing installed package instance <SUNWwebminu>
(A previous attempt may have been unsuccessful.)
## Verifying package <SUNWwebminu> dependencies in global zone
## Processing package information.
## Removing pathnames in class 
## Updating system information.

Removal of <SUNWwebminu> was successful.

Now to install the new webmin package. While you weren't looking I put the package in /var/tmp. But there are some things to do before we can proceed. Remember, the package wants to write into /opt/webmin, but /opt is read-only. We can do a couple of things: mount a writable file system (LOFS, local real disk or NFS) onto /opt/webmin in our new zoneroot or we could create a symbolic link for /opt/webmin that would point somewhere writable. The link is much less confusing, so let's go that route this time.

In the global zone do something like
# ln -s /local/webmin /opt/webmin
# mkdir -p /zones/zone1/root/local/webmin
Now we are ready to proceed. In zone zone1, do the following
# pkgadd -d /var/tmp/webmin-1.320.pkg

The following packages are available:
  1  WSwebmin     Webmin - Web-based system administration
                  (all) 1.320

Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]: 



Webmin has been installed and started successfully. Use your web
browser to go to

  http://zone1:10000/

and login with the name and password you entered previously.


Installation of <WSwebmin> was successful.
Now we have a nicely customized non-global zone with one application ready to go. It wasn't all that bad, but there were a few manual steps. Multiply this by 20 or so for all of the other applications and configuration steps that you need to do for your system standards and then by 20 or so for the number of zones you want to provision and it is suddenly looking like a tremendous amount of work.

Until Solaris 10 11/06.

Send in the Clones: Solaris Zone Cloning

Zone cloning is a new feature that bypasses all of the steps in the zone installation process and replaces them by copying the source zoneroot and performing a sys-unconfig(1M). Of course this makes perfect sense - if you duplicate the installation process you should get the exact same results (a wise science teacher taught me that a long time ago). So why not shortcut the process ? Just copy the zoneroot, run sys-unconfig(1M), fix up the zones index file and you are done.

But it gets better than that. If we are copying the zone root then any customization performed on that zoneroot will be preserved. This includes the SMF repository. Not only do we skip the initial import, we also preserve any customizations, such as service related security hardening. Our new cloned zone would also have the community edition of Webmin instead of the one in Solaris. And it's configured, enabled, and will start automatically when the new zone boots - without requiring me to do anything else. Now that's cool.

Let's see how all this works.

Step 1 - create a new zone configuration using our clone source as a template. We need to change the zoneroot and IP address. In more complex configurations, other attributes might need to be changed, but for this simple example this is all that is required.
# zonecfg -z zone2
zone2: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone2> create -t zone1
zonecfg:zone2> set zonepath=/zones/zone2
zonecfg:zone2> select net address=192.168.100.101; set address=192.168.100.102/24; end
zonecfg:zone2> verify
zonecfg:zone2> commit
zonecfg:zone2> exit
Instead of installing a new zone, let's clone from zone1.
# time zoneadm -z zone2 clone zone1
WARNING: read-write lofs file system on '/export' is configured in both zones.
Copying /zones/zone1...

real    0m31.135s
user    0m0.431s
sys     0m3.818s

# zoneadm -z zone2 boot
# zlogin -C zone2  (or supply a sysidcfg file)
Now we're getting somewhere. Zone creation, including application configuration and setup is reduced from about 15 minutes down to 31 seconds. This is getting really cool.

Clones to the left of me, zpools to the right

But wait, there's more! There's an opportunity to make this even more efficient by taking advantage of ZFS clones. Note that this is only available in OpenSolaris at present, but consider the implications of the following example.

Note the use of zone relocation (move) - also a nifty new feature in Solaris 10 11/06.
# zpool create zfs_zones c4d0t0s2
# zoneadm -z zone1 move /zfs_zones/zone1
A ZFS file system has been created for this zone.
Moving across file systems; copying zonepath /zones/zone1...
Cleaning up zonepath /zones/zone1...

# zonecfg -z zone3
zone3: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone3> create -t zone1
zonecfg:zone3> set zonepath=/zfs_zones/zone3
zonecfg:zone3> select net address=192.168.100.101; set address=192.168.100.103/24; end
zonecfg:zone3> verify
zonecfg:zone3> commit
zonecfg:zone3> exit

# time zoneadm -z zone3 clone zone1
WARNING: read-write lofs file system on '/export' is configured in both zones.
Cloning snapshot local/zone1@SUNWzone3
Instead of copying, a ZFS clone has been created for this zone.

real    0m11.402s
user    0m0.380s
sys     0m0.412s

Wow! Under 12 seconds and a completely configured and ready to run zone is built. Throw in a sysidcfg file and we're ready to run. And by using a ZFS clone, almost no additional disk space was required for this new zone.
# df -k | grep zfs_zones
zfs_zones            1007616      27  808469     1%    /zfs_zones
zfs_zones/zone1      1007616  198590  808469    20%    /zfs_zones/zone1
zfs_zones/zone3      1007616  198592  808469    20%    /zfs_zones/zone3

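If each zone consumed its own 200MB, only about 600MB of the 1GB pool would remain (1GB - 200MB - 200MB). But that's not what df shows: about 800MB is still available. Since the zones are nearly identical at this point, only 200MB is consumed from the zpool.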
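
The arithmetic can be checked directly against the df -k columns. This is just a worked example using the numbers printed above. The variable names are made up for illustration.

```shell
#!/bin/sh
# Worked arithmetic using the df -k figures above (all values in KB).
pool_kb=1007616      # size of the zfs_zones pool
zone1_kb=198590      # used column for zfs_zones/zone1
zone3_kb=198592      # used column for zfs_zones/zone3
avail_kb=808469      # avail column, the same for every dataset

# If the clone were a full copy, both zones would be charged to the pool.
expected_free_kb=$((pool_kb - zone1_kb - zone3_kb))

# What the pool has actually given up so far.
consumed_kb=$((pool_kb - avail_kb))

echo "free if zone3 were a full copy: ${expected_free_kb} KB"
echo "actually consumed from the pool: ${consumed_kb} KB"
```

The first figure works out to 610434 KB (roughly 600MB) and the second to 199147 KB (roughly 200MB), which is about the size of a single zone.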

Practical applications of Zone Cloning

Development environments and testbeds seem a very good fit. Build one standard configuration of a zone and clone it as necessary for each developer or test scenario. If things go wrong, which can happen while testing, just delete the zone and re-clone it. 30 seconds later you are back in business.

Shhhh - don't tell anyone, but I like the privilege restrictions of zones. I'm very likely to give a developer the root password to the zone and let them do what they need to do. The worst they can do is destroy their environment. The impact to me is two zoneadm(1M) invocations and about 30 seconds of clock time.

The better use case comes when you combine this with another new feature in Solaris 10 11/06: zone migration. Imagine the following scenario.
  1. Mount a file system containing a company standard non-global zoneroot
  2. Attach the zone to the system (zonecfg create -a and zoneadm attach)
  3. Clone this new zone as many times as needed
  4. Detach the original zone from the server (zoneadm detach)
  5. Unmount the detached zoneroot filesystem
This sounds a lot like jumpstart and flasharchives, doesn't it ? You bet it does, and it has many of the same benefits. The flashzone (I'm making up this phrase) can be delivered via USB stick, NFS file services, network file copy (scp), or embedded in a server flasharchive. The possibilities are very intriguing.
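
The five steps in that scenario can be laid out as a dry run. This is a sketch under stated assumptions: flashzone_plan is a hypothetical function, the device, mount point and zone names are placeholders, and nothing is executed - the commands are only printed for review. Each clone would also need its own zone configuration (zonecfg create -t) before the clone step, which is left out here for brevity.

```shell
#!/bin/sh
# Sketch: print the commands for the flashzone scenario as a dry run.
# flashzone_plan is a hypothetical helper and all arguments are
# placeholders. Nothing is executed.
# Usage: flashzone_plan DEVICE MOUNTPOINT TEMPLATE NEWZONE...
flashzone_plan() {
    device=$1
    mnt=$2
    template=$3
    shift 3
    echo "mount $device $mnt"                                    # step 1
    echo "zonecfg -z flash$template \"create -a $mnt/$template\"" # step 2
    echo "zoneadm -z flash$template attach"
    for zone in "$@"                                             # step 3
    do
        echo "zoneadm -z $zone clone flash$template"
    done
    echo "zoneadm -z flash$template detach"                      # step 4
    echo "umount $mnt"                                           # step 5
}

flashzone_plan /dev/dsk/c2t0d0s2 /flash standard dev1 dev2
```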

I hope that this has helped introduce you to a few new zones features with Solaris 10 11/06 (and one in OpenSolaris). As I ponder the combination of these new features I find myself beginning to think that a minimal global zone and cloned full root zones may in fact be a superior practice. We'll explore that in more detail soon.

Feb 16 2007, 02:53:06 PM CST Permalink Comments [0]

20060525 Thursday May 25, 2006
What's in a name? that which we call a zone
What's in a name? that which we call a zone
By any other name would virtualize as complete;


One of the most common questions raised during boot camps and other Solaris briefings deals with the subject of containers and zones. There seems to be some confusion as the terms appear to be used interchangeably. Yes, they are related - specifically a zone is a new type of container introduced in Solaris 10, but containers have their origins much earlier.

The 1913 Webster dictionary defines a Container as which provides the foundation of the Solaris container. Quite simply, a Solaris container is any method by which the resources of an application can be controlled (contained). I suppose the origins of the container could date back to the earliest days of Solaris 2 with the introduction of the processor_bind(2) system call and the pbind(1M) administrative command. These controls were somewhat cumbersome for all but specific workloads and a bit primitive to be called a container.

The container became a recognizable entity with the introduction of the Fair Share Scheduler (FSS) in the Solaris 2.6 timeframe. We had a new scheduler class and a relatively easy to use framework to label and control resource usage for complex applications. So we had a container (project), but it was an unbundled product - so not quite a Solaris container.

When did Solaris get a container ? When the Solaris Resource Manager (SRM) became bundled in Solaris 9. Every instance of Solaris had the capability to control resource usage of nearly every application. Why didn't we call it a container in Solaris 9 ? We only had one type of container (a project), so it wasn't really necessary to give it two different labels. With the introduction of Solaris 10, we have a new type of container, the Solaris zone.

Solaris zones are a virtualization technology that adds a security barrier around each user space instance. We now have two orthogonal application controls: security and resource limits. The name containers was introduced to describe both of these technologies.

So is a zone a container ? Absolutely. As are Solaris Resource Management projects and resource pools. And container technologies can be combined to provide several dimensions of application controls (virtualized user space object, resource caps, resource guarantees). Perhaps there will be other types of containers in the future, but for the moment we have three very interesting technologies that can all wear the name container.

May 25 2006, 09:42:21 PM CDT Permalink Comments [0]

20060522 Monday May 22, 2006
To zone, or not to zone
To zone, or not to zone: that is the question:
Whether 'tis nobler in the mind of the administrator to suffer
The slings and arrows of outrageous utilization,
Or to take arms against a sea of application consolidations


One of the most interesting (and often hotly debated) questions raised while planning the adoption of Solaris 10 is when to deploy applications in zones. You can almost hear Howie Mandel asking: zone, or no zone? Some early adopters of Solaris 10 didn't include zones in their Standard Operating Environment (SOE) certifications, preferring to consider their use later after the new OS environments have been deployed and their comfort level with Solaris 10 improved. There is wisdom in this approach, but perhaps the time is right to reconsider this question.

As with any new technology there are trade-offs that should be considered before committing to a course of action. In the case of Solaris Zones, the considerations aren't quite as complicated as they may seem - in fact they can be reduced to the following questions
  1. Am I upgrading on existing hardware or installing on new hardware ?

    This is the most important question, for several reasons. If you are going to upgrade to Solaris 10 from a previous release and not change the hardware then the most efficient method is to use Live Upgrade. Create a new boot environment, install Solaris 10 in the new set of disk slices, and let Live Upgrade manage all of the details of the upgrade (users, file systems, network settings, etc). The upgrade can occur while the applications are running in the current environment, so there is little impact. The previous Solaris environment can be quickly restored if problems are discovered in the new Solaris 10 installation, so the level of risk is minimized.

    At present, Live Upgrade is not supported on a system with local zones, but if you are coming from Solaris 8 or 9 you won't have local zones, so this restriction is rather moot. Conversely, if you are installing on new hardware then you won't be using Live Upgrade, at least not initially.

    So if you are upgrading on existing hardware then don't deploy zones initially. Perform the upgrade (using Live Upgrade) and once the new environment has settled down, start planning the migration of the existing applications into a zone, at a time that is convenient.

  2. Can the application run correctly in a local zone ?

    The first question considered the most efficient approach, but we still must consider the feasibility of running applications in zones. And there are a few considerations.

    Nonglobal zones have a reduced set of privileges that may cause some applications to fail. An example would be something like a DHCP server that requires raw IP access to communicate with systems that don't have IP addresses. Since this privilege doesn't exist in a local zone (at least until we get configurable privileges and per-zone IP stacks), this type of application will not work in a local zone.

    Some applications that don't appear to work with nonglobal zones may work with a little bit of creativity. An example would be the NFS server - it does not work in a nonglobal zone. But that doesn't mean that you can't share data from a nonglobal zone, you just have to use the NFS server in the global zone. Use a writable loopback filesystem between the global and nonglobal zone and share the directory using an NFS server in the global zone. Users in the nonglobal zone can modify and share data, just as if the NFS server were running locally. Another example would be a backup client. It may be unnecessary to run a backup client in a nonglobal zone since all files are visible from the global zone. This can also be true for performance data collectors, and is actually an interesting design goal for intrusion detection.

    And that's really about it. If the application can run in a nonglobal zone and it's convenient to do so, why not ? Let's hear the arguments for the case of the single nonglobal zone.
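
The loopback-and-share arrangement described above for NFS might be sketched as follows. The zone name (zone1) and both directory names are hypothetical; the function only prints the zonecfg commands, and the commands that would actually change the system are left as comments.

```shell
#!/bin/sh
# Sketch: share data from a nonglobal zone through an NFS server that runs
# in the global zone. Zone name and directories are hypothetical examples.

# Print the zonecfg commands that add a writable loopback (lofs) mount.
#   $1 = directory as seen inside the nonglobal zone
#   $2 = backing directory in the global zone
lofs_fs_commands() {
    cat <<EOF
add fs
set dir=$1
set special=$2
set type=lofs
end
EOF
}

# In the global zone (commented out so the sketch is safe to source):
#   lofs_fs_commands /export/shared /export/zone1-shared > /tmp/zone1-lofs.cfg
#   zonecfg -z zone1 -f /tmp/zone1-lofs.cfg
#   share -F nfs -o rw /export/zone1-shared
```

The new fs resource takes effect the next time the zone boots, and the mount point must already exist inside the zone.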



May 22 2006, 06:17:10 PM CDT Permalink Comments [0]

20060209 Thursday February 09, 2006
SMF manifest examples for Apache1 and MySQL
In the Service Management in a Day workshop (and the earlier Migrating a Legacy RC Service module from the Solaris Deep Dives) we examine the migration of MySQL from an RC script to a fully managed SMF service.

Why MySQL ? Well, it's a convenient way to point out that MySQL is included in Solaris 10. But the real reason is that it is rather simple and makes a great platform to show what SMF can really do for us - and it's certainly more than a one trick pony.

So let's set up MySQL and see where this goes. You will find the instructions in /etc/sfw/mysql/README.solaris.mysql, but be careful as there is a small error. The last time I looked, chmod -R requires two arguments, not one.

# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf


Let's start the database manually and make sure that all is well.

# /etc/sfw/mysql/mysql.server start
Starting mysqld daemon with databases from /var/mysql

# /usr/sfw/bin/mysqladmin status
Uptime: 32  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  
Flush tables: 1  Open tables: 0  Queries per second avg: 0.031


Time for the first SMF value - resilient services. Let's terminate mysqld and see what happens.

# pkill mysql
#  mysqladmin status
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)'
Check that mysqld is running and that the socket: '/tmp/mysql.sock' exists!


This is what we expect. When mysqld terminates, nobody is watching and it remains down until the next reboot (or transition back to run level 3).

So what can SMF do for me here ? Paying attention to a non-transient service is a good start.

What we need now is a manifest for MySQL. You can take a look at mine or if you follow the RC Service Migration howto then you will come up with something very close. Put mysql.xml somewhere in /var/svc/manifest (application or local seem like good places, local probably being the best choice). Reboot or run the manifest-import service method to make SMF aware of the new service definition:

# svcs mysql
svcs: Pattern 'mysql' doesn't match any instances
STATE          STIME    FMRI

# /lib/svc/method/manifest-import
Loaded 1 smf(5) service descriptions

# svcadm enable mysql
# svcs mysql
STATE          STIME    FMRI
online         22:39:54 svc:/application/mysql:default

# mysqladmin status
Uptime: 4  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6
  Flush tables: 1  Open tables: 0  Queries per second avg: 0.250
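
This is not the manifest linked above, but a minimal sketch of what an RC migration tends to produce, wrapped in a shell function so it can be dropped into a script. The start and stop methods match the service log shown in this post; the 60 second timeouts, the dependency on svc:/system/filesystem/local, and shipping the instance disabled are assumptions.

```shell
#!/bin/sh
# Sketch only: write a minimal SMF manifest for MySQL to the path given as $1.
# Timeouts and the filesystem dependency are assumptions, not taken from the
# manifest referenced in the post.
write_mysql_manifest() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='mysql'>
<service name='application/mysql' type='service' version='1'>
    <create_default_instance enabled='false' />
    <single_instance />
    <dependency name='filesystem' type='service'
        grouping='require_all' restart_on='none'>
        <service_fmri value='svc:/system/filesystem/local' />
    </dependency>
    <exec_method type='method' name='start'
        exec='/etc/sfw/mysql/mysql.server start' timeout_seconds='60' />
    <exec_method type='method' name='stop'
        exec='/etc/sfw/mysql/mysql.server stop' timeout_seconds='60' />
</service>
</service_bundle>
EOF
}

# Example (as root on Solaris 10; commented out so the sketch is safe to source):
#   write_mysql_manifest /var/svc/manifest/application/mysql.xml
#   svccfg validate /var/svc/manifest/application/mysql.xml
#   svccfg import /var/svc/manifest/application/mysql.xml
```

Because no service model is declared, svc.startd treats this as an ordinary (non-transient) service, which is what makes the restart behavior possible.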


Now, let's try that pkill thing again.

# pkill mysqld

# svcs mysql
STATE          STIME    FMRI
online         22:45:45 svc:/application/mysql:default

Now, if we watch the service log file, which is conveniently located at /var/svc/log/application-mysql:default.log, we will see svc.startd notice that all of the processes have terminated, yet it isn't a transient service. So there is a problem and the service should be restarted.

[ Feb  8 16:53:36 Stopping because all processes in service exited. ]
[ Feb  8 16:53:36 Executing stop method ("/etc/sfw/mysql/mysql.server stop") ]
No mysqld pid file found. Looked for /var/mysql/pandora.pid.
[ Feb  8 16:53:36 Method "stop" exited with status 0 ]
[ Feb  8 16:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
[ Feb  8 16:53:36 Method "start" exited with status 0 ]


This is pretty cool. We've made MySQL somewhat more available than it would have been straight out of the box. Does this eliminate the requirement for High Availability Clusters ? No, but it does open an interesting discussion.

In this example my observation of MySQL's availability is rather naive - if it's running it must be OK. For something like a database server you may want to connect and manipulate some tables to see if the service is really running. We should also note that SMF doesn't really handle the platform availability issues - so HA Clusters are still needed. But it's also interesting to note that many HA scripts only provide coverage for a subset of critical services, usually a database, but ignore the dozens of other services that are also required for proper operation of the service being clustered. Lacking a sophisticated dependency framework, a node failover occurs when one of these other services fails.
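
A less naive check, along the lines suggested above, would actually talk to the server rather than just count processes. Here is a small sketch: the probe command is passed in as arguments so it can be swapped out, and the second example assumes a database named test exists.

```shell
#!/bin/sh
# Sketch: treat the service as healthy only if a probe command succeeds.
# The probe is supplied by the caller; nothing here is MySQL specific.
probe_service() {
    if "$@" >/dev/null 2>&1; then
        echo "healthy"
        return 0
    fi
    echo "unhealthy"
    return 1
}

# Example probes (commented out so the sketch is safe to source):
#   probe_service /usr/sfw/bin/mysqladmin ping
#   probe_service /usr/sfw/bin/mysql -e 'SELECT 1' test
```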

SMF provides such a framework, including the watchdog monitor (svc.startd) - and it does so with very little effort on the part of the administrator or application packager.

But wait, there's more.

In a recent discussion over service minimization (the idea that you don't install software that you have no intention of running) a more subtle value of SMF can be observed. Solaris 10 allows us to separate the question of installation from activation. It's quite easy to install software and then verify that it is disabled. In fact a routine scan of service properties and a comparison against a baseline is a good idea.
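
The routine scan can be as simple as saving the output of svcs and comparing it later. A minimal sketch, assuming both files hold "state fmri" lines such as those produced by svcs -aH -o state,fmri (the file names in the example are hypothetical):

```shell
#!/bin/sh
# Sketch: compare a saved baseline of service states against a fresh scan.
# Prints the differing lines and returns non-zero if the scan has drifted.
#   $1 = baseline file, $2 = current scan file
compare_to_baseline() {
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    diff "$1.sorted" "$2.sorted"
    rc=$?
    rm -f "$1.sorted" "$2.sorted"
    return $rc
}

# Example (commented out so the sketch is safe to source):
#   svcs -aH -o state,fmri > /var/tmp/svcs.baseline     # once, on a known-good system
#   svcs -aH -o state,fmri > /var/tmp/svcs.now          # on each scan
#   compare_to_baseline /var/tmp/svcs.baseline /var/tmp/svcs.now
```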

Here is where a bit of creativity can give us additional safeguards. A well developed SMF manifest will allow us to make an additional distinction. We can now observe the installation of a service, the configuration of a service, and whether or not that service has actually been activated.

How is this done ? A dependency on a configuration file is a good start. Let's look at the MySQL manifest and see how this was done.

<dependency
                name='config_file'
                type='path'
                grouping='require_all'
                restart_on='none'>
                <service_fmri value='file://localhost/var/mysql/my.cnf' />   
</dependency>


This is a dependency on a particular configuration file, in this case /var/mysql/my.cnf. If this file is missing then MySQL will not transition to online. If enabled it will immediately transition to the offline state and a check of svcs -l mysql will show the missing configuration file.
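
To see by hand what a path dependency is looking for, a small sketch like the following can help. It only strips the file://localhost prefix and tests for the file; it is an illustration, not how svc.startd evaluates the dependency.

```shell
#!/bin/sh
# Sketch: check the file behind a path-dependency FMRI by hand.

# Strip the file://localhost prefix, leaving the absolute path.
fmri_to_path() {
    echo "$1" | sed 's|^file://localhost||'
}

# Report whether the file named by the FMRI exists; non-zero if missing.
check_path_dependency() {
    path=`fmri_to_path "$1"`
    if [ -f "$path" ]; then
        echo "present: $path"
        return 0
    fi
    echo "missing: $path"
    return 1
}

# Example (commented out so the sketch is safe to source):
#   check_path_dependency file://localhost/var/mysql/my.cnf
```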

Now this is very cool indeed. For this service to be activated it must be installed, configured and enabled. Failing to configure the service (consider the case of sshd which you probably don't want to run without a configuration file) will provide an obvious and easily observed error condition. This may change the way you look at service minimization.

The takeaway from this exercise is that as you plan your RC service migrations to SMF, add a dependency on an easily observed indication that the service has been properly configured, such as a configuration file.

This brings me to my next example, an Apache 1 service manifest. We start by copying the Apache 2 service manifest at /var/svc/manifest/network/http-apache2.xml - as that seemed like a good place to start. I changed the service name, the documentation block, and the start/stop methods as before.

There's one new wrinkle - take a look at the following property group:

<property_group name='httpd' type='application'>
                <stability value='Evolving' />
                <propval name='ssl' type='boolean' value='false' />
</property_group>


If we look at the Apache 2 start method /lib/svc/method/http-apache2, we will see a query for this service property:


        ssl=`svcprop -p httpd/ssl svc:/network/http:apache2`
        if [ "$ssl" = false ]; then
                cmd="start"
        else
                cmd="startssl"
        fi
        ;;


So this is how we enable SSL support for Apache 2. If we want to do something similar for Apache 1 then we will have to modify the start script /etc/init.d/apache. The other solution would be to remove the property group from the manifest and modify the start definition to call either /etc/init.d/apache start or /etc/init.d/apache startssl.
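
The first option might be sketched like this. The instance name svc:/network/http:apache1 is hypothetical (use whatever you called the service in your manifest), and unlike the Apache 2 method this version falls back to a plain start whenever the property is anything other than true.

```shell
#!/bin/sh
# Sketch of the SSL decision for an Apache 1 start method.
# The FMRI svc:/network/http:apache1 used below is a hypothetical name.

# Map the value of the httpd/ssl property to an /etc/init.d/apache argument.
apache_start_cmd() {
    if [ "$1" = "true" ]; then
        echo "startssl"
    else
        echo "start"
    fi
}

# In the method script itself (commented out so the sketch is safe to source):
#   ssl=`svcprop -p httpd/ssl svc:/network/http:apache1`
#   /etc/init.d/apache `apache_start_cmd "$ssl"`
```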

After you import this new manifest, please remember to unlink the start and stop links from all run level directories (there's one start in rc3.d and one kill in each run level).
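
A quick way to find those links before removing them is sketched below. It assumes the usual S or K plus two digit prefix on the link names; the directory is a parameter so the function can be tried against a scratch copy first.

```shell
#!/bin/sh
# Sketch: list the legacy run-control links for a service so they can be
# reviewed and then removed.
#   $1 = directory holding the rc?.d directories (normally /etc)
#   $2 = script name (for example "apache")
list_rc_links() {
    for link in "$1"/rc?.d/[SK][0-9][0-9]"$2"; do
        # An unmatched glob is left as the literal pattern; skip it.
        [ -f "$link" ] && echo "$link"
    done
    return 0
}

# Review first, then remove (commented out so the sketch is safe to source):
#   list_rc_links /etc apache
#   list_rc_links /etc apache | xargs rm -f
```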

This brings me to my last recommendation - using a configuration file dependency to help keep service instances separated. This is particularly important for the http service as all the executables are named httpd. By adding a dependency to the configuration file you have added an important documentation item that will come in handy when diagnosing service failures. If the instance fails and ends up in a maintenance state, a quick look at svcs -l will tell you which instance you need to investigate.

Where can I learn more about this ? The OpenSolaris SMF Community would be a good place to look. In addition to the excellent articles on Solaris Service Management, there is a repository of contributed manifests that might help you get started. And you are invited to contribute manifests for your converted services - you might even receive a nice OpenSolaris trinket for your efforts.


Feb 09 2006, 12:25:28 AM CST Permalink Comments [0]

20051109 Wednesday November 09, 2005
Common First Time Mistakes - Containers
Containers in Solaris 10 feature an interesting virtualization technology called zones. Local zones are amazingly easy to configure and install, but there are a few things that can trip you up the first few times.
  1. Local zones require system identification
    Since each local zone has its own /etc, it can have a different identity from that of the global zone (timezone, locale, root password). So we need to supply some basic configuration information about the local zone. If you are experienced with Solaris you will recognize the system identification process that runs at first boot. Solaris 10 adds the NFS V4 domain mapping question, which must be answered in addition to supplying an /etc/sysidcfg file. We'll deal with that later.

    The complication presented by local zones is that you can use zlogin(1) to enter a local zone before it has completed its system identification. This is not possible for the global zone, nor in a prior Solaris release - so you may not even consider this a possibility when diagnosing your first few zone configuration problems.

    The symptoms are that you can enter the zone using zlogin(1), but nothing else works. You cannot get in via ssh, rlogin, or telnet even though they have been configured properly (or so you believe at this point). Your first step in diagnosis should be an svcs -a and you will see service states of uninitialized. This is the clue!

    If you look all the way back in the service states you will see there is a service called sysidtool (that calls the service script /lib/svc/method/sysidtool-system). This is where system identification is done (and if you look at the method you will discover how to answer the NFSV4 question).

    The resolution is simple - connect to the local zone console using zlogin -C and answer the identification questions.

    If you are using the Java Desktop System then terminal type 12 (xterms) will provide the best results.

    You will also experience this problem if your sysidcfg file contains an error. The most common errors are incorrect specifications of the timezone and root password.

  2. Failure to answer the NFS V4 question
    You can script the creation of a local zone and supply default identification through the use of an /etc/sysidcfg file. Experienced Solaris administrators will recognize this method from unattended JumpStart installs. Solaris 10 requires one additional configuration item that isn't satisfied by /etc/sysidcfg: the NFS V4 domain mapping question.

    Automating the NFS V4 configuration requires two steps. First, specify the value of NFSMAPID_DOMAIN in $zonepath/root/etc/default/nfs. Second, create a file called $zonepath/root/etc/.NFS4inst_state.domain to let sysidtool know that you have answered the question.

  3. The local zone root directory is $zonepath/root
    If the lab equipment is sufficiently fast then we have a little competition in the Containers workshop. The challenge is to completely automate the installation of a local container and provision an application (typically Apache or MySQL) as well as set up root access via telnet, rlogin, or ssh - but do it in a single script with no intervention. Run the script and the next step is to connect to the provisioned service.

    After 10 minutes of scripting work, and another 10 minutes for a local zone to install, there are always a few exclamations of "Doh" as students realize that they dropped a sysidcfg into $zonepath/etc rather than $zonepath/root/etc.

  4. Make sure the mount points exist for all file systems being supplied by zoneadmd
    Supplying lofs file systems via the zone configuration file (see zonecfg man page) is a convenient way to share files between zones (including the global zone). The advantage of this method is that zoneadmd performs the loopback mount from the privileged global zone as it readies the local zone, thus the local zone isn't permitted to undo this mount. If the mount point (in the local zone) does not exist then the zone will fail to boot.

  5. Network not being plumbed will cause a local zone to fail to boot
    This one is rather unique to the mobile user. For convenience you may have your global zone boot without networking configured. Once you log in then you can run a simple script to plumb up your network interfaces based on how you need to connect (fixed IP address at home or in a lab, DHCP in a hotel, etc). If your local zones have network resources, which is typically the case, the network interfaces must be plumbed up in the global zone prior to booting the local zone. This one has gotten me more than once in a customer demonstration.
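
Items 2 through 4 lend themselves to a few lines of scripting, run from the global zone after the zone is installed and before its first boot. This is a sketch under those assumptions: the zone path, domain, and mount point are whatever the caller passes in, and the example values are hypothetical.

```shell
#!/bin/sh
# Sketch: pre-answer the NFS V4 domain question and create lofs mount points
# for a zone that has been installed but not yet booted.

# $1 = zonepath, $2 = NFS V4 domain
preanswer_nfs4() {
    etc="$1/root/etc"            # note: $zonepath/root/etc, not $zonepath/etc
    nfs="$etc/default/nfs"
    mkdir -p "$etc/default" || return 1
    touch "$nfs"
    # Replace any existing (possibly commented-out) setting, else append one.
    if grep '^#*NFSMAPID_DOMAIN=' "$nfs" >/dev/null; then
        sed "s/^#*NFSMAPID_DOMAIN=.*/NFSMAPID_DOMAIN=$2/" "$nfs" > "$nfs.new" &&
            cat "$nfs.new" > "$nfs" &&
            rm -f "$nfs.new"
    else
        echo "NFSMAPID_DOMAIN=$2" >> "$nfs"
    fi
    # Tell sysidtool that the question has been answered.
    touch "$etc/.NFS4inst_state.domain"
}

# $1 = zonepath, $2 = mount point as seen from inside the zone
ensure_mountpoint() {
    mkdir -p "$1/root$2"
}

# Example (commented out so the sketch is safe to source):
#   preanswer_nfs4 /zones/zone1 example.com
#   ensure_mountpoint /zones/zone1 /export/shared
```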



Nov 09 2005, 11:41:24 PM CST Permalink Comments [4]