Bob Netherton's Weblog
Thursday March 13, 2008
Question of the Day - March 13, 2008
When a girl floats, why do they still call it buoyancy ?

Mar 13 2008, 08:57:47 AM CDT Permalink Comments [1]

Wednesday March 12, 2008
It was 30 years ago today......
It is hard to believe, but it was 30 years ago today (March 12, 1978) when the Rock in Opposition (RIO) festival was held at the New London Theatre. Hosted by Henry Cow, the lineup included Stormy Six (Italy), Samla Mammas Manna (Sweden), Univers Zero (Belgium), and Etron Fou Leloublan (France). Later festivals would include Art Zoyd (France), Aksak Maboul (Belgium), and Art Bears (replacing the then-defunct Henry Cow), but it was this one chance meeting of the five original bands that would change the face of avant/progressive rock forever.

Samla Mammas Manna and Univers Zero are still active bands and have been known to headline various progressive rock festivals around the world. Etron Fou Leloublan is long gone, but its spirit continues with Volapük.

Mar 12 2008, 06:01:59 PM CDT Permalink Comments [0]

Monday February 18, 2008
ZFS and FMA - Two great tastes .....
Our good friend Isaac Rozenfeld talks about the Multiplicity of Solaris. When talking about Solaris I will use the phrase "The Vastness of Solaris". If you have attended a Solaris Boot Camp or Tech Day in the last few years you get an idea of what we are talking about - when we go on about Solaris hour after hour after hour.

But the key point in Isaac's multiplicity discussion is how the cornucopia of Solaris features work together to do some pretty spectacular (and competitively differentiating) things. In the past we've looked at combinations such as ZFS and Zones or Service Management, Role Based Access Control (RBAC) and Least Privilege. Based on a conversation last week in St. Louis, let's consider how ZFS and Solaris Fault Management (FMA) play together.

Preparation

Let's begin by creating some fake devices that we can play with. I don't have enough disks on this particular system, but I'm not going to let that slow me down. If you have sufficient real hot swappable disks, feel free to use them instead.
# mkfile 1g /dev/dsk/disk1
# mkfile 1g /dev/dsk/disk2
# mkfile 512m /dev/dsk/disk3
# mkfile 512m /dev/dsk/disk4
# mkfile 1g /dev/dsk/spare1

Now let's create a couple of zpools using the fake devices. pool1 will be a 1GB mirrored pool using disk1 and disk2. pool2 will be a 512MB mirrored pool using disk3 and disk4. Device spare1 will spare both pools in case of a problem - which we are about to inflict upon the pools.
# zpool create pool1 mirror disk1 disk2 spare spare1
# zpool create pool2 mirror disk3 disk4 spare spare1
# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

So far so good. If we were to run a scrub on either pool, it would complete almost immediately. Remember that unlike hardware RAID disk replacement, ZFS scrubbing and resilvering touch only blocks that contain actual data. Since there is no data in these pools (yet), there is little for the scrubbing process to do.
# zpool scrub pool1
# zpool scrub pool2
# zpool status
  pool: pool1
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 09:24:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 09:24:17 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

Let's populate both pools with some data. I happen to have a directory of scenic images that I use as screen backgrounds - that will work nicely.

# cd /export/pub/pix
# find scenic -print | cpio -pdum /pool1
# find scenic -print | cpio -pdum /pool2

# df -k | grep pool
pool1                1007616  248925  758539    25%    /pool1
pool2                 483328  248921  234204    52%    /pool2

And yes, cp -r would have been just as good.

Problem 1: Simple data corruption

Time to inflict some harm upon the pool. First, some simple corruption. Writing some zeros over half of the mirror should do quite nicely.
# dd if=/dev/zero of=/dev/dsk/disk1 bs=8192 count=10000 conv=notrunc
10000+0 records in
10000+0 records out 
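
The conv=notrunc operand is what makes this a corruption rather than a resize: dd normally truncates a regular output file before writing, which would shrink the fake device to the roughly 78 MB (8192 x 10000 bytes) being written. Here is a minimal sketch of that behavior on a scratch file. The mktemp path and the small sizes are assumptions chosen for illustration, not part of the setup in this post.

```shell
#!/bin/sh
# Scratch file standing in for a file-backed device (hypothetical path from mktemp).
scratch=$(mktemp)

# Fill it with 1 MiB of non-zero bytes so that zeroed regions are easy to spot.
dd if=/dev/zero bs=1024 count=1024 2>/dev/null | tr '\0' 'x' > "$scratch"

# Overwrite the first 8 blocks of 8192 bytes with zeros, keeping the rest of the file.
dd if=/dev/zero of="$scratch" bs=8192 count=8 conv=notrunc 2>/dev/null

# The file keeps its original size (1048576 bytes).
size_after=$(wc -c < "$scratch" | tr -d ' ')
echo "size after overwrite: $size_after"

# Only the overwritten region (8 * 8192 = 65536 bytes) is no longer 'x'.
zeroed=$(tr -d 'x' < "$scratch" | wc -c | tr -d ' ')
echo "zeroed bytes: $zeroed"

rm -f "$scratch"
```

Dropping conv=notrunc from the second dd would leave a 65536 byte file instead, which is not the kind of damage we are trying to simulate.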

At this point we are unaware that anything has happened to our data. So let's try accessing some of the data to see if we can observe ZFS self healing in action. If your system has plenty of memory and is relatively idle, accessing the data may not be sufficient. If you still end up with no errors after the cpio, try a zpool scrub - that will catch all errors in the data.
# cd /pool1
# find . -print | cpio -ov > /dev/null
416027 blocks

Let's ask our friend fmstat(1M) whether anything is wrong.
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.1   0   0     0     0      0      0
disk-transport           0       0  0.0  366.5   0   0     0     0    32b      0
eft                      0       0  0.0    2.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       1       0  0.0    0.2   0   0     0     0      0      0
io-retire                0       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             1       0  0.0   16.0   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  620.3   0   0     0     0      0      0
syslog-msgs              1       0  0.0    9.7   0   0     0     0      0      0
zfs-diagnosis          162     162  0.0    1.5   0   0     1     0   168b   140b
zfs-retire               1       1  0.0  112.3   0   0     0     0      0      0

As the guys in the Guinness commercial say, "Brilliant!" The important thing to note here is that the zfs-diagnosis engine has run several times indicating that there is a problem somewhere in one of my pools. I'm also running this on Nevada so the zfs-retire engine has also run, kicking in a hot spare due to excessive errors.
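
If you would rather not scan the whole table by eye, a small awk filter can pick out the ZFS modules that have seen events. This is only a sketch: the function name zfs_fm_activity is made up for this example, and it assumes the column order shown in the fmstat header above (module, ev_recv, ev_acpt, ...).

```shell
#!/bin/sh
# zfs_fm_activity (hypothetical helper): read fmstat output on stdin and print
# "module ev_recv ev_acpt" for each zfs-* module that has received an event.
# Assumes the column layout shown in the fmstat header in this post.
zfs_fm_activity() {
    awk '$1 ~ /^zfs-/ && $2 > 0 { print $1, $2, $3 }'
}

# Typical use on a live system:
#   fmstat | zfs_fm_activity
```

No output means neither ZFS engine has received anything since the counters were last reset.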

So which pool is having the problems ? We continue our FMA investigation to find out.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e  ZFS-8000-GH    Major    

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.


# zpool status -x
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress, 44.83% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pool1         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              disk1   DEGRADED     0     0   162  too many errors
              spare1  ONLINE       0     0     0
            disk2     ONLINE       0     0     0
        spares
          spare1      INUSE     currently in use

errors: No known data errors

This tells us all that we need to know. The device disk1 was found to have quite a few checksum errors - so many in fact that it was replaced automatically by a hot spare. The spare was resilvering and a full complement of data replicas would be available soon. The entire process was automatic and completely observable.
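
With more than a couple of pools, the same information can be pulled out of the status output mechanically. The sketch below is an assumption-laden example rather than a supported interface: the function name vdev_errors is invented here, the list of device states is not exhaustive, and it relies on the NAME/STATE/READ/WRITE/CKSUM layout shown above.

```shell
#!/bin/sh
# vdev_errors (hypothetical helper): read zpool status output on stdin and
# print "pool device READ WRITE CKSUM" for every config row with a non-zero
# error counter. Counters are compared as strings because large values are
# abbreviated (for example 2.60K).
vdev_errors() {
    awk '
        $1 == "pool:" { pool = $2 }
        $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ &&
        $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") { print pool, $1, $3, $4, $5 }
    '
}

# Typical use on a live system:
#   zpool status | vdev_errors
```

The checks on the third through fifth fields are there to skip the "state:" header line and the spares section, where those columns hold words rather than counters.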

Since we inflicted harm upon the (fake) disk device ourselves, we know that it is in fact quite healthy. So we can restore our pool to its original configuration rather simply - by detaching the spare and clearing the error. We should also clear the FMA counters and repair the ZFS vdev so that we can tell if anything else is misbehaving in either this or another pool.
# zpool detach pool1 spare1
# zpool clear pool1
# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 10:25:26 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors


# fmadm reset zfs-diagnosis
# fmadm reset zfs-retire
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  223.5   0   0     0     0    32b      0
eft                      1       0  0.0    4.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       4       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             4       0  0.0    8.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  372.7   0   0     0     0      0      0
syslog-msgs              4       0  0.0    5.4   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    1.4   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0


# fmdump -v -u d82d1716-c920-6243-e899-b7ddd386902e
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 09:51:49.3025 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
  100%  fault.fs.zfs.vdev.checksum

        Problem in: 
           Affects: zfs://pool=pool1/vdev=449a3328bc444732
               FRU: -
          Location: -

# fmadm repair zfs://pool=pool1/vdev=449a3328bc444732
fmadm: recorded repair to zfs://pool=pool1/vdev=449a3328bc444732

# fmadm faulty
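
The fmdump and fmadm repair steps above involve copying a rather long resource name by hand. A small filter can do the copying instead. This is a sketch only: affected_resources is a name invented for this example, and it assumes the "Affects:" line format shown in the fmdump -v output above.

```shell
#!/bin/sh
# affected_resources (hypothetical helper): read fmdump -v output on stdin and
# print the resource named on each "Affects:" line.
affected_resources() {
    awk '$1 == "Affects:" { print $2 }'
}

# Possible use on a live system, with $uuid set to the EVENT-ID of interest:
#   fmdump -v -u "$uuid" | affected_resources |
#       while read -r fmri; do fmadm repair "$fmri"; done
```

Only do this for faults you know are bogus, as in this exercise where we caused the damage ourselves.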

Problem 2: Device failure

Time to do a little more harm. In this case I will simulate the failure of a device by removing the fake device. Again we will access the pool and then consult fmstat to see what is happening (are you noticing a pattern here????).
# rm -f /dev/dsk/disk2
# cd /pool1
# find . -print | cpio -oc > /dev/null
416027 blocks

# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  214.2   0   0     0     0    32b      0
eft                      1       0  0.0    4.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       4       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             4       0  0.0    8.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  372.7   0   0     0     0      0      0
syslog-msgs              4       0  0.0    5.4   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    1.4   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0

Rats, the find ran totally out of cache from the last example. As before, should this happen, proceed directly to zpool scrub.
# zpool scrub pool1
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  190.5   0   0     0     0    32b      0
eft                      1       0  0.0    4.1   0   0     0     0   1.4M      0
fmd-self-diagnosis       5       0  0.0    0.5   0   0     0     0      0      0
io-retire                1       0  0.0    1.0   0   0     0     0      0      0
snmp-trapgen             6       0  0.0    7.4   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  329.0   0   0     0     0      0      0
syslog-msgs              6       0  0.0    4.6   0   0     0     0      0      0
zfs-diagnosis           16       1  0.0   70.3   0   0     1     1   168b   140b
zfs-retire               1       0  0.0  509.8   0   0     0     0      0      0

Again, hot sparing has kicked in automatically. The evidence of this is the zfs-retire engine running.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 11:07:29 50ea07a0-2cd9-6bfb-ff9e-e219740052d5  ZFS-8000-D3    Major    
Feb 18 11:16:43 06bfe323-2570-46e8-f1a2-e00d8970ed0d

Fault class : fault.fs.zfs.device

Description : A ZFS device failed.  Refer to http://sun.com/msg/ZFS-8000-D3 for
              more information.

Response    : No automated response will occur.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

# zpool status -x
  pool: pool1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver in progress, 4.94% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pool1         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            disk1     ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              disk2   UNAVAIL      0     0     0  cannot open
              spare1  ONLINE       0     0     0
        spares
          spare1      INUSE     currently in use

errors: No known data errors

As before, this tells us all that we need to know. A device (disk2) has failed and is no longer in operation. Sufficient spares existed and one was automatically attached to the damaged pool. Once the resilver completes, the data will once again be fully mirrored.

But here's the magic. Let's repair the device - again simulated with our fake device.
# mkfile 1g /dev/dsk/disk2
# zpool replace pool1 disk2
# zpool status pool1 
  pool: pool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 4.86% done, 0h1m to go
config:

        NAME               STATE     READ WRITE CKSUM
        pool1              DEGRADED     0     0     0
          mirror           DEGRADED     0     0     0
            disk1          ONLINE       0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                disk2/old  UNAVAIL      0     0     0  cannot open
                disk2      ONLINE       0     0     0
              spare1       ONLINE       0     0     0
        spares
          spare1           INUSE     currently in use

errors: No known data errors

Get a cup of coffee while the resilvering process runs.
# zpool status
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   


# fmadm faulty
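
If coffee is not an option, the wait can be scripted. The sketch below polls zpool status until the "resilver in progress" text seen in the output above goes away. Both function names and the default 10 second interval are assumptions made for this example.

```shell
#!/bin/sh
# resilver_in_progress (hypothetical helper): succeed (exit 0) if the zpool
# status text on stdin reports a resilver that is still running.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# wait_for_resilver POOL [INTERVAL] (hypothetical helper): poll the pool every
# INTERVAL seconds (default 10) until no resilver is reported.
wait_for_resilver() {
    pool=$1
    interval=${2:-10}
    while zpool status "$pool" | resilver_in_progress; do
        sleep "$interval"
    done
}

# Possible use on a live system:
#   wait_for_resilver pool1 && zpool status pool1
```

This matches on the status wording from this particular build, so check that your zpool status prints the same phrase before relying on it.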

Notice the nice integration with FMA. Not only was the new device resilvered, but the hot spare was detached and the FMA fault was cleared. The fmstat counters still show that there was a problem and the fault report still exists in the fault log for later interrogation.
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  171.5   0   0     0     0    32b      0
eft                      1       0  0.0    3.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       6       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    0.9   0   0     0     0      0      0
snmp-trapgen             6       0  0.0    6.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  294.3   0   0     0     0      0      0
syslog-msgs              6       0  0.0    4.2   0   0     0     0      0      0
zfs-diagnosis           36       1  0.0   51.6   0   0     0     1      0      0
zfs-retire               1       0  0.0  170.0   0   0     0     0      0      0

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Feb 16 11:38:16.0976 48935791-ff83-e622-fbe1-d54c20385afc ZFS-8000-GH
Feb 16 11:38:30.8519 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH
Feb 18 09:51:49.3025 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713 ZFS-8000-GH
Feb 18 09:56:24.8029 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
Feb 18 10:23:07.2228 7c04a6f7-d22a-e467-c44d-80810f27b711 ZFS-8000-GH
Feb 18 10:25:14.6429 faca0639-b82b-c8e8-c8d4-fc085bc03caa ZFS-8000-GH
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
Feb 18 11:16:44.2497 06bfe323-2570-46e8-f1a2-e00d8970ed0d ZFS-8000-D3


# fmdump -V -u 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3

  TIME                 CLASS                                 ENA
  Feb 18 11:07:27.8476 ereport.fs.zfs.vdev.open_failed       0xb22406c635500401

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
        code = ZFS-8000-D3
        diag-time = 1203354449 236999
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = Dimension XPS                
                        chassis-id = 7XQPV21
                        server-id = arrakis
                (end authority)

                mod-name = zfs-diagnosis
                mod-version = 1.0
        (end de)

        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x3a2ca6bebd96cfe3
                        vdev = 0xedef914b5d9eae8d
                (end asru)

                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x3a2ca6bebd96cfe3
                        vdev = 0xedef914b5d9eae8d
                (end resource)

        (end fault-list[0])

        fault-status = 0x3
        __ttl = 0x1
        __tod = 0x47b9bb51 0x1ef7b430

# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset

# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset

Problem 3: Unrecoverable corruption

As those of you who have attended one of my Boot Camps or Solaris Best Practices training classes know, House is one of my favorite TV shows - the only one that I watch regularly. And this next example would make a perfect episode. Is it likely to happen ? No, but it is so cool when it does :-)

Remember our second pool, pool2. It has the same contents as pool1. Now, let's do the unthinkable - let's corrupt both halves of the mirror. Surely data loss will follow, but the fact that Solaris stays up and running and can report what happened is pretty spectacular. But it gets so much better than that.
# dd if=/dev/zero of=/dev/dsk/disk3 bs=8192 count=10000 conv=notrunc
# dd if=/dev/zero of=/dev/dsk/disk4 bs=8192 count=10000 conv=notrunc
# zpool scrub pool2

# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  166.0   0   0     0     0    32b      0
eft                      1       0  0.0    3.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       6       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    0.9   0   0     0     0      0      0
snmp-trapgen             8       0  0.0    6.3   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  294.3   0   0     0     0      0      0
syslog-msgs              8       0  0.0    3.9   0   0     0     0      0      0
zfs-diagnosis         1032    1028  0.6   39.7   0   0    93     2    15K    13K
zfs-retire               2       0  0.0  158.5   0   0     0     0      0      0

As before, lots of zfs-diagnosis activity. And two hits to zfs-retire. But we only have one spare - this should be interesting. Let's see what is happening.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e  ZFS-8000-GH    Major    
Feb 18 13:18:42 c3889bf1-8551-6956-acd4-914474093cd7

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 16 11:38:30 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233  ZFS-8000-GH    Major    
Feb 18 09:51:49 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713
Feb 18 10:23:07 7c04a6f7-d22a-e467-c44d-80810f27b711
Feb 18 13:18:42 0a1bf156-6968-4956-d015-cc121a866790

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

# zpool status -x
  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        pool2         DEGRADED     0     0 2.60K
          mirror      DEGRADED     0     0 2.60K
            spare     DEGRADED     0     0 2.43K
              disk3   DEGRADED     0     0 5.19K  too many errors
              spare1  DEGRADED     0     0 2.43K  too many errors
            disk4     DEGRADED     0     0 5.19K  too many errors
        spares
          spare1      INUSE     currently in use

errors: 247 data errors, use '-v' for a list

So ZFS tried to bring in a hot spare, but there were insufficient replicas to be able to reconstruct all of the data. But here is where it gets interesting. Let's see what zpool status -v says about things.
# zpool status -v
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    INUSE     in use by pool 'pool2'

errors: No known data errors

  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        pool2         DEGRADED     0     0 2.60K
          mirror      DEGRADED     0     0 2.60K
            spare     DEGRADED     0     0 2.43K
              disk3   DEGRADED     0     0 5.19K  too many errors
              spare1  DEGRADED     0     0 2.43K  too many errors
            disk4     DEGRADED     0     0 5.19K  too many errors
        spares
          spare1      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        /pool2/scenic/cider mill crowds.jpg
        /pool2/scenic/Cleywindmill.jpg
        /pool2/scenic/csg_Landscapes001_GrandTetonNationalPark,Wyoming.jpg
        /pool2/scenic/csg_Landscapes002_ElowahFalls,Oregon.jpg
        /pool2/scenic/csg_Landscapes003_MonoLake,California.jpg
        /pool2/scenic/csg_Landscapes005_TurretArch,Utah.jpg
        /pool2/scenic/csg_Landscapes004_Wildflowers_MountRainer,Washington.jpg
        /pool2/scenic/csg_Landscapes!idx011.jpg
        /pool2/scenic/csg_Landscapes127_GreatSmokeyMountains-NorthCarolina.jpg
        /pool2/scenic/csg_Landscapes129_AcadiaNationalPark-Maine.jpg
        /pool2/scenic/csg_Landscapes130_GettysburgNationalPark-Pennsylvania.jpg
        /pool2/scenic/csg_Landscapes131_DeadHorseMill,CrystalRiver-Colorado.jpg
        /pool2/scenic/csg_Landscapes132_GladeCreekGristmill,BabcockStatePark-WestVirginia.jpg
        /pool2/scenic/csg_Landscapes133_BlackwaterFallsStatePark-WestVirginia.jpg
        /pool2/scenic/csg_Landscapes134_GrandCanyonNationalPark-Arizona.jpg
        /pool2/scenic/decisions decisions.jpg
        /pool2/scenic/csg_Landscapes135_BigSur-California.jpg
        /pool2/scenic/csg_Landscapes151_WataugaCounty-NorthCarolina.jpg
        /pool2/scenic/csg_Landscapes150_LakeInTheMedicineBowMountains-Wyoming.jpg
        /pool2/scenic/csg_Landscapes152_WinterPassage,PondMountain-Tennessee.jpg
        /pool2/scenic/csg_Landscapes154_StormAftermath,OconeeCounty-Georgia.jpg
        /pool2/scenic/Brig_Of_Dee.gif
        /pool2/scenic/pvnature14.gif
        /pool2/scenic/pvnature22.gif
        /pool2/scenic/pvnature7.gif
        /pool2/scenic/guadalupe.jpg
        /pool2/scenic/ernst-tinaja.jpg
        /pool2/scenic/pipes.gif
        /pool2/scenic/boat.jpg
        /pool2/scenic/pvhawaii.gif
        /pool2/scenic/cribgoch.jpg
        /pool2/scenic/sun1.gif
        /pool2/scenic/sun1.jpg
        /pool2/scenic/sun2.jpg
        /pool2/scenic/andes.jpg
        /pool2/scenic/treesky.gif
        /pool2/scenic/sailboatm.gif
        /pool2/scenic/Arizona1.jpg
        /pool2/scenic/Arizona2.jpg
        /pool2/scenic/Fence.jpg
        /pool2/scenic/Rockwood.jpg
        /pool2/scenic/sawtooth.jpg
        /pool2/scenic/pvaptr04.gif
        /pool2/scenic/pvaptr07.gif
        /pool2/scenic/pvaptr11.gif
        /pool2/scenic/pvntrr01.jpg
        /pool2/scenic/Millport.jpg
        /pool2/scenic/bryce2.jpg
        /pool2/scenic/bryce3.jpg
        /pool2/scenic/monument.jpg
        /pool2/scenic/rainier1.gif
        /pool2/scenic/arch.gif
        /pool2/scenic/pv-anzab.gif
        /pool2/scenic/pvnatr15.gif
        /pool2/scenic/pvocean3.gif
        /pool2/scenic/pvorngwv.gif
        /pool2/scenic/pvrmp001.gif
        /pool2/scenic/pvscen07.gif
        /pool2/scenic/pvsltd04.gif
        /pool2/scenic/banhall28600-04.JPG
        /pool2/scenic/pvwlnd01.gif
        /pool2/scenic/pvnature08.gif
        /pool2/scenic/pvnature13.gif
        /pool2/scenic/nokomis.jpg
        /pool2/scenic/lighthouse1.gif
        /pool2/scenic/lush.gif
        /pool2/scenic/oldmill.gif
        /pool2/scenic/gc1.jpg
        /pool2/scenic/gc2.jpg
        /pool2/scenic/canoe.gif
        /pool2/scenic/Donaldson-River.jpg
        /pool2/scenic/beach.gif
        /pool2/scenic/janloop.jpg
        /pool2/scenic/grobacro.jpg
        /pool2/scenic/fnlgld.jpg
        /pool2/scenic/bells.gif
        /pool2/scenic/Eilean_Donan.gif
        /pool2/scenic/Kilchurn_Castle.gif
        /pool2/scenic/Plockton.gif
        /pool2/scenic/Tantallon_Castle.gif
        /pool2/scenic/SouthStockholm.jpg
        /pool2/scenic/BlackRock_Cottage.jpg
        /pool2/scenic/seward.jpg
        /pool2/scenic/canadian_rockies_csg110_EmeraldBay.jpg
        /pool2/scenic/canadian_rockies_csg111_RedRockCanyon.jpg
        /pool2/scenic/canadian_rockies_csg112_WatertonNationalPark.jpg
        /pool2/scenic/canadian_rockies_csg113_WatertonLakes.jpg
        /pool2/scenic/canadian_rockies_csg114_PrinceOfWalesHotel.jpg
        /pool2/scenic/canadian_rockies_csg116_CameronLake.jpg
        /pool2/scenic/Castilla_Spain.jpg
        /pool2/scenic/Central-Park-Walk.jpg
        /pool2/scenic/CHANNEL.JPG



In my best Hugh Laurie voice trying to sound very Northeastern American, that is so cool! But we're not even done yet. Let's take this list of files and restore them - in this case, from pool1. Operationally this would be from a backup tape or nearline backup cache, but for our purposes, the contents in pool1 will do nicely.

First, let's clear the zpool error counters and return the spare disk. We want to make sure that our restore works as desired. Oh, and clear the FMA stats while we're at it.
# zpool clear pool2
# zpool detach pool2 spare1

# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset

# fmadm reset zfs-retire   
fmadm: zfs-retire module has been reset

Now individually restore the files that have errors in them and check again. You can even export and reimport the pool and you will find a very nice, happy, and thoroughly error-free ZFS pool. Some rather unpleasant gnashing of zpool status -v output with awk has been omitted for sanity's sake.
# zpool scrub pool2
# zpool status pool2
  pool: pool2
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 14:04:56 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

# zpool export pool2
# zpool import pool2
# dircmp -s /pool1 /pool2
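
For the curious, the omitted awk step could look roughly like the sketch below. It is not the script that was actually used. It assumes the damaged files are listed after a line beginning with "errors: Permanent errors" in the zpool status -v output, and that none of the paths contain spaces. The function names list_damaged_files and restore_source are made up for this example.

```shell
# Print the damaged file paths found in `zpool status -v` output (stdin).
# Assumption: the file list follows the "errors: Permanent errors" line
# and every path is a single whitespace-free field.
list_damaged_files() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }
        in_list && $1 ~ /^\// { print $1 }
    '
}

# Map a path under /pool2 to the matching path under /pool1,
# which stands in for the backup copy in this exercise.
restore_source() {
    printf '%s\n' "$1" | sed 's|^/pool2/|/pool1/|'
}

# On a live system the two pieces might be combined like this (not run here):
#   zpool status -v pool2 | list_damaged_files | while IFS= read -r f; do
#       cp -p "$(restore_source "$f")" "$f"
#   done
```

Keeping the parsing in a function that reads standard input makes it easy to try against a saved copy of the status output before pointing it at a real pool.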

Conclusions and Review

So what have we learned ? ZFS and FMA are two great tastes that taste great together. No, that's chocolate and peanut butter, but you get the idea. One more great example of Isaac's Multiplicity of Solaris.

That, and I have finally found a good lab exercise for the FMA training materials. Ever since Christine Tran put the FMA workshop together, we have been looking for some good FMA lab exercises. The materials reference a synthetic fault generator that is not publicly available (for obvious reasons). I haven't explored the FMA test harness enough to know if there is anything in there that would make a good lab. But the exercise that we have just explored seems to tie a number of key pieces together.

And of course, one more reason why Roxy says, "You should run Solaris."

Feb 18 2008, 02:30:49 PM CST Permalink Comments [3]

20080217 Sunday February 17, 2008
And Trixie says .....

Get rid of Roxy.....



And make me the peanut butter and banana sandwich!




Feb 17 2008, 10:14:24 PM CST Permalink Comments [0]

Roxy says "You Should Run Solaris"

Look into my eyes.
You want to run Solaris.
You want to run Solaris.
You want to run Solaris.

Repeat after me
You want to run Solaris.

And make me a peanut butter and banana sandwich!



Feb 17 2008, 04:49:14 PM CST Permalink Comments [0]

20071003 Wednesday October 03, 2007
Live Upgrade from Solaris 10 11/06 to 8/07 without nonglobal zones
Live Upgrade is one of the most useful Solaris features, yet in my travels around the US I still don't see it used as much as I would like. I can think of several reasons for this - not all of them totally valid. And I'm sure there are other reasons, but these are the ones I hear most often.

Let's turn our attention to the topic at hand, upgrading a Solaris 10 11/06 system to 8/07, without zones. This example will be on an x64 system, but the SPARC approach is similar.

If you have read my earlier blog on Live Upgrade, you will recall the process is
  1. Read Infodoc 72099 and install any required patches
  2. Install the LU packages SUNWluu SUNWlur and SUNWlucfg (if present) from the installation media
  3. lurename(1m) if you want to change the name of your new boot environment
  4. lumake(1m) or ludelete(1m) + lucreate(1m) to repopulate the target boot environment with the proper software and configuration files
  5. luupgrade(1m) to upgrade the target boot environment
  6. luactivate(1m) to activate the new boot environment
  7. init 0 to perform the file synchronization and conversions, create the new boot archive and update your GRUB menu


So I fire up my web browser and run over to SunSolve to pick up Infodoc 72099 and see a rather large set of patches. And there are two lists, one for systems with non-global zones and one without. Since we're looking at a system without non-global zones we will start with the shorter of the two lists (the next article will cover systems with nonglobal zones).

Apparently we need patches

Release     Arch  Patch                 Description
Solaris 10  x86   118816-03 or higher   nawk patch
Solaris 10  x86   120901-03 or higher   libzonecfg patch
Solaris 10  x86   121334-04 or higher   SUNWzoneu required patch
Solaris 10  x86   119255-42 or higher   patchadd/patchrm patches
Solaris 10  x86   119318-01 or higher   SVr4 Packaging Commands (usr) Patch
Solaris 10  x86   117435-02 or higher   biosdev patch for GRUB Boot

Reboot after installation

Solaris 10  x86   120236-01 or higher   SUNWluzone required patches
Solaris 10  x86   121429-08 or higher   SUNWluzone required patches
Solaris 10  x86   121003-03 or higher   pax patch
Solaris 10  x86   123122-02 or higher   prodreg patch
Solaris 10  x86   121005-03             sh patch
Solaris 10  x86   119043-10             /usr/sbin/svccfg patch
Solaris 10  x86   121902-02             i.manifest r.manifest class action script patch
Solaris 10  x86   120901-03             libzonecfg patch
Solaris 10  x86   120069-03             telnet security patch
Solaris 10  x86   120070-02             cpio patch
Solaris 10  x86   123333-01             tftp patch


Hmmm, seems like a lot of patches and a required reboot! So I fire up our new friend updatemanager to patch my system. I see that there is a new updatemanager patch available (121119-13), so I installed that one all by itself and restarted updatemanager.

I soon realize that my choice of patching tools is making this a bit challenging. Users of patch tools such as Patch Check Advanced (PCA) may have an easier time, but I was determined to do this with updatemanager, with occasional help from the patch READMEs in SunSolve.

The list of patches required for this upgrade applies to any release of Solaris 10. A fresh install of a Solaris 10 11/06 system only needed the following four patches - which is a lot better than I first thought.
	 
119255-42
121429-08
126539-01 as it replaces the required 121902-02
125419-01 as it replaces the required 120069-03

The difficulty with updatemanager was with the set of obsoleted patches. Working out that something like the required 121902-02 had been obsoleted by 126539-01, which was the patch that got installed, took a bit of manual trolling through patch READMEs. So I'll save you the research - it came down to only the four above patches.

One important note: the required reboot after patch 117435-02 wasn't needed after all - so I'll try to save all of you Solaris 10 11/06 users one reboot. While I have your attention, it is a good idea, if not a best practice, to install the patches for the patch and packaging utilities separately from the others.
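
If you would rather not troll through READMEs by hand, a rough sketch like the one below can help. It reads showrev -p style output and reports whether a required patch is present, either directly or because an installed patch lists it under Obsoletes. The function name have_patch is made up, the sample lines in the comments are illustrative rather than real output from my system, and on Solaris 10 you would probably need nawk or /usr/xpg4/bin/awk in place of awk for the -v option.

```shell
# have_patch REQUIRED-REV   (reads `showrev -p` style lines on stdin)
# Succeeds if REQUIRED-REV, or a higher revision of the same patch ID,
# appears either as an installed patch or in an Obsoletes list.
# Assumed line layout (illustrative):
#   Patch: 126539-01 Obsoletes: 121902-02 Requires:  Incompatibles:  Packages: ...
have_patch() {
    awk -v want="$1" '
        BEGIN { split(want, w, "-"); found = 0 }
        $1 == "Patch:" {
            # Walk the Patch and Obsoletes fields, stopping at Requires.
            for (i = 2; i <= NF && $i != "Requires:"; i++) {
                f = $i
                gsub(/,/, "", f)
                if (split(f, p, "-") == 2 && p[1] == w[1] && p[2] + 0 >= w[2] + 0)
                    found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Possible use on a live system (not run here):
#   showrev -p | have_patch 121902-02 && echo "121902-02 is covered"
```

This only looks one level deep, so a chain of obsoleting patches would still need a look at the READMEs.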

Feeling a lot better about this process, I proceed and install the four required patches using updatemanager in two steps (119255-42 and then the other three patches) and all succeeded, as expected. All that was left to do was finish the standard procedure
# mount -o ro -F hsfs `lofiadm -a /export/iso/s10u4/solarisdvd.iso` /mnt 
# pkgadd -d /mnt/Solaris_10/Product SUNWlur SUNWluu SUNWlucfg 
# lurename -e nv71 -n s10u4 
# lumake -n s10u4 
# luupgrade -u -s /mnt -n s10u4 
# luactivate s10u4 
# init 0 


And all went as expected. Next time I will tackle the longer list of patches and examine the same upgrade path, but with nonglobal zones.

Oct 03 2007, 03:08:09 PM CDT Permalink Comments [4]

20070621 Thursday June 21, 2007
Updated Solaris Bootcamp Presentations
I've had a great time traveling around the country talking about Solaris. It's not exactly a difficult thing - there's plenty to talk about. Many of you have asked for copies of the latest Solaris update, virtualization overview and ZFS deep dive. Rather than have you dig through a bunch of old blog entries about bootcamps from 2005, here they are for your convenience.



I hope this will save you some digging through http://mediacast.sun.com and tons of old blogs.

In a few weeks I'll post a new "What's New in Solaris" which will have some really cool things. But we'll save that for later.

Jun 21 2007, 02:43:30 PM CDT Permalink Comments [0]

20070611 Monday June 11, 2007
True Virtualization ?
While this is inspired by a recent conversation with a customer, I have seen the term "true virtualization" used quite a bit lately - mostly by people who have just attended a VMware seminar, and to a lesser extent folks from IBM trying to compare LPARS with Solaris zones. While one must give due credit to the fine folks at VMware for raising Information Technology (IT) awareness and putting virtualization in the common vocabulary, they have hardly cornered the market on virtualization. Using the term "true virtualization" may reveal either how narrow an understanding they have of the concept or an unfortunate arrogance that their approach is the only one that matters.

Wikipedia defines virtualization as a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources. While Wikipedia isn't the final authority, this definition is quite good and we will use it to start our exploration.

So what is true virtualization ? Anything that (potentially) hides architectural details from running objects (programs, services, operating systems, data). No more, no less - end of discussion.

Clearly VMware's virtualization products (ESX, Workstation) do that. They provide virtual machines that emulate the Intel x86 Instruction Set Architecture (ISA) so that operating systems think they are running on real hardware when in fact they are not. This type of virtualization would be classified as an abstraction type of virtual machines. But so is Xen, albeit with an interesting twist. In the case of Xen, a synthetic ISA based on the x86 is emulated removing some of the instructions that are difficult to virtualize. This makes porting a rather simple task - none of the user space code needs to be modified and the privileged code is generally limited to parts of the kernel that actually touch the hardware (virtual memory management, device drivers). In some respects, Xen is less of an abstraction as it does allow the virtual machines to see the architectural details thus permitting specific optimizations to occur that would be prohibited in the VMware case. And our good friends at Intel and AMD are adding new features to their processors to make virtualization less complicated and higher performance so the differences in approach between the VMware and Xen hypervisors may well blur over time.

But is this true virtualization ? No, it is just one of many types of virtualization.

How about the Java Virtual Machine (JVM) ? It is a run time executive that provides a virtualized environment for a completely synthetic ISA (although real pcode implementations have been done, they are largely for embedded systems). This is the magic behind write once and run anywhere and in general the approach works very well. So this is another example of virtualization - and also an abstraction type. And given the number of JVMs running around out there - if anyone is going to claim true virtualization, it would be the Java folks. Fortunately their understanding of the computer industry is broad and they are not arrogant - thus they would never suggest such folly.

Sun4v Logical Domains (LDOMs) are a thin hypervisor based partitioning of a radically multithreaded SPARC processor. The guest domains (virtual machines) run on real hardware but generally have no I/O devices. These guest domains get their I/O over a private channel from a service domain (a special type of domain that owns devices and contains the real device drivers). So I/O is virtualized but all other operations are executed on real hardware. The hypervisor provides resource (CPU and memory) allocation and management and the private channels for I/O (including networking). This too is virtualization, but not like Xen or VMware. This is an example of partitioning. Another example is IBM (Power) LPARS albeit with a slightly different approach.

Are there other types of virtualization ? Of course there are.

Solaris zones are an interesting type of virtualization called OS Virtualization. In this case we interpose the virtualization layer between the privileged kernel layer and the non-privileged user space. The benefit here is that all user space objects (name space, processes, address spaces) are completely abstracted and isolated. Unlike the methods previously discussed, the kernel and underlying hardware resources are not artificially limited, so the full heavy lifting capability of the kernel is available to all zones (subject to other resource management policies). The trade-off for this capability is that all zones share a common kernel. This has some availability and flexibility limitations that should be considered in a system design using zones. Non-native (Branded) zones offer some interesting flexibilities that we are just now beginning to exploit, so the future of this approach is very bright indeed. And if I read my competitors' announcements correctly, even our good friends at IBM are embracing this approach with future releases of AIX. So clearly there is something to this thing called OS Virtualization.

And there are other approaches as well - hybrids of the types we have been discussing. Special purpose libraries that either replace or interpose between common system libraries can provide some very nice virtualization capabilities - some of these transparent to applications, some not. The open source project Wine is a good example of this. User mode Linux and its descendants offer some abilities to run an operating system as a user mode program, albeit not particularly efficiently.

QEMU is an interesting general purpose ISA simulator/translator that can be used to host non-native operating systems (such as Windows while running Solaris or Linux). The interesting thing about QEMU is that you can strip out the translation features with a special kernel module (kqemu) and the result is very efficient and nicely performing OS hosting (essentially simulating x86 running on x86). Kernel-based Virtual Machines (KVM) extends the QEMU capability to add yet another style of virtualization to Linux. It is not entirely clear at present whether KVM is really a better idea or just another not invented here (NIH) Linux project. Time will tell, but it would have been nice for the Linux kernel maintainers to take a page from OpenSolaris and embrace an already existing project that had some non-Linux vendor participation (*BSD, Solaris, Plan 9, plus some mainstream Linux distributions). At the very least it is confusing as most experienced IT professionals will associate KVM with Keyboard Video and Mouse switching products. There are other commercial products such as QuickTransit that use a similar approach (ISA translation).

And there are many many more.

So clearly the phrase "true virtualization" has no common or useful meaning. Questioning the application or definition of the phrase will likely uncover a predisposition or bias that might be a good starting point to carry on an interesting dialog. And that's always a good idea.

I leave you with one last thought. It is probably human nature to seek out the one uniform solution to all of our problems, the Grand Unification Theory being a great example. But in general, be skeptical of one size fits all approaches - while they may in fact fit all situations, they are generally neither efficient nor flattering. What does this have to do with virtualization ? Combining various techniques quite often will yield spectacular results. In other words, don't think VMware vs Zones - think VMware and Zones. In fact if you think Solaris, don't even think about zones, just do zones. If you need the additional abstraction to provide flexibility (heterogeneous or multiple version OS support) then use VMware or LDOMs. And zones.

Next time we'll take a look at abstraction style virtualization techniques and see if we can develop a method of predicting the overhead that each technique might impose on a system. Since a good apples to apples benchmark is not likely to ever see the light of day, perhaps some good old fashioned reasoning can help us make sense of what information we can find.

Jun 11 2007, 05:12:39 PM CDT Permalink Comments [0]

20070326 Monday March 26, 2007
Securing MySQL using SMF - the Ultimate Manifest
The best way to learn the Solaris Service Management Facility (SMF) is to migrate a legacy service. The version of MySQL that comes with Solaris is an ideal application. It is relatively simple, has few dependencies, and can be done in just a few quick edits of an existing manifest (utmp would be a good starting template). We cover the basic process in the SMF Deep Dive and various people have contributed manifests to OpenSolaris and Blastwave. While these are good illustrations of how easy the process is, few show what SMF can really do. The motivation for this how-to came from a recent Solaris Bootcamp attendee who asked "what was wrong with the RC scripts the way they were ?".

Without skipping a beat.....
  1. Easy support of multiple service instances
  2. Deterministic location of service log files
  3. Timeouts on the start and stop methods to prevent system boots from hanging indefinitely.
  4. Quickly observable service state
  5. Flexible service dependencies
  6. Automatic restarting of the service upon failure
Upon closer inspection, recognizing when the service terminated and restarting it automatically isn't that special for mysql. The mysqld_safe daemon actually performs that step, restarting the database server if it fails. Yes, this is unique to mysql and may not exist for other services. Certainly, if the mysqld_safe parent actually fails, SMF does provide an additional capability by automatically restarting it. But we need more.

Most of the service migration demonstrations are single instance with no downstream application dependencies - so we still need more.

The mysql service start script runs through a set of configuration files, setting variables and starting a detached daemon, so it's highly unlikely that it will ever get stuck. Sure, it can get hacked and have bad things happen to it, but as delivered it is relatively safe. So we still need more.

The answer to the question lies in security. SMF provides a rich set of security features that demonstrate the power of Solaris Role Based Access Control (RBAC) and least privilege. Contrary to what you might think, these features are quite easy to use - once you learn a few simple concepts. This is how we will answer the question "what was wrong with the RC scripts the way they were?".

Authorizations

One of the most useful applications of RBAC is to create administration and operations roles. While the details of these roles will vary from customer to customer, the common theme is that operator roles should be able to start and stop a service in a safe manner and an administrative role should be able to modify service properties (some of which may control the ability to start or stop the service).

Historically this has been accomplished by third party security software inserting itself all over the kernel (sometimes in a manner that makes upgrades or maintenance difficult) or custom scripts that make use of setuid(2). Solaris 10 can perform many of these functions with just a few entries to some configuration files, and SMF makes this process extremely easy.

You can get lots of valuable information on Solaris Security features (roles, profiles, auths, privileges) at the OpenSolaris Security Community. As you navigate the wealth of white papers, ARC cases, and how-to examples, think of Solaris authorizations as the magic that makes this possible (or more precisely simple).

In a sentence, auths are labels that a privileged application uses to restrict access to its features. In our case the privileged applications are svcadm(1M) and svccfg(1M). If you read the smf_security(5) man page (which is excellent reading) you will see that SMF provides several authorizations. Now this is getting interesting. So it appears that we can use either the action or modify authorization for the operator role. So which one do we use ?

The action_authorization would only allow running the method but not modifying any of the properties. The implication is that you can do
# svcadm enable -t mysql
but not
# svcadm enable mysql
The difference between the two commands is that enable without -t will try to set the property general/enabled to true in addition to running the start method. This would require the value_authorization. But value_authorization will allow you to change (almost) any property in the property group (in this case the general property group), so let's see what else value_authorization will let you do.
# svcprop -p general ssh
general/enabled boolean true
general/action_authorization astring solaris.smf.manage.ssh
general/entity_stability astring Unstable
general/single_instance boolean true
Hmmm, the only properties that might be abused would be the authorizations, but those require additional authorizations (solaris.smf.modify) to change. So it would seem that value_authorization would be safe for an operator role - unorthodox perhaps, but safe. modify_authorization would allow the creation of other service properties, and if limited to the general property group might be confusing, but relatively harmless - unless of course we add a new general property later. For this reason, modify_authorization would not be a good candidate for an operator role.

So which authorization to use ? Use action_authorization if you want a user (or role) to be able to start and stop the service, but not make the change permanent. This is the most common case. Use value_authorization in the general property group if you want that user or role to be able to permanently turn a service on or off - this is generally an administrative role.

Let's put this all together.

Start with your existing SMF manifest for MySQL. If you don't have one, you can use mine at http://blogs.sun.com/resources/bobn/mysql.xml or Keith Lawson's contributed MySQL manifest at the OpenSolaris SMF Contributed Manifests and Methods page.

Add the following section
<property_group name='general' type='framework'>
        <propval name='action_authorization' type='astring'    value='mysql.operator' />
        <propval name='value_authorization' type='astring'   value='mysql.administrator' />
</property_group>

Import the new manifest by the method of your choice (svccfg import, /lib/svc/method/manifest-import, or reboot) and your new MySQL service can be managed by auths. So how do we get those auths assigned to users (or roles) ?

Authorizations are granted to users and roles by the configuration file /etc/user_attr. You can read the user_attr(4) man page for all of the details, but the process is to add auths=mysql.operator to the user or role entry. For example
# grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator
It is possible that a user or role may not be present in /etc/user_attr. In that case just add a line like the one above and assign the appropriate auth.
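
As a quick sanity check on an entry, something like the sketch below pulls the auths value out of a user_attr(4) style line. It assumes the five colon-separated fields shown above with semicolon-separated key=value pairs in the last one, and that no attribute value itself contains a colon. The function name user_auths is made up for this example, and on Solaris 10 you would probably need nawk or /usr/xpg4/bin/awk in place of awk for the -v option.

```shell
# user_auths USER   (reads user_attr(4) style lines on stdin)
# Prints the comma-separated auths assigned directly to USER, if any.
# Assumption: the attributes live in the fifth colon-separated field.
user_auths() {
    awk -F: -v user="$1" '
        $1 == user {
            n = split($5, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^auths=/)
                    print substr(attrs[i], 7)   # drop the leading "auths="
        }
    '
}

# Possible use on a live system (not run here):
#   user_auths joeuser < /etc/user_attr
```

Note that this only shows what is written in the file. The auths command remains the way to see everything a user actually ends up with.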

Let's see all of this in action.
% auths
mysql.operator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable

% svcadm enable -t mysql
% svcs mysql
STATE          STIME    FMRI
online         15:51:02 svc:/application/mysql:default

So far so good.
% svcadm enable mysql
svcadm: svc:/application/mysql:default: Permission denied.

Why did this fail ?
% svcprop -p general mysql
general/enabled boolean true
general/action_authorization astring mysql.operator
general/entity_stability astring Unstable
general/single_instance boolean true
general/value_authorization astring mysql.administrator

Because enable also tries to set the general/enabled property - and that requires value or modify authorization. Change my user definition in /etc/user_attr
% grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator,mysql.administrator
% auths
mysql.operator,mysql.administrator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable

% svcadm enable mysql
% svcs mysql
STATE          STIME    FMRI
online         16:10:37 svc:/application/mysql:default

This is all very cool - but we can still do more.

Removing Root from the Equation

For both simplicity and compatibility with other operating systems, the MySQL service is started by a script that is run as root. This script is generally linked into /etc/rc3.d, but since we have converted it to an SMF service we have many more options. We have already looked at delegated administration using auths, time to turn our attention to privileges.
# /etc/sfw/mysql/mysql.server start
# ps -ef | grep mysqld | grep -v grep
   mysql  1975  1955   0 21:43:17 pts/8       0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --user=mysql --pid
    root  1955     1   0 21:43:17 pts/8       0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# /etc/sfw/mysql/mysql.server stop

This suggests two immediate questions. Does the parent mysqld_safe really have to run as root, or can it be started as a lesser privileged user ? If it can run as a non-root user, exactly what privileges are required to run mysql ?

The answer to the first question is simple: it can be run as a regular user. It only runs as root out of convenience to operating systems that don't have as sophisticated a security framework as Solaris.
#  su - mysql
Sun Microsystems Inc.   SunOS 5.11      snv_57  October 2007
$ sh /etc/sfw/mysql/mysql.server start
$ /usr/sfw/bin/mysqladmin status
Uptime: 1174  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.001
$ sh /etc/sfw/mysql/mysql.server stop
Killing mysqld with pid 1975
Wait for mysqld to exit done
$ exit
#
Now that we have established the fact that a fully privileged user isn't required to run MySQL, what privileges are really required ? How far can we restrict the mysql user ? Glenn Brunette's privilege debugger privdebug.pl is the perfect tool to help us answer this question.
# privdebug.pl -f -v  -e "su - mysql /usr/sfw/sbin/mysqld_safe --user=mysql"
STAT TIMESTAMP          PPID   PID    PRIV                 CMD
USED 2005619300419      2211   2212   proc_taskid          su
USED 2005620883559      2211   2212   proc_setid           su
USED 2005621147993      2211   2212   proc_setid           su
USED 2005621161490      2211   2212   proc_setid           su
USED 2005621165094      2211   2212   proc_setid           su
USED 2005630560973      2211   2212   proc_exec            su
Starting mysqld daemon with databases from /var/mysql                                  contract_event       
USED 2005679230394      2211   2212   proc_fork            sh
USED 2005750348321      2211   2212   proc_fork            sh
USED 2005751386190      2212   2214   proc_exec            sh
USED 2005756249415      2211   2212   proc_fork            sh
USED 2005757238096      2212   2215   proc_fork            sh
USED 2005758495289      2212   2215   proc_exec            sh
USED 2005761778059      2211   2212   proc_fork            sh
USED 2005762623018      2212   2217   proc_fork            sh
USED 2005763874569      2212   2217   proc_exec            sh
USED 2005767441408      2211   2212   proc_fork            sh
USED 2005768337263      2212   2219   proc_exec            sh
USED 2005772916576      2211   2212   proc_fork            sh
USED 2005773996432      2212   2220   proc_fork            sh
USED 2005775465400      2212   2220   proc_exec            sh
USED 2005778750305      2211   2212   proc_fork            sh
USED 2005779846375      2212   2222   proc_exec            sh
USED 2005782042348      2211   2212   proc_fork            sh
USED 2005783110622      2212   2223   proc_exec            sh
USED 2005785636236      2211   2212   proc_fork            sh
USED 2005786824801      2212   2224   proc_exec            sh
USED 2005788593079      2212   2224   proc_exec            nohup
USED 2005790693138      2212   2224   proc_exec            nohup
USED 2005792812264      2211   2212   proc_fork            sh
USED 2005794010658      2212   2225   proc_exec            sh
USED 2005795756145      2212   2225   proc_exec            nohup
USED 2005797704273      2212   2225   proc_exec            nohup
NEED 2005799674735      2211   2212   file_dac_write       sh
USED 2005800708905      2211   2212   proc_fork            sh
USED 2005801869396      2212   2226   proc_exec            sh
USED 2005804780370      2211   2212   proc_fork            sh
USED 2005805854317      2212   2227   proc_exec            sh
USED 2005807860051      2211   2212   proc_fork            sh
USED 2005808907677      2212   2228   proc_exec            sh
USED 2005811293197      2211   2212   proc_fork            sh
USED 2005812393916      2212   2229   proc_exec            sh
USED 2005814589669      2212   2229   proc_exec            nohup
USED 2005816674186      2212   2229   proc_exec            nohup
STOPPING server from pid file /var/mysql/pandora.pid                                  contract_event       
070325 22                    11  mysqld ended 18     contract_event       


Ignore the proc_taskid and proc_setid, they are artifacts of using su(1M) to run the database server as user mysql. We see that mysqld only needs proc_fork and proc_exec. The file_dac_write failure comes from a call to access(2) and is not needed for proper operation.
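
Reading that much output by eye gets old quickly. A small filter along the lines of the sketch below boils it down to the distinct privileges, skipping anything attributed to su. It assumes the six-column layout shown above (STAT, TIMESTAMP, PPID, PID, PRIV, CMD), and the function name summarize_privs is made up for this example.

```shell
# Reduce privdebug.pl output (stdin) to the distinct STAT/PRIV pairs,
# in order of first appearance, ignoring lines generated by su itself.
# Assumption: well-formed lines have exactly six whitespace-separated columns.
summarize_privs() {
    awk '
        ($1 == "USED" || $1 == "NEED") && NF == 6 && $6 != "su" {
            key = $1 " " $5
            if (!seen[key]++)
                print key
        }
    '
}

# Possible use on a live system (not run here):
#   privdebug.pl -f -v -e "..." | summarize_privs
```

The interleaved lines from the daemon's own output do not start with USED or NEED, so they fall through the filter untouched.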

What do we do with what we have just learned ?

Referring to the smf_method(5) man page (another excellent read), it seems that all we need to do is add a method_credential option to the various methods (start, stop, and refresh). The appropriate section of my new and improved MySQL manifest now looks like
        <exec_method   type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='60'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='120'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart'  timeout_seconds='120'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
   

So we quickly modify our manifest and import it using one of the standard methods (svccfg import, /lib/svc/method/manifest-import, or a reboot) and we should be done, right ? Well...... not exactly - but we're close.
% svcadm enable mysql
% svcs mysql
STATE          STIME    FMRI
maintenance    21:53:37 svc:/application/mysql:default

$ tail -5 `svcprop -p restarter/logfile mysql`
[ Mar 26 21:51:12 Method "stop" exited with status 0 ]
[ Mar 26 21:53:36 Enabled. ]
[ Mar 26 21:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
svc.startd could not set context for method: chdir: No such file or directory
[ Mar 26 21:53:37 Method "start" exited with status 96 ]

Doh! When we followed the MySQL installation instructions at /etc/sfw/mysql/README.solaris.mysql we created a user account called mysql. But we didn't specify a home directory, did we ? No - so the default template value of /home/mysql was used. But there is no /home/mysql, is there ? Well, no.

How do we fix this ?

Set a reasonable home directory for the mysql user. How about /var/mysql ? Elsewhere in the installation instructions we did set ownership and proper permissions to this directory - so that would seem like a reasonable home directory.

As root
# usermod -d /var/mysql mysql
That is one solution, but it may not be practical for all cases. Perhaps a better idea would be to provide a working directory for each of the methods. The benefit is that I could set it differently for each service instance. This would be done in the method_context tag for the method. So I modify my service manifest to look like
        <exec_method   type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='60'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='120'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart'  timeout_seconds='120'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
Reimport the manifest and let's see how things go.
# svccfg import /var/svc/manifest/application/mysql.xml
# svcadm clear mysql
# svcs mysql
STATE          STIME    FMRI
maintenance    22:17:49 svc:/application/mysql:default

Argh - now what ?
# tail -5 `svcprop -p restarter/logfile mysql`
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
[ Mar 26 22:17:49 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]

Doh! Since Solaris delivers MySQL as a legacy service the start script doesn't have execute permissions for the mysql user. That's easy to fix.
# ls -l /etc/sfw/mysql/mysql.server
-rwxr--r--   1 root     sys         5655 Mar 22 17:05 /etc/sfw/mysql/mysql.server
# chown mysql /etc/sfw/mysql/mysql.server
# svcadm clear mysql
# svcs mysql
STATE          STIME    FMRI
online         22:23:08 svc:/application/mysql:default
Now that's more like it. One last item to check.
# ps -ef | grep mysqld | grep -v grep
   mysql 12656 12634   0 22:23:11 ?           0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --pid-file=/var/my
   mysql 12634     1   0 22:23:09 ?           0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
   
# ppriv 12634
12634:  /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var
flags = 
        E: basic,!file_link_any,!proc_info,!proc_session
        I: basic,!file_link_any,!proc_info,!proc_session
        P: basic,!file_link_any,!proc_info,!proc_session
        L: all

Now that's what I wanted to see. The parent mysqld_safe is now running as user mysql and with exactly the right privileges. This is very cool indeed. Armed with this information we could also create a zone and use the limitpriv attribute to restrict the zone privilege - but we'll leave that for another day.
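Looking back, both trips into the maintenance state (the missing working directory and the start script that could not be executed) could have been caught before touching SMF. A minimal sketch of such a check follows. The function name preflight is hypothetical, and it only tests what the current user can see, so it has to be run as the user named in method_credential for the answer to mean anything.

```shell
# Check the two things that tripped us up above:
#   $1 - the working directory the method will chdir to
#   $2 - the method script that svc.startd will try to execute
# Returns 0 if both look fine for the current user, 1 otherwise.
preflight() {
    workdir=$1
    method=$2
    status=0
    if [ ! -d "$workdir" ]; then
        echo "working directory $workdir does not exist"
        status=1
    fi
    if [ ! -x "$method" ]; then
        echo "method script $method is not executable by this user"
        status=1
    fi
    return $status
}
```

For the manifest above that would be preflight /var/mysql /etc/sfw/mysql/mysql.server.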

Conclusion

It is quite easy not only to leverage Solaris authorizations but also to run services with restricted privileges. We have presented a few templates and a general approach that should make this process less cumbersome.

More important though - we now have a compelling reply when asked "what was wrong with the RC scripts the way they were?"


Mar 26 2007, 08:13:24 PM CDT Permalink Comments [4]

20070320 Tuesday March 20, 2007
Zones in a Flash - Literally
Fantastic improvements have been made in the Solaris installation and upgrade process - even more in OpenSolaris (available in the various community releases). As we examined the cloning feature introduced in Solaris 10 11/06, it became apparent that we have stumbled upon a most intriguing capability. When combining zone cloning with the attach/detach capability we have discovered a model for flashing zones: zoneflash.

In a recent boot camp we took a look at this in more detail. Unfortunately the slides (which will be posted soon) didn't quite follow the level of depth we were exploring. Several people asked for notes on how this works - and here they are. The irony is that it will take longer to read about it than it does to perform the actual process - but it is so cool.

The Promise

We start with a fresh Solaris system. In this case just live upgraded from media, but it could have been jumpstarted from media or a flash archive. The key point here is that the system has had very little done to it, other than naming and some software installation. Since zone attach makes sure that key system components (specifically packages and patches) are compatible, it makes sense to build our flashzones on a system that will look similar to those that will be built in the future.

So how many zones will we build ? That's a good question. If this were system flasharchives the answer would be as few as possible - one per architecture in the most efficient case. But these zoneflashes are different - just applications, some metadata, and perhaps some customizations (naming, security, SMF). It seems reasonable to create one zoneflash for each type of application server you would deploy - think of it as a userspace template. In this example I have chosen four: a blank uncustomized flash (for building a new zoneflash in a flash), database server (MySQL), web server (apache2), and the community edition of webmin (just another application).

Our procedure will be to build a minimal default zoneflash, run it through first boot to populate the SMF repository, and then clone it for the remaining zoneflashes. Each of these will be booted, customized for the particular application, and tested to make sure everything is operating properly.

We will then detach the zones and move the detached zoneroots onto some media that can be transported. Of course, keeping with the theme of zones and flash, the transport could be the flasharchive itself. How cool would it be to jumpstart a server using flasharchives and have all the application zones already present in a known location, such as /zoneflash ? Unfortunately, I'm sitting in seat 18A on an American Airlines flight to Los Angeles and don't quite have the required infrastructure to do that sort of test. But I do have a USB stick and multiple boot environments. That will do nicely.

Once attached, we will clone the zoneflashes as necessary, adding resources (network, local filesystems) and attributes (resource controls) required for the proper operation of the application. When finished we will detach the zoneflashes so they may be used elsewhere.

The Turn

The first step is to build and boot a simple generic sparse root zone. Since this zone isn't really meant for operation, most zonecfg attributes (network configuration, resource limits, et al) will be skipped. We will add them later when we build the real zones - remember, these are just user space application templates.

# zonecfg -z default
default: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:default> create
zonecfg:default> set zonepath=/local/default
zonecfg:default> add inherit-pkg-dir; set dir=/opt; end
zonecfg:default> commit
zonecfg:default> exit
#
# zoneadm -z default install
A few minutes later we have an installed zone, ready for first boot. Since I've attended my Solaris Zones Best Practices class, or at least read the materials, I know how to build a sysidcfg file that will satisfy the sysidtool first boot service. This will allow the zone to boot up all the way without any additional console interaction. Let's do that for our new zone. For root_password, supply your own encrypted string from /etc/shadow - I'm not going to post mine!
# cat > /local/default/root/etc/sysidcfg <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxx
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=default}
EOF
# zoneadm -z default boot
# zlogin -C default 
We need to let first boot processing complete. Since we supplied a valid sysidcfg, it is just a matter of waiting for manifest-import and sysidtool to complete their magic. When complete, log in and take a look around to make sure all is well. Once satisfied, shut down the zone (either from inside the zone or from the global zone) - we are through with it for now.
(from the global zone)
# zoneadm -z default halt
Now we are done with this first zone. Time to clone it for our remaining application zones. Please pardon a bit of inline shell scripting - I hate to type the same thing over and over and over. Sort of makes for a nice script template, doesn't it ? Not quite the sophistication of Brad Diggs' zonemanager, but it will do nicely for our example.

# for zone in webmin mysql web
do
        echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
        zoneadm -z ${zone} clone default
        echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
        echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
        echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
        echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
        echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
        echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
        echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
        echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
        zoneadm -z ${zone} boot
done
#
What in the heck was that all about ? OK, one more time - line by line with annotation.

# for zone in webmin mysql web
do

A quick interactive loop for the creation of three application zones. The variable ${zone} will be set to the name of the zone we are trying to construct.
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
A one liner that creates a new zone configuration based on the already existing default. At this point the only thing we need to change is the zonepath, and it should be set to /local/${zone}.
        zoneadm -z ${zone} clone default
We recognize this as a zone cloning operation. The zone root is copied and a /reconfigure is created in the new zone root so that sysidtool performs a complete configuration on first boot. If you happen to be running on a recent release of OpenSolaris, you can put your zoneroot on ZFS and the cloning operation will only take a few seconds and very little additional disk space will be required. Those of us on Solaris 10 11/06 will have to wait for the 160MB or so to be copied. Still better than the 9 minutes to go through a complete zone installation.
        echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
        echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
        echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
        echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
        echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
        echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
        echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
        echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
This step creates a custom sysidcfg file for each zone. Remember to supply your own root password from /etc/shadow in the global zone. This answers all of the sysidtool questions, including the NFSv4 question.
        zoneadm -z ${zone} boot
Boot the zone. If we have done everything correctly, the next interaction will be with console login.

done
Close the for loop in the interactive script. This process will take a few minutes on Solaris 10 11/06, or if we are being clever with OpenSolaris and ZFS - a few seconds.
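Those eight echo lines are begging to be a function. A minimal sketch, assuming the same /local/${zone} layout used in the loop - the function name write_sysidcfg and the ZONEBASE and ROOTPW variables are my own hypothetical additions, not anything Solaris provides.

```shell
# Write a sysidcfg for one zone so that sysidtool asks no questions.
#   $1        - zone name (also used as the hostname)
#   ZONEBASE  - directory holding the zonepaths (assumed default: /local)
#   ROOTPW    - encrypted root password string from /etc/shadow
#               (the default below is a placeholder, supply your own)
write_sysidcfg() {
    zone=$1
    base=${ZONEBASE:-/local}
    cat > "${base}/${zone}/root/etc/sysidcfg" <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=${ROOTPW:-xxxxxxxxxxx}
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=${zone}}
EOF
}
```

With that defined, the body of the loop shrinks to four lines: the zonecfg one liner, zoneadm clone, write_sysidcfg ${zone}, and zoneadm boot.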

Now for the hard part - customizing the individual application zones. Well, it's not all that difficult. And if you do this regularly, you probably have scripts to do most of the work. It's just individual application installation and customization.

Here is what I did for my example zones.
MySQL
The installation instructions for the Solaris 10 MySQL can be found in /etc/sfw/mysql/README.solaris.mysql. There is a typo in the Solaris 10 version of the README. It will cause a lot of grief if you cut and paste without looking at the results. Fortunately it has been corrected in nevada (aka OpenSolaris Community Edition).

Boot the mysql zone and log in as root.
# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql       (this line is incorrect in the Solaris 10 README - my chmod works better with two arguments)
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf
The installation instructions continue by linking the start script into /etc/rc3.d. Since we are big SMF fans in these parts, let's do that instead. Feel free to use my MySQL manifest as it contains a couple of cool features (value and action authorizations - more on that later).

Since the mysql zone doesn't have any networking configured, perform this next step from the global zone. If you already have a suitable manifest, or have stashed mine away somewhere in the global zone you can use that instead.
# cd /local/mysql/root/var/svc/manifest/application
# wget http://blogs.sun.com/bobn/resource/mysql.xml
It's probably a good idea to make sure that all of this is working properly. Either reboot the mysql zone, run the manifest-import service manually, or run svccfg import on the new manifest. Your choice. What you should see upon completion is
# svcs mysql
STATE          STIME    FMRI
online         14:41:19 svc:/application/mysql:default

# /usr/sfw/bin/mysqladmin status
Uptime: 459  Threads: 1  Questions: 2  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.004

We're done for now. Unless of course you want to go for some extra credit. In that case
  1. Set up a web server with PHP support. Apache 1 plus the SFWmphp package from the Solaris Companion will do just fine.
  2. Download and unpack phpMyAdmin in the webserver htdocs directory.
  3. Create a user with the mysql.operator authorization
  4. Create a user with the mysql.administrator authorization

Shut down the mysql zone.
Web
This is about as easy as it gets. Boot the web zone and perform the following steps.
# cp /etc/apache2/httpd.conf-example /etc/apache2/httpd.conf
# svcadm enable apache2

A quick check to make sure all is well.
# svcs apache2
STATE          STIME    FMRI
online         17:17:41 svc:/network/http:apache2


# telnet localhost 80
Trying ::1...
telnet: connect to address ::1: Network is unreachable
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>

Connection to localhost closed by foreign host.
We're done for now. Shut down the web zone.
Webmin
This one is a little more complicated. We covered it last time in the zone cloning discussion, but it is worth a second look.

Our task here is to replace the Solaris webmin with the latest download from http://webmin.com. The technique we are using will allow us to install a custom version of an application into a sparse root zone. Specifically, webmin.com's package installs into /opt/webmin, but /opt is a read-only inherit-pkg-dir. The easiest solution for this would be the creation of a symbolic link in the global zone /opt to point to a location that can be safely written by each non-global zone. In my example that would be /local-pkgs.

In the global zone, create the link in /opt, create the local package directory in the webmin zoneroot, and download the latest webmin package.
# ln -s ../local-pkgs/webmin /opt/webmin
# mkdir -p /local/webmin/root/local-pkgs/webmin
# cd /local/webmin/root/var/tmp
# wget http://prdownloads.sourceforge.net/webadmin/webmin-1.330.pkg.gz
# gunzip webmin-1.330.pkg.gz

Now boot the webmin zone and log in as root.
# zoneadm -z webmin boot
# zlogin webmin
Remove the Solaris webmin packages (SUNWwebminu SUNWwebminr). The usr package needs to be removed twice - the first pkgrm will leave it as a partially installed package, the second will completely remove it - at least as far as our zone (and future patching) is concerned. Once removed, install the webmin.com version, which should be conveniently located in /var/tmp.
# pkgrm SUNWwebminu SUNWwebminr SUNWwebminu
# pkgadd -d /var/tmp/webmin-1.330.pkg
We are done with this zone. Shut it down.
Detach
We have just built four zones: an empty zone suitable for future customizations, one with the Solaris webmin replaced by the community edition, one with a working MySQL database, and one with a webserver. The last task to be performed on these zones in their current state is to detach them, using another new feature in Solaris 10 11/06. Zone detach copies the zone configuration into the zoneroot (to be used with a subsequent zone attach) and sets the current zone state to configured. You can even delete the zone configurations as a final cleanup prior to building a flash archive.
# zoneadm -z default detach
# zoneadm -z webmin detach
# zoneadm -z mysql detach
# zoneadm -z web detach
# zonecfg -z default delete -F
# zonecfg -z webmin delete -F
# zonecfg -z mysql delete -F
# zonecfg -z web delete -F

And flash
Unless the person in 18B wants to be a jumpstart server, we will have to simulate the jumpstart/flasharchive process. We can do this by booting into an alternate boot environment and then delivering the detached zoneroots by some sort of shared or removable storage - something like a USB memory stick. When we are done with this exercise, our zoneflashes will still be on the memory device, ready for their next use. Since the zones will never be booted, just cloned, the speed of the memory device really isn't important.

We need to prepare the USB memory stick (currently formatted as FAT16). We will use rmformat -l to locate the device, fdisk to put a proper label on it, and finally newfs to install a proper file system. ZFS would be interesting, but it would just get in our way later.
# rmformat -l
Looking for devices...
     1. Logical Node: /dev/rdsk/c2t0d0p0
        Physical Node: /pci@0,0/pci1179,1@1d,7/storage@4/disk@0,0
        Connected Device:          USB DISK 2.0     PMAP
        Device Type: Removable
        Bus: USB
        Size: 984.0 MB
        Label: 
        Access permissions: 
     2. Logical Node: /dev/rdsk/c1t0d0p0
        Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
        Connected Device: TEAC     DW-224E-A        7.2A
        Device Type: CD Reader
        Bus: IDE
        Size: 
        Label: 
        Access permissions: 

# fdisk /dev/rdsk/c2t0d0p0
3 (to delete the existing partition)
1 (to create a new Solaris partition)
5 (to exit and write the new label)

# newfs /dev/rdsk/c2t0d0s2
newfs: construct a new file system /dev/rdsk/c2t0d0s2: (y/n)? y
/dev/rdsk/c2t0d0s2:     2009088 sectors in 981 cylinders of 64 tracks, 32 sectors
        981.0MB in 62 cyl groups (16 c/g, 16.00MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
 1705632, 1738432, 1771232, 1804032, 1836832, 1869632, 1902432, 1935232,
 1968032, 2000832
 
# mkdir /tmp/flash
# mount /dev/dsk/c2t0d0s2 /tmp/flash
# cd /local
# find default webmin web mysql -print | cpio -pdum /tmp/flash
# umount /tmp/flash
We are now done with the original system. At this point we would create a flasharchive (with the detached zoneroots in a convenient place in the archive).
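One thing worth checking before trusting the copy (while the stick is still mounted): each zoneflash needs both its zone root and the configuration that zone detach stored alongside it, or the later attach has nothing to work with. A minimal sketch follows. The function name check_zoneflash is hypothetical, and the file name SUNWdetached.xml is my assumption about what detach writes into the zonepath on this release - look in your own detached zonepath and adjust if it differs.

```shell
# Verify that every zoneflash under a directory looks complete.
#   $1      - directory holding the copied zonepaths (e.g. /tmp/flash)
#   $2...   - zone names to check
# Assumption: detach leaves the stored configuration in SUNWdetached.xml
# at the top of each zonepath, next to the root directory.
check_zoneflash() {
    base=$1
    shift
    status=0
    for zone in "$@"; do
        if [ ! -f "${base}/${zone}/SUNWdetached.xml" ]; then
            echo "${zone}: no stored configuration"
            status=1
        fi
        if [ ! -d "${base}/${zone}/root" ]; then
            echo "${zone}: no zone root"
            status=1
        fi
    done
    return $status
}
```

For the four zones built here that would be check_zoneflash /tmp/flash default webmin web mysql.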

The Prestige

The final act in our magic trick is the delivery. Specifically the transport, reattachment, and subsequent cloning of the zoneflashes on a new system. 18B is now asleep and I really don't want to disturb him, so I'll do this part myself. I'll boot my laptop into another boot environment - built from the same media using the same Live Upgrade method as the boot environment that created the zones.

We begin by mounting the removable media (USB memory stick) that contains the zoneflash. Do take a look around, it is quite likely that our friend volfs has already done this for us. Remember - if we were using a flasharchive to deliver the zoneflash this step would be unnecessary.
# mkdir /flash
# mount /dev/dsk/c2t0d0s2 /flash        (we used rmformat -l to derive the device name)
Now that our zoneflashes have arrived, time to reattach them. The first step is to create zone configurations. If you recall, these were stored in the zoneroot when they were detached. The zonecfg command create -a is used to retrieve the stored configuration information and adapt it to the new system - specifically the new location of the zoneroot. Once configured we use zoneadm attach to reconnect them.

The sequence to reattach our default zone, now called flashdefault, would look something like this.
# zonecfg -z flashdefault
flashdefault: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:flashdefault> create -a /flash/default
zonecfg:flashdefault> commit
zonecfg:flashdefault> exit
# zoneadm -z flashdefault attach
We'll be a little more clever attaching the other three zones.
# for zone in webmin web mysql
  do
      echo "create -a /flash/${zone}" | zonecfg -z flash${zone}
      zoneadm -z flash${zone} attach
  done
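If you would rather see what is about to happen before pointing zonecfg at a new system, the same loop can be wrapped in a function with a dry-run switch. This is only a sketch: the function name attach_flashes and the DRYRUN variable are hypothetical, and it passes the create -a subcommand to zonecfg as an argument (the same form used for delete -F earlier) rather than piping it in.

```shell
# Recreate and attach one flash zone per name given.
#   $1      - directory holding the detached zonepaths (e.g. /flash)
#   $2...   - zone names; each becomes a zone called flash<name>
# Set DRYRUN to any non-empty value to print the commands instead of
# running them.
attach_flashes() {
    src=$1
    shift
    for zone in "$@"; do
        ${DRYRUN:+echo} zonecfg -z "flash${zone}" "create -a ${src}/${zone}"
        ${DRYRUN:+echo} zoneadm -z "flash${zone}" attach
    done
}
```

Try DRYRUN=1 first, then unset it and run attach_flashes /flash webmin web mysql for real.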
At this point our zoneroots are still on the USB memory device - but don't worry, these zones will never be booted. Their only purpose is to deliver preconfigured zones. We will use zone cloning to create our real application zones.

Which we will now do. It is very convenient to use the flashzone as a template for our new zone in case there were some special attributes like limitpriv that we might want to preserve. We will also need to add items that were not present in the zoneflashes - specifically networking and local file systems. Once we are satisfied with the zone configurations we will clone the zoneflash. If we are only building one of each type of zone we can detach the zoneflash so that other administrators can use it on their systems.

Let's do this for the mysql zone.
# zonecfg -z mysql
mysql: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:mysql> create -t flashmysql
zonecfg:mysql> set zonepath=/zones/mysql
zonecfg:mysql> add net; set physical=e1000g0; set address=192.168.100.102/24; end
zonecfg:mysql> add fs; set dir=/export; set special=/export; set options=[rw,nosuid,nodevices]; set type=lofs; end
zonecfg:mysql> commit
zonecfg:mysql> exit

# zoneadm -z mysql clone flashmysql
Copying /flash/mysql...

# zoneadm -z flashmysql detach

# echo "name_service=NONE" >    /zones/mysql/root/etc/sysidcfg
# echo "nfs4_domain=dynamic" >> /zones/mysql/root/etc/sysidcfg
# echo "security_policy=NONE" >> /zones/mysql/root/etc/sysidcfg
# echo "root_password=xxxxxxxxxxx" >> /zones/mysql/root/etc/sysidcfg
# echo "system_locale=C" >> /zones/mysql/root/etc/sysidcfg
# echo "network_interface=NONE {hostname=mysql}" >> /zones/mysql/root/etc/sysidcfg
# echo "terminal=ansi" >> /zones/mysql/root/etc/sysidcfg
# echo "timezone=US/Central" >> /zones/mysql/root/etc/sysidcfg
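If there are several application zones to build from the same zoneflash, the zonecfg session can be captured as a command file and fed to zonecfg -f. A minimal sketch, assuming the /zones/<name> layout and flash<name> template naming used above - the function name zone_commands is hypothetical, and the lofs file system from the interactive session is left out to keep it short.

```shell
# Print zonecfg commands that build a real zone from its flash template.
#   $1 - zone name (template is assumed to be flash<name>)
#   $2 - physical network interface
#   $3 - IP address with prefix length
zone_commands() {
    zone=$1
    nic=$2
    addr=$3
    cat <<EOF
create -t flash${zone}
set zonepath=/zones/${zone}
add net
set physical=${nic}
set address=${addr}
end
commit
EOF
}
```

Something like zone_commands mysql e1000g0 192.168.100.102/24 > /tmp/mysql.cfg followed by zonecfg -z mysql -f /tmp/mysql.cfg should leave you at the same point as the session above, ready for zoneadm clone.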

And for the finale - boot the newly flashed mysql zone and you should see an enabled and operating mysql service.
# zoneadm -z mysql boot
# zlogin -C mysql
[Connected to zone 'mysql' console]
Hostname: mysql
Creating new rsa public/private host key pair                           
Creating new dsa public/private host key pair
Mar 20 06:15:44 mysql sendmail[1719]: My unqualified host name (mysql) unknown; sleeping for retry
Mar 20 06:15:44 mysql sendmail[1722]: My unqualified host name (mysql) unknown; sleeping for retry

mysql console login: root
Password: 
Last login: Mon Mar 19 17:10:10 on console
Mar 20 06:15:49 mysql login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_57  October 2007
# 
# svcs mysql
STATE          STIME    FMRI
online          6:31:28 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 8  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.125

How cool is that ? Not only did we clone the zone, but since the database is in /var, it was cloned as well. Perhaps not practical for every situation, but still pretty cool.

I will leave the flashing of defau