
Thursday March 13, 2008
Question of the Day - March 13, 2008
When a girl floats, why do they still call it buoyancy ?

Wednesday March 12, 2008
It was 30 years ago today......
It is hard to believe, but it was 30 years ago today (March 12, 1978) when the Rock in Opposition (RIO) festival was held
at the New London Theatre. Hosted by Henry Cow, the lineup included
- Henry Cow (England)
- Stormy Six (Italy)
- Samla Mammas Manna (Sweden)
- Univers Zero (Belgium)
- Etron Fou Leloublan (France)
Later festivals would include Art Zoyd (France), Aksak Maboul (Belgium), and Art Bears (replacing the
then defunct Henry Cow), but it was this one chance meeting of the five original bands that would change
the face of avant/progressive rock forever.
Samla Mammas Manna and Univers Zero are still active bands
and have been known to headline various progressive rock festivals around the world. Etron Fou Leloublan is long gone, but its spirit
continues with Volapük.

Monday February 18, 2008
ZFS and FMA - Two great tastes .....
Our good friend Isaac Rozenfeld talks about the Multiplicity of
Solaris. When talking about Solaris I will use the phrase "The Vastness of Solaris".
If you have attended a Solaris Boot Camp or Tech Day in the last few years you get
an idea of what we are talking about - when we go on about Solaris hour after hour
after hour.
But the key point in Isaac's multiplicity discussion is how the cornucopia of
Solaris features work together to do some pretty spectacular (and competitively
differentiating) things. In the past we've looked at combinations such as
ZFS and Zones, or Service Management, Role Based Access Control (RBAC) and
Least Privilege. Based on a conversation last week in St. Louis, let's consider
how ZFS and Solaris Fault Management (FMA) play together.
Preparation
Let's begin by creating some fake devices that we can play with. I don't have enough disks
on this particular system, but I'm not going to let that slow me down. If you have sufficient
real hot swappable disks, feel free to use them instead.
# mkfile 1g /dev/dsk/disk1
# mkfile 1g /dev/dsk/disk2
# mkfile 512m /dev/dsk/disk3
# mkfile 512m /dev/dsk/disk4
# mkfile 1g /dev/dsk/spare1
Now let's create a couple of zpools using the fake devices.
pool1 will be a 1GB mirrored pool using disk1 and disk2.
pool2 will be a 512MB mirrored pool using disk3 and disk4.
Device spare1 will serve as a hot spare for both pools in case of a problem -
which we are about to inflict upon the pools.
# zpool create pool1 mirror disk1 disk2 spare spare1
# zpool create pool2 mirror disk3 disk4 spare spare1
# zpool status
pool: pool1
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
pool: pool2
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk3 ONLINE 0 0 0
disk4 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
So far so good. If we were to run a scrub on either pool, it would complete almost immediately.
Remember that unlike hardware RAID disk replacement, ZFS scrubbing and resilvering only
touch blocks that contain actual data. Since there is no data in these pools (yet),
there is little for the scrubbing process to do.
# zpool scrub pool1
# zpool scrub pool2
# zpool status
pool: pool1
state: ONLINE
scrub: scrub completed with 0 errors on Mon Feb 18 09:24:16 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
pool: pool2
state: ONLINE
scrub: scrub completed with 0 errors on Mon Feb 18 09:24:17 2008
config:
NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk3 ONLINE 0 0 0
disk4 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
Let's populate both pools with some data. I happen to have a directory of
scenic images that I use as screen backgrounds - that will work nicely.
# cd /export/pub/pix
# find scenic -print | cpio -pdum /pool1
# find scenic -print | cpio -pdum /pool2
# df -k | grep pool
pool1 1007616 248925 758539 25% /pool1
pool2 483328 248921 234204 52% /pool2
And yes, cp -r would have been just as good.
Problem 1: Simple data corruption
Time to inflict some harm upon the pool. First, some simple corruption.
Writing some zeros over half of the mirror should do quite nicely.
# dd if=/dev/zero of=/dev/dsk/disk1 bs=8192 count=10000 conv=notrunc
10000+0 records in
10000+0 records out
At this point we are unaware that anything has happened to our data. So let's
try accessing some of the data to see if we can observe ZFS self healing in action.
If your system has plenty of memory and is relatively idle, accessing the data may
not be sufficient. If you still end up with no errors after the cpio, try a
zpool scrub - that will catch all errors in the data.
# cd /pool1
# find . -print | cpio -ov > /dev/null
416027 blocks
Let's ask our friend fmstat(1m) if anything is wrong.
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.1 0 0 0 0 0 0
disk-transport 0 0 0.0 366.5 0 0 0 0 32b 0
eft 0 0 0.0 2.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 1 0 0.0 0.2 0 0 0 0 0 0
io-retire 0 0 0.0 1.1 0 0 0 0 0 0
snmp-trapgen 1 0 0.0 16.0 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 620.3 0 0 0 0 0 0
syslog-msgs 1 0 0.0 9.7 0 0 0 0 0 0
zfs-diagnosis 162 162 0.0 1.5 0 0 1 0 168b 140b
zfs-retire 1 1 0.0 112.3 0 0 0 0 0 0
As the guys in the Guinness commercial say, "Brilliant!" The important thing to note
here is that the zfs-diagnosis engine has run several times indicating that there is
a problem somewhere in one of my pools. I'm also running this on Nevada so the
zfs-retire engine has also run, kicking in a hot spare due to excessive errors.
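If you would rather not eyeball the whole fmstat table, a small filter can pull out just the ZFS modules that have seen events. This is a minimal sketch, not part of the original session: the function name zfs_fm_activity is made up for illustration, and it assumes the column layout shown above (module name first, ev_recv second).

```shell
#!/bin/sh
# Hypothetical helper: reads fmstat output on stdin and prints the name and
# ev_recv count of every fmd module whose name starts with "zfs-" and that
# has received at least one event.
zfs_fm_activity() {
    awk '$1 ~ /^zfs-/ && $2 + 0 > 0 { print $1, $2 }'
}

# On a live Solaris system this would be used as:
#   fmstat | zfs_fm_activity
```

Any output at all means one of the ZFS engines has something to say and it is time to look at fmadm faulty.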
So which pool is having the problems ? We continue our FMA investigation
to find out.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH Major
Fault class : fault.fs.zfs.vdev.checksum
Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://sun.com/msg/ZFS-8000-GH for more information.
Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
# zpool status -x
pool: pool1
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress, 44.83% done, 0h0m to go
config:
NAME STATE READ WRITE CKSUM
pool1 DEGRADED 0 0 0
mirror DEGRADED 0 0 0
spare DEGRADED 0 0 0
disk1 DEGRADED 0 0 162 too many errors
spare1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 INUSE currently in use
errors: No known data errors
This tells us all that we need to know. The device disk1 was found to have
quite a few checksum errors - so many in fact that it was replaced automatically
by a hot spare. The spare is resilvering and a full complement of data replicas
will be available soon. The entire process was automatic and completely observable.
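The same information can be picked out of zpool status mechanically. Here is a rough sketch, assuming the config section looks like the output above (a NAME STATE READ WRITE CKSUM header, one vdev per line, then the spares list). The function name vdevs_with_errors is my own invention, not a ZFS command.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "name read write cksum" for every vdev whose error counters are not all zero.
vdevs_with_errors() {
    awk '
        # The header row marks the start of the per-vdev counters.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The spares list and the errors line have no counters.
        $1 == "spares" || $1 == "errors:" { in_config = 0 }
        # Counters are plain numbers or abbreviated values such as 2.60K.
        in_config && NF >= 5 &&
        $3 ~ /^[0-9.]+[KMGT]?$/ &&
        $4 ~ /^[0-9.]+[KMGT]?$/ &&
        $5 ~ /^[0-9.]+[KMGT]?$/ &&
        ($3 != "0" || $4 != "0" || $5 != "0") { print $1, $3, $4, $5 }
    '
}

# On a live system this would be used as:
#   zpool status pool1 | vdevs_with_errors
```

For the pool above this would report only disk1 with its 162 checksum errors.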
Since we inflicted harm upon the (fake) disk device ourselves, we know that it is in fact quite
healthy. So we can restore our pool to its original configuration rather simply - by detaching
the spare and clearing the error. We should also clear the FMA counters and repair the
ZFS vdev so that we can tell if anything else is misbehaving in either this or another pool.
# zpool detach pool1 spare1
# zpool clear pool1
# zpool status pool1
pool: pool1
state: ONLINE
scrub: resilver completed with 0 errors on Mon Feb 18 10:25:26 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
# fmadm reset zfs-diagnosis
# fmadm reset zfs-retire
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 223.5 0 0 0 0 32b 0
eft 1 0 0.0 4.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 4 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 1.1 0 0 0 0 0 0
snmp-trapgen 4 0 0.0 8.8 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 372.7 0 0 0 0 0 0
syslog-msgs 4 0 0.0 5.4 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 1.4 0 0 0 0 0 0
zfs-retire 0 0 0.0 0.0 0 0 0 0 0 0
# fmdump -v -u d82d1716-c920-6243-e899-b7ddd386902e
TIME UUID SUNW-MSG-ID
Feb 18 09:51:49.3025 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
100% fault.fs.zfs.vdev.checksum
Problem in:
Affects: zfs://pool=pool1/vdev=449a3328bc444732
FRU: -
Location: -
# fmadm repair zfs://pool=pool1/vdev=449a3328bc444732
fmadm: recorded repair to zfs://pool=pool1/vdev=449a3328bc444732
# fmadm faulty
Problem 2: Device failure
Time to do a little more harm. In this case I will simulate the failure of
a device by removing the fake device. Again we will access the pool and then
consult fmstat to see what is happening (are you noticing a pattern here????).
# rm -f /dev/dsk/disk2
# cd /pool1
# find . -print | cpio -oc > /dev/null
416027 blocks
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 214.2 0 0 0 0 32b 0
eft 1 0 0.0 4.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 4 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 1.1 0 0 0 0 0 0
snmp-trapgen 4 0 0.0 8.8 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 372.7 0 0 0 0 0 0
syslog-msgs 4 0 0.0 5.4 0 0 0 0 0 0
zfs-diagnosis 0 0 0.0 1.4 0 0 0 0 0 0
zfs-retire 0 0 0.0 0.0 0 0 0 0 0 0
Rats, the find ran totally out of cache from the last example. As before, should
this happen, proceed directly to zpool scrub.
# zpool scrub pool1
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 190.5 0 0 0 0 32b 0
eft 1 0 0.0 4.1 0 0 0 0 1.4M 0
fmd-self-diagnosis 5 0 0.0 0.5 0 0 0 0 0 0
io-retire 1 0 0.0 1.0 0 0 0 0 0 0
snmp-trapgen 6 0 0.0 7.4 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 329.0 0 0 0 0 0 0
syslog-msgs 6 0 0.0 4.6 0 0 0 0 0 0
zfs-diagnosis 16 1 0.0 70.3 0 0 1 1 168b 140b
zfs-retire 1 0 0.0 509.8 0 0 0 0 0 0
Again, hot sparing has kicked in automatically. The evidence of this is the
zfs-retire engine running.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 18 11:07:29 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3 Major
Feb 18 11:16:43 06bfe323-2570-46e8-f1a2-e00d8970ed0d
Fault class : fault.fs.zfs.device
Description : A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for
more information.
Response : No automated response will occur.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
# zpool status -x
pool: pool1
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: resilver in progress, 4.94% done, 0h0m to go
config:
NAME STATE READ WRITE CKSUM
pool1 DEGRADED 0 0 0
mirror DEGRADED 0 0 0
disk1 ONLINE 0 0 0
spare DEGRADED 0 0 0
disk2 UNAVAIL 0 0 0 cannot open
spare1 ONLINE 0 0 0
spares
spare1 INUSE currently in use
errors: No known data errors
As before, this tells us all that we need to know. A device (disk2) has failed and
is no longer in operation. Sufficient spares existed and one was automatically
attached to the damaged pool. Resilvering completed successfully and the data is
once again fully mirrored.
But here's the magic. Let's repair the device - again simulated with our fake
device.
# mkfile 1g /dev/dsk/disk2
# zpool replace pool1 disk2
# zpool status pool1
pool: pool1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 4.86% done, 0h1m to go
config:
NAME STATE READ WRITE CKSUM
pool1 DEGRADED 0 0 0
mirror DEGRADED 0 0 0
disk1 ONLINE 0 0 0
spare DEGRADED 0 0 0
replacing DEGRADED 0 0 0
disk2/old UNAVAIL 0 0 0 cannot open
disk2 ONLINE 0 0 0
spare1 ONLINE 0 0 0
spares
spare1 INUSE currently in use
errors: No known data errors
Get a cup of coffee while the resilvering process runs.
# zpool status
pool: pool1
state: ONLINE
scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 AVAIL
# fmadm faulty
Notice the nice integration with FMA. Not only was the new device resilvered, but
the hot spare was detached and the FMA fault was cleared. The fmstat counters still
show that there was a problem and the fault report still exists in the fault log for later
interrogation.
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 171.5 0 0 0 0 32b 0
eft 1 0 0.0 3.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 6 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 0.9 0 0 0 0 0 0
snmp-trapgen 6 0 0.0 6.8 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 294.3 0 0 0 0 0 0
syslog-msgs 6 0 0.0 4.2 0 0 0 0 0 0
zfs-diagnosis 36 1 0.0 51.6 0 0 0 1 0 0
zfs-retire 1 0 0.0 170.0 0 0 0 0 0 0
# fmdump
TIME UUID SUNW-MSG-ID
Feb 16 11:38:16.0976 48935791-ff83-e622-fbe1-d54c20385afc ZFS-8000-GH
Feb 16 11:38:30.8519 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH
Feb 18 09:51:49.3025 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713 ZFS-8000-GH
Feb 18 09:56:24.8029 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
Feb 18 10:23:07.2228 7c04a6f7-d22a-e467-c44d-80810f27b711 ZFS-8000-GH
Feb 18 10:25:14.6429 faca0639-b82b-c8e8-c8d4-fc085bc03caa ZFS-8000-GH
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
Feb 18 11:16:44.2497 06bfe323-2570-46e8-f1a2-e00d8970ed0d ZFS-8000-D3
# fmdump -V -u 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
TIME UUID SUNW-MSG-ID
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
TIME CLASS ENA
Feb 18 11:07:27.8476 ereport.fs.zfs.vdev.open_failed 0xb22406c635500401
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
code = ZFS-8000-D3
diag-time = 1203354449 236999
de = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = fmd
authority = (embedded nvlist)
nvlist version: 0
version = 0x0
product-id = Dimension XPS
chassis-id = 7XQPV21
server-id = arrakis
(end authority)
mod-name = zfs-diagnosis
mod-version = 1.0
(end de)
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = fault.fs.zfs.device
certainty = 0x64
asru = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3a2ca6bebd96cfe3
vdev = 0xedef914b5d9eae8d
(end asru)
resource = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x3a2ca6bebd96cfe3
vdev = 0xedef914b5d9eae8d
(end resource)
(end fault-list[0])
fault-status = 0x3
__ttl = 0x1
__tod = 0x47b9bb51 0x1ef7b430
# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset
# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset
Problem 3: Unrecoverable corruption
Those of you who have attended one of my Boot Camps or Solaris Best Practices training classes know that
House is one of my favorite TV shows - the only one that I watch regularly. And this next example would make a perfect episode. Is it likely to happen ? No, but it is so cool when it does :-)
Remember our second pool, pool2. It has the same contents as pool1. Now, let's do the unthinkable - let's corrupt both halves of the mirror. Surely data loss will follow, but the fact that Solaris stays up and running and can report what happened is pretty spectacular. But it gets so much better than that.
# dd if=/dev/zero of=/dev/dsk/disk3 bs=8192 count=10000 conv=notrunc
# dd if=/dev/zero of=/dev/dsk/disk4 bs=8192 count=10000 conv=notrunc
# zpool scrub pool2
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-retire 0 0 0.0 0.5 0 0 0 0 0 0
disk-transport 0 0 0.0 166.0 0 0 0 0 32b 0
eft 1 0 0.0 3.6 0 0 0 0 1.4M 0
fmd-self-diagnosis 6 0 0.0 0.6 0 0 0 0 0 0
io-retire 1 0 0.0 0.9 0 0 0 0 0 0
snmp-trapgen 8 0 0.0 6.3 0 0 0 0 32b 0
sysevent-transport 0 0 0.0 294.3 0 0 0 0 0 0
syslog-msgs 8 0 0.0 3.9 0 0 0 0 0 0
zfs-diagnosis 1032 1028 0.6 39.7 0 0 93 2 15K 13K
zfs-retire 2 0 0.0 158.5 0 0 0 0 0 0
As before, lots of zfs-diagnosis activity. And two hits to zfs-retire. But we
only have one spare - this should be interesting. Let's see what is happening.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH Major
Feb 18 13:18:42 c3889bf1-8551-6956-acd4-914474093cd7
Fault class : fault.fs.zfs.vdev.checksum
Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://sun.com/msg/ZFS-8000-GH for more information.
Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 16 11:38:30 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH Major
Feb 18 09:51:49 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713
Feb 18 10:23:07 7c04a6f7-d22a-e467-c44d-80810f27b711
Feb 18 13:18:42 0a1bf156-6968-4956-d015-cc121a866790
Fault class : fault.fs.zfs.vdev.checksum
Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://sun.com/msg/ZFS-8000-GH for more information.
Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
# zpool status -x
pool: pool2
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:
NAME STATE READ WRITE CKSUM
pool2 DEGRADED 0 0 2.60K
mirror DEGRADED 0 0 2.60K
spare DEGRADED 0 0 2.43K
disk3 DEGRADED 0 0 5.19K too many errors
spare1 DEGRADED 0 0 2.43K too many errors
disk4 DEGRADED 0 0 5.19K too many errors
spares
spare1 INUSE currently in use
errors: 247 data errors, use '-v' for a list
So ZFS tried to bring in a hot spare, but there were insufficient replicas to
be able to reconstruct all of the data. But here is where it gets interesting.
Let's see what zpool status -v says about things.
# zpool status -v
pool: pool1
state: ONLINE
scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk1 ONLINE 0 0 0
disk2 ONLINE 0 0 0
spares
spare1 INUSE in use by pool 'pool2'
errors: No known data errors
pool: pool2
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:
NAME STATE READ WRITE CKSUM
pool2 DEGRADED 0 0 2.60K
mirror DEGRADED 0 0 2.60K
spare DEGRADED 0 0 2.43K
disk3 DEGRADED 0 0 5.19K too many errors
spare1 DEGRADED 0 0 2.43K too many errors
disk4 DEGRADED 0 0 5.19K too many errors
spares
spare1 INUSE currently in use
errors: Permanent errors have been detected in the following files:
/pool2/scenic/cider mill crowds.jpg
/pool2/scenic/Cleywindmill.jpg
/pool2/scenic/csg_Landscapes001_GrandTetonNationalPark,Wyoming.jpg
/pool2/scenic/csg_Landscapes002_ElowahFalls,Oregon.jpg
/pool2/scenic/csg_Landscapes003_MonoLake,California.jpg
/pool2/scenic/csg_Landscapes005_TurretArch,Utah.jpg
/pool2/scenic/csg_Landscapes004_Wildflowers_MountRainer,Washington.jpg
/pool2/scenic/csg_Landscapes!idx011.jpg
/pool2/scenic/csg_Landscapes127_GreatSmokeyMountains-NorthCarolina.jpg
/pool2/scenic/csg_Landscapes129_AcadiaNationalPark-Maine.jpg
/pool2/scenic/csg_Landscapes130_GettysburgNationalPark-Pennsylvania.jpg
/pool2/scenic/csg_Landscapes131_DeadHorseMill,CrystalRiver-Colorado.jpg
/pool2/scenic/csg_Landscapes132_GladeCreekGristmill,BabcockStatePark-WestVirginia.jpg
/pool2/scenic/csg_Landscapes133_BlackwaterFallsStatePark-WestVirginia.jpg
/pool2/scenic/csg_Landscapes134_GrandCanyonNationalPark-Arizona.jpg
/pool2/scenic/decisions decisions.jpg
/pool2/scenic/csg_Landscapes135_BigSur-California.jpg
/pool2/scenic/csg_Landscapes151_WataugaCounty-NorthCarolina.jpg
/pool2/scenic/csg_Landscapes150_LakeInTheMedicineBowMountains-Wyoming.jpg
/pool2/scenic/csg_Landscapes152_WinterPassage,PondMountain-Tennessee.jpg
/pool2/scenic/csg_Landscapes154_StormAftermath,OconeeCounty-Georgia.jpg
/pool2/scenic/Brig_Of_Dee.gif
/pool2/scenic/pvnature14.gif
/pool2/scenic/pvnature22.gif
/pool2/scenic/pvnature7.gif
/pool2/scenic/guadalupe.jpg
/pool2/scenic/ernst-tinaja.jpg
/pool2/scenic/pipes.gif
/pool2/scenic/boat.jpg
/pool2/scenic/pvhawaii.gif
/pool2/scenic/cribgoch.jpg
/pool2/scenic/sun1.gif
/pool2/scenic/sun1.jpg
/pool2/scenic/sun2.jpg
/pool2/scenic/andes.jpg
/pool2/scenic/treesky.gif
/pool2/scenic/sailboatm.gif
/pool2/scenic/Arizona1.jpg
/pool2/scenic/Arizona2.jpg
/pool2/scenic/Fence.jpg
/pool2/scenic/Rockwood.jpg
/pool2/scenic/sawtooth.jpg
/pool2/scenic/pvaptr04.gif
/pool2/scenic/pvaptr07.gif
/pool2/scenic/pvaptr11.gif
/pool2/scenic/pvntrr01.jpg
/pool2/scenic/Millport.jpg
/pool2/scenic/bryce2.jpg
/pool2/scenic/bryce3.jpg
/pool2/scenic/monument.jpg
/pool2/scenic/rainier1.gif
/pool2/scenic/arch.gif
/pool2/scenic/pv-anzab.gif
/pool2/scenic/pvnatr15.gif
/pool2/scenic/pvocean3.gif
/pool2/scenic/pvorngwv.gif
/pool2/scenic/pvrmp001.gif
/pool2/scenic/pvscen07.gif
/pool2/scenic/pvsltd04.gif
/pool2/scenic/banhall28600-04.JPG
/pool2/scenic/pvwlnd01.gif
/pool2/scenic/pvnature08.gif
/pool2/scenic/pvnature13.gif
/pool2/scenic/nokomis.jpg
/pool2/scenic/lighthouse1.gif
/pool2/scenic/lush.gif
/pool2/scenic/oldmill.gif
/pool2/scenic/gc1.jpg
/pool2/scenic/gc2.jpg
/pool2/scenic/canoe.gif
/pool2/scenic/Donaldson-River.jpg
/pool2/scenic/beach.gif
/pool2/scenic/janloop.jpg
/pool2/scenic/grobacro.jpg
/pool2/scenic/fnlgld.jpg
/pool2/scenic/bells.gif
/pool2/scenic/Eilean_Donan.gif
/pool2/scenic/Kilchurn_Castle.gif
/pool2/scenic/Plockton.gif
/pool2/scenic/Tantallon_Castle.gif
/pool2/scenic/SouthStockholm.jpg
/pool2/scenic/BlackRock_Cottage.jpg
/pool2/scenic/seward.jpg
/pool2/scenic/canadian_rockies_csg110_EmeraldBay.jpg
/pool2/scenic/canadian_rockies_csg111_RedRockCanyon.jpg
/pool2/scenic/canadian_rockies_csg112_WatertonNationalPark.jpg
/pool2/scenic/canadian_rockies_csg113_WatertonLakes.jpg
/pool2/scenic/canadian_rockies_csg114_PrinceOfWalesHotel.jpg
/pool2/scenic/canadian_rockies_csg116_CameronLake.jpg
/pool2/scenic/Castilla_Spain.jpg
/pool2/scenic/Central-Park-Walk.jpg
/pool2/scenic/CHANNEL.JPG
In my best Hugh Laurie voice trying to sound very Northeastern American, that is so cool! But we're not even
done yet. Let's take this list of files and restore them - in this case, from pool1. Operationally this
would be from a backup tape or nearline backup cache, but for our purposes, the contents in pool1 will
do nicely.
First, let's clear the zpool error counters and return the spare disk. We want to make sure
that our restore works as desired. Oh, and clear the FMA stats while we're at it.
# zpool clear pool2
# zpool detach pool2 spare1
# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset
# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset
Now individually restore the files that have errors in them and check again. You can even export and reimport
the pool and you will find a very nice, happy, and thoroughly error free ZFS pool. Some rather unpleasant gnashing of
zpool status -v output with awk has been omitted for sanity's sake.
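For the curious, here is one way that awk step could look. This is a sketch and not the commands actually used in this session: the function names list_damaged_files and restore_damaged are hypothetical, and it assumes the damaged files are listed as absolute paths after the "Permanent errors" line, as in the output above. Entries that are not absolute paths are simply skipped. File names with spaces (like "cider mill crowds.jpg") are handled by reading whole lines.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status -v` output on stdin and prints one
# damaged file path per line.
list_damaged_files() {
    awk '
        /Permanent errors have been detected/ { listing = 1; next }
        # A new pool section ends the file list of the previous pool.
        /^[ \t]*pool:/ { listing = 0 }
        # Keep only absolute paths, stripping the leading indentation.
        listing && /^[ \t]*\// { sub(/^[ \t]+/, ""); print }
    '
}

# Hypothetical helper: copies every damaged file under DST_ROOT back from the
# same relative path under SRC_ROOT. Reads `zpool status -v` output on stdin.
restore_damaged() {
    src_root=$1
    dst_root=$2
    list_damaged_files | while IFS= read -r damaged; do
        rel=${damaged#"$dst_root"/}
        cp -p "$src_root/$rel" "$damaged"
    done
}

# On a live system, with pool1 standing in for the backup:
#   zpool status -v pool2 | restore_damaged /pool1 /pool2
```

With the files copied back, a scrub shows whether the restore did the job.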
# zpool scrub pool2
# zpool status pool2
pool: pool2
state: ONLINE
scrub: scrub completed with 0 errors on Mon Feb 18 14:04:56 2008
config:
NAME STATE READ WRITE CKSUM
pool2 ONLINE 0 0 0
mirror ONLINE 0 0 0
disk3 ONLINE 0 0 0
disk4 ONLINE 0 0 0
spares
spare1 AVAIL
errors: No known data errors
# zpool export pool2
# zpool import pool2
# dircmp -s /pool1 /pool2
Conclusions and Review
So what have we learned ? ZFS and FMA are two great tastes that taste great together. No, that's chocolate
and peanut butter, but you get the idea. One more great example of Isaac's Multiplicity of Solaris.
That, and I have finally found a good lab exercise for the FMA training materials. Ever since Christine Tran put
the FMA workshop together, we have been looking for some good FMA lab exercises. The materials reference a synthetic
fault generator that is not available in public (for obvious reasons). I haven't explored the FMA test harness
enough to know if there is anything in there that would make a good lab. But this exercise that we have just
explored seems to tie a number of key pieces together.
And of course, one more reason why Roxy says,
"You should run Solaris."

Sunday February 17, 2008
And Trixie says .....
Get rid of Roxy.....
And make me the peanut butter and banana sandwich!
Roxy says "You Should Run Solaris"
Look into my eyes.
You want to run Solaris.
You want to run Solaris.
You want to run Solaris.
Repeat after me
You want to run Solaris.
And make me a peanut butter and banana sandwich!

Wednesday October 03, 2007
Live Upgrade from Solaris 10 11/06 to 8/07 without nonglobal zones
Live Upgrade is one of the most useful Solaris features, yet in my travels around the US I still don't see it used as much as I would like. I can think of several reasons for this - not all of them totally valid:
- I tried it once a long time ago and a patch or package that wasn't LU aware messed up my current boot environment. Not valid for Solaris components, although we do see the occasional partner product with this problem. The last one I saw was the NVidia driver, and the good folks from NVidia fixed it very quickly once reported.
- The documentation can be a bit intimidating. Valid with a capital V. But Live Upgrade is an amazingly flexible feature, so at some point you do have to describe these capabilities. As a guide through this documentation, several folks have blogged manageable howto guides. You can find mine back in March 2007, although I've recently updated it. And there are other good blogs with plenty of examples. There is also a very good Blueprint on Live Upgrade.
- It doesn't work with the Veritas Volume Manager.
- I didn't know about Live Upgrade. Well, you do now. But I have noticed that a lot of the Solaris conversation is focused on new features like ZFS, Zones, SMF, and DTrace, and some of the older features like Flash archives and Live Upgrade don't receive the attention they deserve. The simple fact is that Live Upgrade takes all of the pain out of the patching process, at least once you know what to patch.
And I'm sure there are other reasons, but these are the ones I hear most often.
Let's turn our attention to the topic at hand, upgrading a Solaris 10 11/06 system to 8/07, without zones. This example will be on an x64
system, but the SPARC approach is similar.
If you have read my earlier blog on Live Upgrade, you will recall the process is
- Read Infodoc 72099 and install any required patches
- Install the LU packages SUNWluu SUNWlur and SUNWlucfg (if present) from the installation media
- lurename(1m) if you want to change the name of your new boot environment
- lumake(1m) or ludelete(1m) + lucreate(1m) to repopulate the target boot environment with the proper software and configuration files
- luupgrade(1m) to upgrade the target boot environment
- luactivate(1m) to activate the new boot environment
- init 0 to perform the file synchronization and conversions, create the new boot archive and update your GRUB menu
So I fire up my web browser and run over to SunSolve to pick up
Infodoc 72099 and see a rather large set of patches. And there are two lists, one for systems with non-global
zones and one without. Since we're looking at a system without non-global zones we will start with the shorter of the two lists (the next
article will cover systems with nonglobal zones).
Apparently we need patches
Solaris 10 x86 118816-03 or higher nawk patch
Solaris 10 x86 120901-03 or higher libzonecfg patch
Solaris 10 x86 121334-04 or higher SUNWzoneu required patch
Solaris 10 x86 119255-42 or higher patchadd/patchrm patches
Solaris 10 x86 119318-01 or higher SVr4 Packaging Commands (usr) Patch
Solaris 10 x86 117435-02 or higher biosdev patch for GRUB Boot
Reboot after installation
Solaris 10 x86 120236-01 or higher SUNWluzone required patches
Solaris 10 x86 121429-08 or higher SUNWluzone required patches
Solaris 10 x86 121003-03 or higher pax patch
Solaris 10 x86 123122-02 or higher prodreg patch
Solaris 10 x86 121005-03 sh patch
Solaris 10 x86 119043-10 /usr/sbin/svccfg patch
Solaris 10 x86 121902-02 i.manifest r.manifest class action script patch
Solaris 10 x86 120901-03 libzonecfg patch
Solaris 10 x86 120069-03 telnet security patch
Solaris 10 x86 120070-02 cpio patch
Solaris 10 x86 123333-01 tftp patch
Hmmm, seems like a lot of patches and a required reboot! So I fire up our new friend updatemanager to patch my system.
I see that there is a new updatemanager patch available (121119-13), so I installed that one all by itself
and restarted updatemanager.
I soon realize that my choice of patching tools is making this a bit challenging. Users of patch tools such
as Patch Check Advanced (PCA) may have an easier time, but I
was determined to do this with updatemanager, with occasional help from the patch READMEs in SunSolve.
The list of patches required for this upgrade applies to any release of Solaris 10. A fresh install of a Solaris 10 11/06 system only needed the following four patches - which is a lot better than I first thought.
119255-42
121429-08
126539-01 as it replaces the required 121902-02
125419-01 as it replaces the required 120069-03
The difficulty with updatemanager was with the set of obsoleted patches. For example, the required 121902-02 had been
obsoleted by 126539-01, which was already installed, and working that out took a bit of manual trolling through patch READMEs.
So I'll save you the research - it came down to only the four patches above.
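The obsoleted patch check can also be scripted. What follows is only a sketch under a couple of assumptions: that showrev -p prints one line per installed patch in the form "Patch: ID-REV Obsoletes: ... Requires: ... Incompatibles: ... Packages: ...", and that a requirement counts as met when the patch itself, or a patch that obsoletes it, is present at the required revision or higher. The function name patch_satisfied is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: patch_satisfied ID MINREV
# Reads `showrev -p` style output on stdin. Succeeds (exit 0) if patch ID at
# revision MINREV or higher is installed, or is listed as obsoleted by an
# installed patch. Fails (exit 1) otherwise.
patch_satisfied() {
    awk -v id="$1" -v minrev="$2" '
        $1 == "Patch:" {
            # Fields before "Requires:" are the installed patch itself
            # followed by everything it obsoletes.
            for (i = 2; i <= NF && $i != "Requires:"; i++) {
                p = $i
                sub(/,$/, "", p)            # drop the list separator
                n = split(p, part, "-")     # ID-REV
                if (n == 2 && part[1] == id && part[2] + 0 >= minrev + 0)
                    found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# On a live system:
#   showrev -p | patch_satisfied 121902 02 && echo "121902-02 is covered"
```

This only looks at what the installed patches declare, so it is no substitute for reading the README when something looks odd.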
One important note: the required reboot after patch 117435-02 wasn't needed after all - so I'll try to save
all of you Solaris 10 11/06 users one reboot. While I have your attention, it is a good idea, if not a best practice, to
install patch and packaging patches separately.
Feeling a lot better about this process, I proceed and install the four required patches using updatemanager in two steps
(119255-42 and then the other three patches) and all succeeded, as expected. All that was left to do was finish the
standard procedure
# mount -o ro -F hsfs `lofiadm -a /export/iso/s10u4/solarisdvd.iso` /mnt
# pkgadd -d /mnt/Solaris_10/Product SUNWlur SUNWluu SUNWlucfg
# lurename -e nv71 -n s10u4
# lumake -n s10u4
# luupgrade -u -s /mnt -n s10u4
# luactivate s10u4
# init 0
And all went as expected. Next time I will tackle the longer list of patches and examine the same upgrade path, but with nonglobal zones.
Technocrati Tags:
Sun
Solaris
Upgrade
Laptop
Live Upgrade

Thursday June 21, 2007
Updated Solaris Bootcamp Presentations
I've had a great time traveling around the country talking about Solaris. It's not exactly a difficult thing - there's plenty to
talk about. Many of you have asked for copies of the latest Solaris update, virtualization overview and ZFS deep dive. Rather than have you dig through a bunch of old blog entries about bootcamps from 2005, here they are for your convenience.
I hope this will save you some digging through
http://mediacast.sun.com and tons of old blogs.
In a few weeks I'll post a new "What's New in Solaris" which will have some really cool things. But we'll save that for later.
Technocrati Tags:
Sun
Solaris
Zones
VMware
Virtualization
ZFS

Monday June 11, 2007
True Virtualization ?
While this is inspired by a recent conversation with a customer, I have seen the term "true virtualization" used
quite a bit lately - mostly by people who have just attended a VMware seminar, and to a lesser extent folks from
IBM trying to compare LPARs with Solaris zones. While one must give due credit to the
fine folks at VMware for raising Information Technology (IT) awareness and putting virtualization in the common vocabulary,
they hardly have cornered the market on virtualization and using the term "true virtualization" may reveal how narrow an
understanding they have of the concept or an unfortunate arrogance that their approach is the only one that matters.
Wikipedia defines
virtualization as
a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources. While Wikipedia isn't the final authority, this definition is quite good and we will use it to start our exploration.
So what is true virtualization ? Anything that (potentially) hides architectural details from running objects (programs, services, operating
systems, data). No more, no less - end of discussion.
Clearly VMware's virtualization products (ESX, Workstation) do that. They provide
virtual machines that emulate the Intel x86
Instruction Set Architecture (ISA) so that operating systems think they are running on real hardware when in fact they are not. This type of virtualization would be classified as an abstraction type of virtual machines. But so is
Xen, albeit with an interesting twist.
In the case of Xen, a synthetic ISA based on the x86 is emulated removing some of the instructions that are difficult to virtualize.
This makes porting a rather simple task - none of the user space code needs to be modified and the privileged code is generally limited to parts of the kernel that actually touch the hardware (virtual memory management, device drivers). In some respects, Xen is less of an abstraction as it does allow the virtual machines to see the architectural details thus permitting specific optimizations to occur that would be prohibited in the VMware case. And our good friends at Intel and AMD are adding new features to their processors to make virtualization less complicated and higher performance so the differences in approach between the VMware and Xen hypervisors may well blur over time.
But is this true virtualization ? No, it is just one of many types of virtualization.
How about the Java Virtual Machine (JVM) ? It is a run time executive that provides a virtualized environment for a completely synthetic ISA (although real pcode implementations have been done, they are largely for embedded systems). This is the magic behind write once and run anywhere and in general the approach works very well. So this is another example of virtualization - and also an abstraction type. And given the number of JVMs running around out there - if anyone is going to claim true virtualization, it would be the Java folks. Fortunately their understanding of the computer industry is broad and they are not arrogant - thus they would never suggest such folly.
Sun4v Logical Domains (LDOMs) are a thin hypervisor based partitioning of a radically multithreaded SPARC processor. The guest domains (virtual
machines) run on real hardware but generally have no I/O devices. These guest domains get their I/O over a private channel from a service domain (a special type of domain that owns devices and contains the real device drivers). So I/O is virtualized but all other operations are executed on real hardware. The hypervisor provides resource (CPU and memory) allocation and management and the private channels for I/O (including networking). This too is virtualization, but not like Xen or VMware. This is an example of partitioning. Another example is IBM (Power) LPARs albeit with a slightly different approach.
Are there other types of virtualization ? Of course there are.
Solaris zones are an interesting type of virtualization called OS Virtualization. In this case we interpose the virtualization layer between
the privileged kernel layer and the non-privileged user space. The benefit here is that all user space objects (name space, processes, address spaces) are
completely abstracted and isolated. Unlike the methods previously discussed, the kernel and underlying hardware resources are not artificially
limited, so the full heavy lifting capability of the kernel is available to all zones (subject to other resource management policies). The
trade-off for this capability is that all zones share a common kernel. This has some availability and flexibility limitations that should
be considered in a system design using zones. Non-native (Branded) zones offer some interesting flexibilities that we are just now beginning to
exploit, so the future of this approach is very bright indeed. And if I read my competitors' announcements correctly, even our good friends at IBM are embracing this approach with future releases of AIX. So clearly there is something to this thing called OS Virtualization.
And there are other approaches as well - hybrids of the types we have been discussing. Special purpose libraries that either replace or interpose between common system libraries can provide some very nice virtualization capabilities - some of these transparent to applications, some not. The open source project
Wine is a good example of this. User mode Linux and its descendants offer some abilities to run an operating system as a user mode program, albeit not particularly efficiently.
QEMU is an interesting general purpose ISA simulator/translator that can be used to host non-native operating systems (such as Windows while running Solaris or Linux). The interesting thing about QEMU is that you can strip out the translation features with a special kernel module (kqemu) and the result is very efficient and nicely performing OS hosting (essentially simulating x86 running on x86). Kernel-based Virtual Machines (KVM) extends the QEMU capability to add yet another style of virtualization to Linux. It is not entirely clear at present whether KVM is really a better idea or just another not invented here (NIH) Linux project. Time will tell, but it would have been nice for the Linux kernel maintainers to take a page from OpenSolaris and embrace an already existing project that had some non-Linux vendor participation (*BSD, Solaris, Plan 9, plus some mainstream Linux distributions). At the very least it is confusing as most experienced IT professionals will associate KVM with Keyboard Video and Mouse switching products. There are other commercial products such as QuickTransit that use a similar approach (ISA translation).
And there are many many more.
So clearly the phrase "true virtualization" has no common or useful meaning. Questioning the application or definition of the phrase will likely uncover a predisposition or bias that might be a good starting point to carry on an interesting dialog. And that's always a good idea.
I leave you with one last thought. It is probably human nature to seek out the one uniform solution to all of our problems, the Grand Unification Theory being a great example. But in general, be skeptical of one size fits all approaches - while they may in fact fit all situations, they are generally neither efficient nor flattering. What does this have to do with virtualization ? Combining various techniques quite often will yield spectacular results. In other words, don't think VMware vs Zones - think VMware and Zones. In fact if you think Solaris, don't even think about zones, just do zones. If you need the additional abstraction to provide flexibility (heterogeneous or multiple version OS support) then use VMware or LDOMs. And zones.
Next time we'll take a look at abstraction style virtualization techniques and see if we can develop a method of predicting the overhead that each technique might impose on a system. Since a good apples to apples benchmark is not likely to ever see the light of day, perhaps some good old fashioned reasoning can help us make sense of what information we can find.
Technocrati Tags:
Sun
Solaris
Zones
VMware
Virtualization

Monday March 26, 2007
Securing MySQL using SMF - the Ultimate Manifest
The best way to learn the Solaris Service Management Facility (SMF) is to migrate a legacy
service. The version of MySQL that comes with Solaris is an ideal application. It is
relatively simple, has few dependencies, and can be done in just a few quick edits of
an existing manifest (utmp would be a good starting template). We cover the basic process
in the
SMF Deep Dive and various people have contributed manifests to
OpenSolaris and
Blastwave. While these are good illustrations of how easy the process is, few show what SMF can really do. The motivation for this how-to came from a
recent Solaris Bootcamp attendee who asked "what was wrong with the RC scripts the way they were ?".
Without skipping a beat.....
- Easy support of multiple service instances
- Deterministic location of service log files
- Timeouts on the start and stop methods to prevent system boots from hanging indefinitely
- Quickly observable service state
- Flexible service dependencies
- Automatic restarting of the service upon failure
Upon closer inspection, recognizing when the service terminated and
restarting it automatically isn't that special for mysql. The mysqld_safe daemon actually performs that step, restarting
the database server if it fails. Yes, this is unique to mysql and may not exist for other services. Certainly, if
the mysqld_safe parent actually fails, SMF does provide an additional capability by automatically restarting it. But we need more.
Most of the service migration demonstrations are single instance with no downstream application dependencies - so we still need more.
The mysql service start script runs through a set of configuration files, setting variables and starting a
detached daemon, so it's highly unlikely that it will ever get stuck. Sure, it can get hacked and have bad things happen to it, but as delivered it is relatively safe. So we still need more.
The answer to the question lies in security.
SMF provides a rich set of security features that demonstrate the power of Solaris Role Based Access
Control (RBAC) and least privilege. Contrary to what you might think, these features are quite
easy to use - once you learn a few simple concepts. This is how we will answer the question
"what was wrong with the RC scripts the way they were?".
Authorizations
One of the most useful applications of RBAC is to create administration and operations roles.
While the details of these roles will vary from customer to customer, the common theme is
that operator roles should be able to start and stop a service in a safe manner and an
administrative role should be able to modify service properties (of which some of those may
be the ability to start or stop the service).
Historically this has been accomplished by third party security software inserting itself
all over the kernel (sometimes in a manner that makes upgrades or maintenance difficult) or custom
scripts that make use of setuid(2). Solaris 10 can perform many of these functions with just a
few entries to some configuration files, and SMF makes this process extremely easy.
You can get lots of valuable information on Solaris Security features (roles, profiles, auths,
privileges) at the
OpenSolaris Security
Community. As you navigate the wealth of white papers, ARC cases, and how-to examples, think
of Solaris authorizations as the magic that makes this possible (or more precisely simple).
In a sentence, auths are labels that a privileged application uses to restrict access to its features.
In our case the privileged applications are
svcadm(1M) and
svccfg(1M). If you read the
smf_security(5) man page (which is excellent reading) you will see that SMF provides several authorizations.
- solaris.smf.manage - ability to start and stop any SMF managed service (good - but not what we are looking for)
- action_authorization (in the general property group) - allows a non-root user to run the methods (start, stop, and refresh)
- value_authorization (any property group) - change properties in the property group (such as general/enabled)
- modify_authorization (any property group) - add or delete properties in the property group
Now this is getting interesting. So it appears that we can use either the action or modify authorization
for the operator role. So which one do we use ?
The action_authorization would only allow running the method but not modifying any of the properties. The
implication is that you can do
# svcadm enable -t mysql
but not
# svcadm enable mysql
The difference between the two commands is that enable without -t will try to set the property general/enabled to true in addition to running the start method. This would require the value_authorization. But value_authorization will allow you to change (almost) any property in the
property group (in this case the general property group), so let's see what else value_authorization will
let you do.
# svcprop -p general ssh
general/enabled boolean true
general/action_authorization astring solaris.smf.manage.ssh
general/entity_stability astring Unstable
general/single_instance boolean true
Hmmm, the only properties that might be abused would be the authorizations, but those require additional authorizations (solaris.smf.modify) to change. So it would seem that value_authorization would be safe for an operator role - unorthodox perhaps, but safe.
modify_authorization would allow the creation of other service properties, and if limited to the general
property group might be confusing, but relatively harmless - unless of course we add a new general property later.
For this reason, modify_authorization would not be a good candidate for an operator role.
So which authorization to use ? Use action_authorization if you want a user (or role) to be able to start and
stop the service, but not make the change permanent. This is the most common case. Use value_authorization in the general property group if you want that user or role to be able to permanently turn a service on or off - this
is generally an administrative role.
Let's put this all together.
Start with your existing SMF manifest for MySQL. If you don't have one, you can use mine at
http://blogs.sun.com/resources/bobn/mysql.xml or Keith Lawson's
contributed MySQL manifest at the
OpenSolaris SMF Contributed Manifests and Methods page.
Add the following section
<property_group name='general' type='framework'>
<propval name='action_authorization' type='astring' value='mysql.operator' />
<propval name='value_authorization' type='astring' value='mysql.administrator' />
</property_group>
Import the new manifest by the method of your choice (svccfg import, /lib/svc/method/manifest-import, or reboot)
and your new MySQL can be managed by auths. So how do we get those auths assigned to users (or roles) ?
Authorizations are granted to users and roles by the configuration file /etc/user_attr. You can read the
user_attr(4) man page for all of the details, but the process is to add auths=mysql.operator to the user
or role entry. For example
# grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator
It is possible that a user or role may not be present in /etc/user_attr. In that case just add
a line like the one above and assign the appropriate auth.
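If you would rather not hand edit the file, the change can be sketched as a small filter. This is a rough illustration only: add_auth is a hypothetical helper, it assumes the colon separated layout shown above with the attributes in the fifth field, and it should be run against a copy of /etc/user_attr rather than the live file.

```shell
# Hypothetical helper: read user_attr style lines on stdin and append an
# authorization to the named user's auths= attribute (creating it if needed).
add_auth() {
    # $1 = user or role name, $2 = authorization to add
    awk -F: -v user="$1" -v auth="$2" '
    BEGIN { OFS = ":" }
    $1 == user {
        n = split($5, kv, ";")
        found = 0; attrs = ""
        for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^auths=/) {
                found = 1
                # append only when the auth is not already listed
                if (index("," substr(kv[i], 7) ",", "," auth ",") == 0)
                    kv[i] = kv[i] "," auth
            }
            attrs = attrs (i > 1 ? ";" : "") kv[i]
        }
        if (!found)
            attrs = attrs (n > 0 ? ";" : "") "auths=" auth
        $5 = attrs
    }
    { print }
    '
}
```

Something like add_auth joeuser mysql.administrator < user_attr.copy > user_attr.new (both file names are placeholders) leaves every other entry untouched, and running it a second time does not add the auth twice.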
Let's see all of this in action.
% auths
mysql.operator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable
% svcadm enable -t mysql
% svcs mysql
STATE STIME FMRI
online 15:51:02 svc:/application/mysql:default
So far so good.
% svcadm enable mysql
svcadm: svc:/application/mysql:default: Permission denied.
Why did this fail ?
% svcprop -p general mysql
general/enabled boolean true
general/action_authorization astring mysql.operator
general/entity_stability astring Unstable
general/single_instance boolean true
general/value_authorization astring mysql.administrator
Because enable also tries to set the general/enabled property - and that requires value or modify authorization. Change my user definition in /etc/user_attr
% grep ^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator,mysql.administrator
% auths
mysql.operator,mysql.administrator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.*,solaris.network.hosts.*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable
% svcadm enable mysql
% svcs mysql
STATE STIME FMRI
online 16:10:37 svc:/application/mysql:default
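To make the before and after a little more concrete, here is a rough sketch of the membership test involved. It is an approximation written for illustration, not the check Solaris itself performs: has_auth is a hypothetical function, and it simply treats a trailing wildcard such as solaris.snmp.* as a shell pattern.

```shell
# Hypothetical helper: is authorization $2 covered by the comma separated
# list in $1 (the format printed by auths) ?
has_auth() {
    old_ifs=$IFS
    IFS=,
    set -f                      # do not let "*" expand to file names
    for held in $1; do
        case "$2" in
            # unquoted so that a held auth like solaris.snmp.* acts as a pattern
            $held) IFS=$old_ifs; set +f; return 0 ;;
        esac
    done
    IFS=$old_ifs
    set +f
    return 1
}
```

Used as has_auth "$(auths)" mysql.administrator, this fails for the first joeuser definition and succeeds for the second, which lines up with the Permission denied we saw on the plain enable.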
This is all very cool - but we can still do more.
Removing Root from the Equation
For both simplicity and compatibility with other operating systems, the MySQL service is started by
a script that is run as root. This script is generally linked into /etc/rc3.d, but since we have converted
it to an SMF service we have many more options. We have already looked at delegated administration using
auths, time to turn our attention to privileges.
# /etc/sfw/mysql/mysql.server start
# ps -ef | grep mysqld | grep -v grep
mysql 1975 1955 0 21:43:17 pts/8 0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --user=mysql --pid
root 1955 1 0 21:43:17 pts/8 0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# /etc/sfw/mysql/mysql.server stop
This suggests two immediate questions. Does the parent mysqld_safe really have to run as root, or can it be
started as a lesser privileged user ? If it can run as a non-root user, exactly what privileges are required to run mysql ?
The answer to the first question is simple: it can be run as a regular user. It only runs as root
out of convenience to operating systems that don't have as sophisticated a security framework as Solaris.
# su - mysql
Sun Microsystems Inc. SunOS 5.11 snv_57 October 2007
$ sh /etc/sfw/mysql/mysql.server start
$ /usr/sfw/bin/mysqladmin status
Uptime: 1174 Threads: 1 Questions: 1 Slow queries: 0 Opens: 6 Flush tables: 1 Open tables: 0 Queries per second avg: 0.001
$ sh /etc/sfw/mysql/mysql.server stop
Killing mysqld with pid 1975
Wait for mysqld to exit done
$ exit
#
Now that we have established the fact that a fully privileged user isn't required to run MySQL, what privileges
are really required ? How far can we restrict the mysql user ? Glenn Brunette's privilege debugger privdebug.pl is the perfect tool to help us answer this question.
# privdebug.pl -f -v -e "su - mysql /usr/sfw/sbin/mysqld_safe --user=mysql"
STAT TIMESTAMP PPID PID PRIV CMD
USED 2005619300419 2211 2212 proc_taskid su
USED 2005620883559 2211 2212 proc_setid su
USED 2005621147993 2211 2212 proc_setid su
USED 2005621161490 2211 2212 proc_setid su
USED 2005621165094 2211 2212 proc_setid su
USED 2005630560973 2211 2212 proc_exec su
Starting mysqld daemon with databases from /var/mysql contract_event
USED 2005679230394 2211 2212 proc_fork sh
USED 2005750348321 2211 2212 proc_fork sh
USED 2005751386190 2212 2214 proc_exec sh
USED 2005756249415 2211 2212 proc_fork sh
USED 2005757238096 2212 2215 proc_fork sh
USED 2005758495289 2212 2215 proc_exec sh
USED 2005761778059 2211 2212 proc_fork sh
USED 2005762623018 2212 2217 proc_fork sh
USED 2005763874569 2212 2217 proc_exec sh
USED 2005767441408 2211 2212 proc_fork sh
USED 2005768337263 2212 2219 proc_exec sh
USED 2005772916576 2211 2212 proc_fork sh
USED 2005773996432 2212 2220 proc_fork sh
USED 2005775465400 2212 2220 proc_exec sh
USED 2005778750305 2211 2212 proc_fork sh
USED 2005779846375 2212 2222 proc_exec sh
USED 2005782042348 2211 2212 proc_fork sh
USED 2005783110622 2212 2223 proc_exec sh
USED 2005785636236 2211 2212 proc_fork sh
USED 2005786824801 2212 2224 proc_exec sh
USED 2005788593079 2212 2224 proc_exec nohup
USED 2005790693138 2212 2224 proc_exec nohup
USED 2005792812264 2211 2212 proc_fork sh
USED 2005794010658 2212 2225 proc_exec sh
USED 2005795756145 2212 2225 proc_exec nohup
USED 2005797704273 2212 2225 proc_exec nohup
NEED 2005799674735 2211 2212 file_dac_write sh
USED 2005800708905 2211 2212 proc_fork sh
USED 2005801869396 2212 2226 proc_exec sh
USED 2005804780370 2211 2212 proc_fork sh
USED 2005805854317 2212 2227 proc_exec sh
USED 2005807860051 2211 2212 proc_fork sh
USED 2005808907677 2212 2228 proc_exec sh
USED 2005811293197 2211 2212 proc_fork sh
USED 2005812393916 2212 2229 proc_exec sh
USED 2005814589669 2212 2229 proc_exec nohup
USED 2005816674186 2212 2229 proc_exec nohup
STOPPING server from pid file /var/mysql/pandora.pid contract_event
070325 22 11 mysqld ended 18 contract_event
Ignore the proc_taskid and proc_setid, they are artifacts of using su(1M) to run the database server as user
mysql. We see that mysqld only needs proc_fork and proc_exec. The file_dac_write failure comes from a call to
access(2) and is not needed for proper operation.
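A listing that long is easier to read once it is boiled down. The sketch below assumes the six column layout shown above (STAT TIMESTAMP PPID PID PRIV CMD) and that the output has been saved to a file; priv_summary is a hypothetical name and not part of privdebug.pl.

```shell
# Hypothetical helper: summarize privdebug style output read from stdin as
# "STAT PRIV COUNT", ignoring the header and any interleaved daemon messages.
priv_summary() {
    awk '($1 == "USED" || $1 == "NEED") && NF == 6 { print $1, $5 }' |
        LC_ALL=C sort | uniq -c |
        awk '{ print $2, $3, $1 }'
}
```

Running priv_summary < privdebug.out (the file name is a placeholder) collapses the trace into one line per privilege, so the proc_fork and proc_exec entries and the lone file_dac_write request stand out immediately.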
What do we do with what we have just learned ?
Referring to the smf_method(5) man page (another excellent read), it seems that all we need to do is add a
method_credential option to the various methods (start, stop, and refresh). The appropriate section of my new and improved MySQL manifest now looks like
<exec_method type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='60'>
<method_context>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='120'>
<method_context>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart' timeout_seconds='120'>
<method_context>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
So we quickly modify our manifest and import it using one of the standard methods (svccfg import, /lib/svc/method/manifest-import, or a reboot) and we should be done, right ? Well...... not exactly - but we're close.
% svcadm enable mysql
% svcs mysql
STATE STIME FMRI
maintenance 21:53:37 svc:/application/mysql:default
$ tail -5 `svcprop -p restarter/logfile mysql`
[ Mar 26 21:51:12 Method "stop" exited with status 0 ]
[ Mar 26 21:53:36 Enabled. ]
[ Mar 26 21:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
svc.startd could not set context for method: chdir: No such file or directory
[ Mar 26 21:53:37 Method "start" exited with status 96 ]
Doh! When we followed the MySQL installation instructions at /etc/sfw/mysql/README.solaris.mysql
we created a user account called mysql. But we didn't specify a home directory, did we ? No - so
the default template value of /home/mysql was used. But there is no /home/mysql, is there ? Well, no.
How do we fix this ?
Set a reasonable home directory for the mysql user. How about /var/mysql ?
Elsewhere in the installation instructions we did set ownership and proper permissions to this
directory - so that would seem like a reasonable home directory.
As root
# usermod -d /var/mysql mysql
That is one solution, but it may not be practical for all cases. Perhaps a better idea would be
to provide a working directory for each of the methods. The benefit is that I could set it differently for each
service instance. This would be done in the method_context tag for the method. So I modify my service
manifest to look like
<exec_method type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='60'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m' timeout_seconds='120'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
<exec_method type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart' timeout_seconds='120'>
<method_context working_directory='/var/mysql'>
<method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec' />
</method_context>
</exec_method>
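Since the three stanzas differ only in the method name, the exec string and the timeout, they can be generated rather than typed. This is just a convenience sketch: method_stanza and the four variables are hypothetical names, and the output still has to be pasted into the service manifest by hand.

```shell
# Hypothetical settings, taken from the manifest fragment above.
RUN_USER=mysql
RUN_GROUP=mysql
PRIVS=proc_fork,proc_exec
WORKDIR=/var/mysql

# Print one exec_method stanza.
# $1 = method name, $2 = exec string, $3 = timeout in seconds
method_stanza() {
    cat <<EOF
<exec_method type='method' name='$1' exec='$2' timeout_seconds='$3'>
<method_context working_directory='$WORKDIR'>
<method_credential user='$RUN_USER' group='$RUN_GROUP' privileges='$PRIVS' />
</method_context>
</exec_method>
EOF
}

# Print the start, stop and refresh stanzas shown above.
all_stanzas() {
    method_stanza start   '/etc/sfw/mysql/mysql.server %m'      60
    method_stanza stop    '/etc/sfw/mysql/mysql.server %m'      120
    method_stanza refresh '/etc/sfw/mysql/mysql.server restart' 120
}
```

Changing WORKDIR before calling all_stanzas is one way to get a different working directory for each service instance.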
Reimport the manifest and let's see how things go.
# svccfg import /var/svc/manifest/application/mysql.xml
# svcadm clear mysql
# svcs mysql
STATE STIME FMRI
maintenance 22:17:49 svc:/application/mysql:default
Argh - now what ?
# tail -5 `svcprop -p restarter/logfile mysql`
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
[ Mar 26 22:17:49 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
Doh! Since Solaris delivers MySQL as a legacy service the start script doesn't have execute
permissions for the mysql user. That's easy to fix.
# ls -l /etc/sfw/mysql/mysql.server
-rwxr--r-- 1 root sys 5655 Mar 22 17:05 /etc/sfw/mysql/mysql.server
# chown mysql /etc/sfw/mysql/mysql.server
# svcadm clear mysql
# svcs mysql
STATE STIME FMRI
online 22:23:08 svc:/application/mysql:default
Now that's more like it. One last item to check.
# ps -ef | grep mysqld | grep -v grep
mysql 12656 12634 0 22:23:11 ? 0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --pid-file=/var/my
mysql 12634 1 0 22:23:09 ? 0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# ppriv 12634
12634: /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var
flags =
E: basic,!file_link_any,!proc_info,!proc_session
I: basic,!file_link_any,!proc_info,!proc_session
P: basic,!file_link_any,!proc_info,!proc_session
L: all
Now that's what I wanted to see. The parent mysqld_safe is now running as user mysql and with
exactly the right privileges. This is very cool indeed. Armed with this information we could
also create a zone and use the limitpriv attribute to restrict the zone privilege - but we'll leave that
for another day.
Conclusion
It is quite easy not only to leverage Solaris authorizations but also to run services with restricted
privileges. We have presented a few templates and a general approach that should make this process less cumbersome.
More important though - we now have a compelling reply when asked "what was wrong with the RC
scripts the way they were?"
Technocrati Tags:
Sun
Solaris
SMF

Tuesday March 20, 2007
Zones in a Flash - Literally
Fantastic improvements have been made in the Solaris installation and upgrade
process - even more in OpenSolaris (available in the various community releases).
As we
examined the cloning feature introduced in Solaris 10 11/06, it became apparent that we
have stumbled upon a most intriguing capability. When combining
zone cloning with the attach/detach capability we have discovered a model for
flashing zones: zoneflash.
In a recent boot camp we took a look at this in more detail. Unfortunately the
slides (which will be posted soon) didn't quite follow the level of depth we were
exploring. Several people asked for notes on how this works - and here they are.
The irony is that it will take longer to read about it than it does to perform
the actual process - but it is so cool.
The Promise
We start with a fresh Solaris system. In this case just live upgraded from media, but it
could have been jumpstarted from media or a flash archive. The key point here is
that the system has had very little done to it, other than naming and some
software installation. Since zone attach makes sure that key system components
(specifically packages and patches) are compatible, it makes sense to build our
flashzones on a system that will look similar to those that will be built in the
future.
So how many zones will we build ? That's a good question. If this were
system flasharchives the answer would be as few as possible - one per architecture
in the most efficient case. But these zoneflashes are different - just applications,
some metadata, and perhaps some customizations (naming, security, SMF). It seems
reasonable to create one zoneflash for each type of application server you
would deploy - think of it as a userspace template. In this example I have chosen four: a blank uncustomized flash
(for building a new zoneflash in a flash), database server (MySQL),
web server (apache2), and the community edition of webmin (just another application).
Our procedure will be to build a minimal default zoneflash, run it through first boot
to populate the SMF repository, and then clone it for the remaining zoneflashes.
Each of these will be booted, customized for the particular application, and tested to
make sure everything is operating properly.
We will then detach the zones and move the detached zoneroots onto some media that can be
transported. Of course, keeping with the theme of zones and flash, the transport
could be the flasharchive itself. How cool would it be to jumpstart a server
using flasharchives and have all the application zones already present in a known
location, such as /zoneflash ? Unfortunately, I'm sitting in seat 18A on an
American Airlines flight to Los Angeles and don't quite have the required infrastructure
to do that sort of test. But I do have a USB stick and multiple boot environments.
That will do nicely.
Once attached, we will clone the zoneflashes as necessary, adding resources (network, local filesystems)
and attributes (resource controls) required for the proper operation of the application. When finished we
will detach the zoneflashes so they may be used elsewhere.
The Turn
The first step is to build and boot a simple generic sparse root zone. Since this zone isn't really meant for
operation, most zonecfg attributes (network configuration, resource limits, et al) will be
skipped. We will add them later when we build the real zones - remember, these are just
user space application templates.
# zonecfg -z default
default: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:default> create
zonecfg:default> set zonepath=/local/default
zonecfg:default> add inherit-pkg-dir; set dir=/opt; end
zonecfg:default> commit
zonecfg:default> exit
#
# zoneadm -z default install
A few minutes later we have an installed zone, ready for first boot. Since I've attended my Solaris Zones
Best Practices class, or at least read the materials, I know how to build a sysidcfg file that will satisfy
the sysidtool first boot service. This will allow the zone to boot up all the way without any additional
console interaction. Let's do that for our new zone.
# cat > /local/default/root/etc/sysidcfg <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxx (supply your own encrypted string from /etc/shadow - I'm not going to post mine!)
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=default}
EOF
# zoneadm -z default boot
# zlogin -C default
We need to let first boot processing complete. Since we supplied a valid sysidcfg, it is just
a matter of waiting for manifest-import and sysidtool to complete their magic. When complete,
log in and take a look around to make sure all is well. Once satisfied, shut down the zone
(either from inside the zone or from the global zone) - we are through with it for now.
(from the global zone)
# zoneadm -z default halt
Now we are done with this first zone. Time to clone it for our remaining application zones.
Please pardon a bit of inline shell scripting - I hate to type the same thing over and over and over.
Sort of makes for a nice script template, doesn't it ? Not quite the sophistication of Brad Diggs'
zonemanager, but it will do nicely for our example.
# for zone in webmin mysql web
do
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
zoneadm -z ${zone} clone default
echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
zoneadm -z ${zone} boot
done
#
What in the heck was that all about ? OK, one more time - line by line with annotation.
# for zone in webmin mysql web
do
A quick interactive loop for the creation of three application zones. The variable ${zone} will be set to
the name of the zone we are trying to construct.
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
A one liner that creates a new zone configuration based on the already existing
default. At this point the only thing we need to change is the zonepath, and it should be
set to /local/${zone}.
zoneadm -z ${zone} clone default
We recognize this as a zone cloning operation. The zone root is copied and a /reconfigure is
created in the new zone root so that sysidtool performs a complete configuration on first boot.
If you happen to be running on a recent release of OpenSolaris, you can put your zoneroot on
ZFS and the cloning operation will only take a few seconds and very little additional disk space will
be required. Those of us on Solaris 10 11/06 will have to wait for the 160MB or so to be copied.
Still better than the 9 minutes to go through a complete zone installation.
echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
This step creates a custom sysidcfg file for each zone. Remember to supply your own root
password from /etc/shadow in the global zone. This answers all of the sysidtool
questions, including the NFSv4 question.
zoneadm -z ${zone} boot
Boot the zone. If we have done everything correctly, the next interaction will be
with console login.
done
Close the for loop in the interactive script. This process will take a few minutes
on Solaris 10 11/06, or if we are being clever with OpenSolaris and ZFS - a few seconds.
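If the eight echo lines bother you, the sysidcfg step can be folded into a small function with a here-document. This is only a sketch: write_sysidcfg is a made-up name, and the three arguments (zone root, hostname, encrypted password string) are my guess at what you would want to vary. It assumes the etc directory already exists under the zone root, which it will after a clone.

```shell
# Hypothetical helper (not part of Solaris): write a sysidcfg for one zone.
#   $1 = zone root directory, for example /local/web/root
#   $2 = hostname to assign to the zone
#   $3 = encrypted root password string copied from /etc/shadow
write_sysidcfg() {
    cat > "$1/etc/sysidcfg" <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=$3
system_locale=C
network_interface=NONE {hostname=$2}
terminal=ansi
timezone=US/Central
EOF
}
```

Inside the loop, the echo lines would then shrink to a single call such as write_sysidcfg /local/${zone}/root ${zone} 'xxxxxxxxxxx'.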
Now for the hard part - customizing the individual application zones. Well, it's not
all that difficult. And if you do this regularly, you probably have scripts to do
most of the work. It's just individual application installation and customization.
Here is what I did for my example zones.
MySQL
The installation instructions for the Solaris 10 MySQL can be found in /etc/sfw/mysql/README.solaris.mysql.
There is a typo in the Solaris 10 version of the README. It will cause a lot of grief if you cut and
paste without looking at the results. Fortunately it has been corrected in nevada (aka OpenSolaris
Community Edition).
Boot the mysql zone and log in as root.
# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql (this line is incorrect in the Solaris 10 README - my chmod works better with two arguments)
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf
The installation instructions continue by linking the start script into /etc/rc3.d. Since we are big SMF fans in these parts, let's do that instead. Feel free to use
my MySQL manifest as it contains a couple of
cool features (value and action authorizations - more on that later).
Since the mysql zone doesn't have any networking configured, perform this next step from the global zone.
If you already have a suitable manifest, or have stashed mine away somewhere in the global zone you can
use that instead.
# cd /local/mysql/root/var/svc/manifest/application
# wget http://blogs.sun.com/bobn/resource/mysql.xml
It's probably a good idea to make sure that all of this is working properly. Either reboot the
mysql zone, run the manifest-import service manually, or run svccfg import on the new manifest.
Your choice. What you should see upon completion is
# svcs mysql
STATE STIME FMRI
online 14:41:19 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 459 Threads: 1 Questions: 2 Slow queries: 0 Opens: 6 Flush tables: 1 Open tables: 0 Queries per second avg: 0.004
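If you would rather script that check than eyeball it, the svcs listing is easy to test. A minimal sketch, assuming the default three column layout shown above (STATE STIME FMRI); service_is_online is a hypothetical name of my own, and it reads the listing on standard input so it can be tried without a running zone.

```shell
# Hypothetical helper: succeed if the svcs listing on stdin shows
# the FMRI given as $1 in the online state.
service_is_online() {
    while read state stime fmri; do
        # The header line never matches because its first column is STATE.
        [ "$state" = online ] && [ "$fmri" = "$1" ] && return 0
    done
    return 1
}

# Example (run inside the zone):
#   svcs mysql | service_is_online svc:/application/mysql:default && echo "mysql is up"
```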
We're done for now. Unless of course you want to go for some extra credit. In that case
- Set up a web server with PHP support. Apache 1 plus the SFWmphp package from the Solaris Companion
will do just fine.
- Download and unpack phpMyAdmin in the webserver htdocs directory.
- Create a user with the mysql.operator authorization
- Create a user with the mysql.administrator authorization
Shut down the mysql zone.
Web
This is about as easy as it gets. Boot the web zone and perform the following steps.
# cp /etc/apache2/httpd.conf-example /etc/apache2/httpd.conf
# svcadm enable apache2
A quick check to make sure all is well.
# svcs apache2
STATE STIME FMRI
online 17:17:41 svc:/network/http:apache2
# telnet localhost 80
Trying ::1...
telnet: connect to address ::1: Network is unreachable
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>
Connection to localhost closed by foreign host.
We're done for now. Shut down the web zone.
Webmin
This one is a little more complicated. We covered it last time in the zone cloning discussion, but it is worth a second look.
Our task here is to replace the Solaris webmin with the latest download from
http://webmin.com. The technique we are using will allow us to install a custom version of
an application into a sparse root zone. Specifically, webmin.com's package installs into /opt/webmin, but
/opt is a read-only inherit-pkg-dir. The easiest solution for this would be the creation of a symbolic
link in the global zone /opt to point to a location that can be safely written by each non-global zone.
In my example that would be /local-pkgs.
In the global zone, create the link in /opt, create the local package directory in the webmin zoneroot, and
download the latest webmin package.
# ln -s ../local-pkgs/webmin /opt/webmin
# mkdir -p /local/webmin/root/local-pkgs/webmin
# cd /local/webmin/root/var/tmp
# wget http://prdownloads.sourceforge.net/webadmin/webmin-1.330.pkg.gz
# gunzip webmin-1.330.pkg.gz
Now boot the webmin zone and log in as root.
# zoneadm -z webmin boot
# zlogin webmin
Remove the Solaris webmin packages (SUNWwebminu SUNWwebminr). The usr package needs to be
removed twice - the first pkgrm will leave it as a partially installed package, the second
will completely remove it - at least as far as our zone (and future patching) is concerned.
Once removed, install the webmin.com version, which should be conveniently located in /var/tmp.
# pkgrm SUNWwebminu SUNWwebminr SUNWwebminu
# pkgadd -d /var/tmp/webmin-1.330.pkg
We are done with this zone. Shut it down.
Detach
We have just built four zones: an empty zone suitable for future customizations, one with the Solaris webmin
replaced by the community edition, one with a working MySQL database, and one with a webserver. The last
task to be performed on these zones in their current state is to detach them, using another new feature in Solaris 10 11/06.
Zone detach copies the zone configuration into the zoneroot (to be used with a subsequent zone attach)
and sets the current zone state to configured. You can even delete the zone configurations as a final
cleanup prior to building a flash archive.
# zoneadm -z default detach
# zoneadm -z webmin detach
# zoneadm -z mysql detach
# zoneadm -z web detach
# zonecfg -z default delete -F
# zonecfg -z webmin delete -F
# zonecfg -z mysql delete -F
# zonecfg -z web delete -F
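Eight nearly identical commands are begging for a loop. Here is a sketch that does the same detach and delete for any list of zones. The run wrapper and the DRY_RUN switch are my own additions (not zoneadm features); by default the commands are only printed so you can check them first. Note that this version handles one zone at a time rather than doing all the detaches first.

```shell
# DRY_RUN=yes (the default here) only prints the commands;
# set DRY_RUN=no in the environment to really execute them.
DRY_RUN=${DRY_RUN:-yes}

run() {
    if [ "$DRY_RUN" = yes ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Detach each named zone, then delete its configuration from this system.
detach_and_delete() {
    for zone in "$@"; do
        run zoneadm -z "$zone" detach || return 1
        run zonecfg -z "$zone" delete -F || return 1
    done
}

detach_and_delete default webmin mysql web
```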
And flash
Unless the person in 18B wants to be a jumpstart server, we will have to simulate the jumpstart/flasharchive
process. We can do this by booting into an alternate boot environment and then
delivering the detached zoneroots by some sort of shared or removable storage - something like a USB memory stick.
When we are done with this exercise, our zoneflashes will still be on the memory device, ready for their next use. Since the zones will never be booted, just cloned, the speed of the memory device really isn't important.
We need to prepare the USB memory stick (currently formatted as FAT16). We will use rmformat -l to
locate the device, fdisk to put a proper label on it, and finally newfs to install a proper file system.
ZFS would be interesting, but it would just get in our way later.
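Picking the USB device out of the rmformat listing can be scripted as well. A rough sketch, assuming the listing looks like the one shown just below (a "Logical Node:" line followed later by a "Bus:" line for each device); usb_nodes is a hypothetical name and the filter reads the listing on standard input.

```shell
# Hypothetical filter: print the logical node of every device whose bus is USB.
# Remembers the most recent "Logical Node:" value and emits it when a
# "Bus: USB" line follows.
usb_nodes() {
    awk '/Logical Node:/ { node = $NF }
         /Bus:/ && $NF == "USB" { print node }'
}

# Example:
#   rmformat -l | usb_nodes
```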
# rmformat -l
Looking for devices...
1. Logical Node: /dev/rdsk/c2t0d0p0
Physical Node: /pci@0,0/pci1179,1@1d,7/storage@4/disk@0,0
Connected Device: USB DISK 2.0 PMAP
Device Type: Removable
Bus: USB
Size: 984.0 MB
Label:
Access permissions:
2. Logical Node: /dev/rdsk/c1t0d0p0
Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
Connected Device: TEAC DW-224E-A 7.2A
Device Type: CD Reader
Bus: IDE
Size:
Label:
Access permissions:
# fdisk /dev/rdsk/c2t0d0p0
3 (to delete the existing partition)
1 (to create a new Solaris partition)
5 (to exit and write the new label)
# newfs /dev/rdsk/c2t0d0s2
newfs: construct a new file system /dev/rdsk/c2t0d0s2: (y/n)? y
/dev/rdsk/c2t0d0s2: 2009088 sectors in 981 cylinders of 64 tracks, 32 sectors
981.0MB in 62 cyl groups (16 c/g, 16.00MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
1705632, 1738432, 1771232, 1804032, 1836832, 1869632, 1902432, 1935232,
1968032, 2000832
# mkdir /tmp/flash
# mount /dev/dsk/c2t0d0s2 /tmp/flash
# cd /local
# find default webmin web mysql -print | cpio -pdum /tmp/flash
# umount /tmp/flash
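Before unmounting, it may be worth a quick sanity check that cpio copied everything. The following is a crude sketch that only compares the number of entries in each zoneroot on both sides; it says nothing about file contents. The function name same_entry_count is made up for this example.

```shell
# Hypothetical check: compare entry counts for the same subdirectories
# under two parent directories.
#   $1 = source parent (for example /local)
#   $2 = destination parent (for example /tmp/flash)
#   remaining arguments = subdirectories to compare
same_entry_count() {
    src=$1; dst=$2; shift 2
    for d in "$@"; do
        n_src=`cd "$src" && find "$d" -print | wc -l`
        n_dst=`cd "$dst" && find "$d" -print | wc -l`
        # Unquoted on purpose so leading blanks from wc are dropped.
        [ $n_src -eq $n_dst ] || {
            echo "$d: $n_src entries in source, $n_dst in copy"
            return 1
        }
    done
    return 0
}

# Example:
#   same_entry_count /local /tmp/flash default webmin web mysql
```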
We are now done with the original system. At this point we would create a flasharchive (with the detached
zoneroots in a convenient place in the archive).
The Prestige
The final act in our magic trick is the delivery. Specifically the transport, reattachment, and subsequent cloning of the zoneflashes on a new system. 18B is now asleep and I really don't want to disturb him, so I'll do this part myself. I'll boot my
laptop into another boot environment - built from the same media using the same Live Upgrade method as the boot environment
that created the zones.
We begin by mounting the removable media (USB memory stick) that contains the zoneflash. Do take a look around, it is quite likely that our friend volfs has already done this for us. Remember - if we were using a flasharchive to deliver the zoneflash this step would be unnecessary.
# mkdir /flash
# mount /dev/dsk/c2t0d0s2 /flash (we used rmformat -l to derive the device name)
Now that our zoneflashes have arrived, time to reattach them. The first step is to create zone configurations. If you recall, these were stored in the zoneroot when they were detached. The zonecfg command create -a is used to
retrieve the stored configuration information and adapt it to the new system - specifically the new location
of the zoneroot. Once configured we use zoneadm attach to reconnect them.
The sequence to reattach our default zone, now called flashdefault, would look something like this.
# zonecfg -z flashdefault
flashdefault: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:flashdefault> create -a /flash/default
zonecfg:flashdefault> commit
zonecfg:flashdefault> exit
# zoneadm -z flashdefault attach
We'll be a little more clever attaching the other three zones.
# for zone in webmin web mysql
do
echo "create -a /flash/${zone}" | zonecfg -z flash${zone}
zoneadm -z flash${zone} attach
done
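If an attach complains, the first thing to check is whether the zoneroot on the media really carries its detach information. A small sketch for that, assuming zoneadm detach left a SUNWdetached.xml file at the top of each zoneroot (that is the file name I expect on Solaris 10 11/06, so verify it on your release). The function name check_detached is hypothetical.

```shell
# Hypothetical pre-flight check: report which zoneroots under a parent
# directory contain a detach manifest.
#   $1 = parent directory (for example /flash)
#   remaining arguments = zoneroot directory names
check_detached() {
    parent=$1; shift
    rc=0
    for zone in "$@"; do
        if [ -f "$parent/$zone/SUNWdetached.xml" ]; then
            echo "$zone: ready to attach"
        else
            echo "$zone: no SUNWdetached.xml under $parent/$zone"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_detached /flash default webmin web mysql
```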
At this point our zoneroots are still on the USB memory device - but don't worry, these zones will never be booted. Their only purpose is to deliver preconfigured zones. We will use zone cloning to create our real application zones.
Which we will now do. It is very convenient to use the flashzone as a template for our new zone in case there were some special attributes like limitpriv that we might want to preserve. We will also need to add items that were not present in the zoneflashes - specifically networking and local file systems. Once we are satisfied with the zone configurations we
will clone the zoneflash. If we are only building one of each type of zone we can detach the zoneflash so that other
administrators can use it on their systems.
Let's do this for the mysql zone.
# zonecfg -z mysql
mysql: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:mysql> create -t flashmysql
zonecfg:mysql> set zonepath=/zones/mysql
zonecfg:mysql> add net; set physical=e1000g0; set address=192.168.100.102/24; end
zonecfg:mysql> add fs; set dir=/export; set special=/export; set options=[rw,nosuid,nodevices]; set type=lofs; end
zonecfg:mysql> commit
zonecfg:mysql> exit
# zoneadm -z mysql clone flashmysql
Copying /flash/mysql...
# zoneadm -z flashmysql detach
# echo "name_service=NONE" > /zones/mysql/root/etc/sysidcfg
# echo "nfs4_domain=dynamic" >> /zones/mysql/root/etc/sysidcfg
# echo "security_policy=NONE" >> /zones/mysql/root/etc/sysidcfg
# echo "root_password=xxxxxxxxxxx" >> /zones/mysql/root/etc/sysidcfg
# echo "system_locale=C" >> /zones/mysql/root/etc/sysidcfg
# echo "network_interface=NONE {hostname=mysql}" >> /zones/mysql/root/etc/sysidcfg
# echo "terminal=ansi" >> /zones/mysql/root/etc/sysidcfg
# echo "timezone=US/Central" >> /zones/mysql/root/etc/sysidcfg
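If you plan to stamp out several zones from the same zoneflash, the zonecfg dialogue above can be generated rather than typed. A minimal sketch: zone_from_template is a made-up name, the four parameters are simply the values that changed in my example, and the lofs file system is left out to keep it short. The output is meant to be fed to zonecfg with its -f command file option.

```shell
# Hypothetical generator: print zonecfg subcommands for a zone built
# from a zoneflash template.
#   $1 = template zone (for example flashmysql)
#   $2 = zonepath for the new zone
#   $3 = physical network interface
#   $4 = address/prefix
zone_from_template() {
    cat <<EOF
create -t $1
set zonepath=$2
add net
set physical=$3
set address=$4
end
commit
EOF
}

# Example (not run here):
#   zone_from_template flashmysql /zones/mysql e1000g0 192.168.100.102/24 > /tmp/mysql.cfg
#   zonecfg -z mysql -f /tmp/mysql.cfg
#   zoneadm -z mysql clone flashmysql
```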
And for the finale - boot the newly flashed mysql zone and you should see an enabled and operating mysql service.
# zoneadm -z mysql boot
# zlogin -C mysql
[Connected to zone 'mysql' console]
Hostname: mysql
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Mar 20 06:15:44 mysql sendmail[1719]: My unqualified host name (mysql) unknown; sleeping for retry
Mar 20 06:15:44 mysql sendmail[1722]: My unqualified host name (mysql) unknown; sleeping for retry
mysql console login: root
Password:
Last login: Mon Mar 19 17:10:10 on console
Mar 20 06:15:49 mysql login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.11 snv_57 October 2007
#
# svcs mysql
STATE STIME FMRI
online 6:31:28 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 8 Threads: 1 Questions: 1 Slow queries: 0 Opens: 6 Flush tables: 1 Open tables: 0 Queries per second avg: 0.125
How cool is that ? Not only did we clone the zone, but since the database is in /var, it was cloned as well. Perhaps not practical for every situation, but still pretty cool.
I will leave the flashing of defau