How Cool is That ?
Bob Netherton's Weblog
Thursday May 21, 2009
Getting Rid of Pesky Live Upgrade Boot Environments
As we discussed earlier, Live Upgrade can solve most of the problems associated with patching and upgrading your Solaris system. I'm not quite ready to post the next installment in the LU series yet, but from some of the comments and email I have received, there are two problems that I would like to help you work around.

Oh where oh where did that file system go ?

One thing you can do to stop Live Upgrade in its tracks is to remove a file system that it thinks another boot environment needs. This does fall into the category of user error, but you are more likely to run into this in a ZFS world where file systems can be created and destroyed with great ease. You will also run into a variant of this if you change your zone configurations without recreating your boot environment, but I'll save that for a later day.

Here is our simple test case:
  1. Create a ZFS file system.
  2. Create a new boot environment.
  3. Delete the ZFS file system.
  4. Watch Live Upgrade fail.

# zfs create arrakis/temp

# lucreate -n test
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment <s10u7-baseline> file systems with the
file system(s) you specified for the new boot environment. Determining
which file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <test>.
Source boot environment is <s10u7-baseline>.
Creating boot environment <test>.
Cloning file systems from boot environment <s10u7-baseline> to create boot environment <test>.
Creating snapshot for <rpool/ROOT/s10u7-baseline> on <rpool/ROOT/s10u7-baseline@test>.
Creating clone for <rpool/ROOT/s10u7-baseline@test> on <rpool/ROOT/test>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/test>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <s10u6_baseline> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <test> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <nv114> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <route66> as <mount-point>//boot/grub/menu.lst.prev.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <nv95> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <test> in GRUB menu
Population of boot environment <test> successful.
Creation of boot environment <test> successful.

# zfs destroy arrakis/temp

# luupgrade -t -s /export/patches/10_x86_Recommended-2009-05-14  -O "-d" -n test
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
Validating the contents of the media </export/patches/10_x86_Recommended-2009-05-14>.
The media contains 143 software patches that can be added.
All 143 patches will be added because you did not specify any specific patches to add.
Mounting the BE <test>.
ERROR: Read-only file system: cannot create mount point </.alt.tmp.b-59c.mnt/arrakis/temp>
ERROR: failed to create mount point </.alt.tmp.b-59c.mnt/arrakis/temp> for file system </arrakis/temp>
ERROR: unmounting partially mounted boot environment file systems
ERROR: cannot mount boot environment by icf file </etc/lu/ICF.5>
ERROR: Unable to mount ABE <test>: cannot complete lumk_iconf
Adding patches to the BE <test>.
Validating patches...

Loading patches installed on the system...

Cannot check name /a/var/sadm/pkg.
Unmounting the BE <test>.
The patch add to the BE <test> failed (with result code <1>).
The proper Live Upgrade solution to this problem would be to destroy and recreate the boot environment, or just recreate the missing file system (I'm sure that most of you have figured the latter part out on your own). The rationale is that the alternate boot environment no longer matches the storage configuration of its source. This was fine in a UFS world, but perhaps a bit constraining when ZFS rules the landscape. What if you really wanted the file system to be gone forever?

With a little more understanding of the internals of Live Upgrade, we can fix this rather easily.

Important note: We are about to modify undocumented Live Upgrade configuration files. The formats, names, and contents are subject to change without notice and any errors made while doing this can render your Live Upgrade configuration unusable.

The file system configurations for each boot environment are kept in a set of Internal Configuration Files (ICF) in /etc/lu named ICF.n, where n is the boot environment number. From the error message above we see that /etc/lu/ICF.5 is the one that is causing the problem. Let's take a look.
# cat /etc/lu/ICF.5
test:-:/dev/dsk/c5d0s1:swap:4225095
test:-:/dev/zvol/dsk/rpool/swap:swap:8435712
test:/:rpool/ROOT/test:zfs:0
test:/archives:/dev/dsk/c1t0d0s2:ufs:327645675
test:/arrakis:arrakis:zfs:0
test:/arrakis/misc:arrakis/misc:zfs:0
test:/arrakis/misc2:arrakis/misc2:zfs:0
test:/arrakis/stuff:arrakis/stuff:zfs:0

test:/arrakis/temp:arrakis/temp:zfs:0

test:/audio:arrakis/audio:zfs:0
test:/backups:arrakis/backups:zfs:0
test:/export:arrakis/export:zfs:0
test:/export/home:arrakis/home:zfs:0
test:/export/iso:arrakis/iso:zfs:0
test:/export/linux:arrakis/linux:zfs:0
test:/rpool:rpool:zfs:0
test:/rpool/ROOT:rpool/ROOT:zfs:0
test:/usr/local:arrakis/local:zfs:0
test:/vbox:arrakis/vbox:zfs:0
test:/vbox/fedora8:arrakis/vbox/fedora8:zfs:0
test:/video:arrakis/video:zfs:0
test:/workshop:arrakis/workshop:zfs:0
test:/xp:/dev/dsk/c2d0s7:ufs:70396830
test:/xvm:arrakis/xvm:zfs:0
test:/xvm/fedora8:arrakis/xvm/fedora8:zfs:0
test:/xvm/newfs:arrakis/xvm/newfs:zfs:0
test:/xvm/nv113:arrakis/xvm/nv113:zfs:0
test:/xvm/opensolaris:arrakis/xvm/opensolaris:zfs:0
test:/xvm/s10u5:arrakis/xvm/s10u5:zfs:0
test:/xvm/ub710:arrakis/xvm/ub710:zfs:0
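Judging from the listing above, each ICF line appears to carry five colon-separated fields: boot environment name, mount point, device or dataset, file system type and size. That layout is my inference from this one file, not something documented, so treat the following as a rough sketch. The helper name icf_stale_zfs is made up for this example. It prints every zfs entry in an ICF file whose dataset does not appear in a list of existing datasets, such as the saved output of zfs list -H -o name.

```shell
# icf_stale_zfs ICF_FILE DATASET_LIST
# Print the zfs entries in ICF_FILE whose dataset is missing from DATASET_LIST.
# DATASET_LIST holds one dataset name per line, for example the saved
# output of: zfs list -H -o name
# Assumed (undocumented) ICF layout: be:mountpoint:device:fstype:size
icf_stale_zfs() {
    awk -F: '
        FILENAME == ARGV[1] { exists[$0] = 1; next }   # first file: known datasets
        $4 == "zfs" && !($3 in exists) { print }       # second file: ICF entries
    ' "$2" "$1"
}
```

Something like zfs list -H -o name > /tmp/datasets followed by icf_stale_zfs /etc/lu/ICF.5 /tmp/datasets should then point at the arrakis/temp entry without having to read the whole file by eye.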
The first step is to clean up the mess left by the failing luupgrade attempt. At the very least we will need to unmount the alternate boot environment root. It is also very likely that we will have to unmount a few temporary directories, such as /tmp and /var/run. Since this is ZFS we will also have to remove the directories created when these file systems were mounted.
# df -k | tail -3
rpool/ROOT/test      49545216 6879597 7546183    48%    /.alt.tmp.b-Fx.mnt
swap                 4695136       0 4695136     0%    /a/var/run
swap                 4695136       0 4695136     0%    /a/tmp

# luumount test
# umount /a/var/run
# umount /a/tmp
# rmdir /a/var/run /a/var /a/tmp
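If you would rather not pick the leftover mounts out of the df output by eye, here is a small sketch that does it for you. The function name leftover_mounts is hypothetical. It assumes the mount point is the last column of df -k output and lists the longest paths first, so that nested mounts show up before their parents.

```shell
# leftover_mounts DIR
# Read `df -k` output on stdin and print the mount points at or below DIR,
# longest path first, so nested mounts are listed before their parents.
leftover_mounts() {
    awk -v dir="$1" '
        NR > 1 {                            # skip the df header line
            mp = $NF                        # mount point is the last column
            if (mp == dir || index(mp, dir "/") == 1)
                print length(mp), mp
        }
    ' | sort -rn | cut -d' ' -f2-
}
```

Running df -k | leftover_mounts /a only prints the candidates. I would still unmount them by hand, one at a time, after checking that nothing else is using them.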

Next we need to remove the missing file system entry from the current copy of the ICF file. Use whatever method you prefer (vi, perl, grep). Once we have corrected our local copy of the ICF file we must propagate it to the alternate boot environment we are about to patch. You can skip the propagation if you are going to delete the boot environment without doing any other maintenance activities. The normal Live Upgrade operations will take care of propagating the ICF files to the other boot environments, so we should not have to worry about them at this time.
# mv /etc/lu/ICF.5 /tmp/ICF.5
# grep -v arrakis/temp /tmp/ICF.5 > /etc/lu/ICF.5 
# cp /etc/lu/ICF.5 `lumount test`/etc/lu/ICF.5
# luumount test
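One caution about the grep -v above: it drops every line that contains the string arrakis/temp, which would also catch a dataset such as arrakis/temp2. A slightly more careful sketch compares the whole third field instead. The function name icf_drop_dataset is made up, and the field position is the same guess about the ICF layout as before. Like the mv above, it keeps the backup copy in /tmp (or TMPDIR) rather than in /etc/lu.

```shell
# icf_drop_dataset ICF_FILE DATASET
# Rewrite ICF_FILE without the entry whose third field is exactly DATASET.
# A copy of the original is kept outside /etc/lu, in ${TMPDIR:-/tmp}.
# Assumed (undocumented) ICF layout: be:mountpoint:device:fstype:size
icf_drop_dataset() {
    backup="${TMPDIR:-/tmp}/$(basename "$1").bak"
    cp -p "$1" "$backup" || return 1
    awk -F: -v ds="$2" '$3 != ds' "$backup" > "$1"
}
```

With this in place, icf_drop_dataset /etc/lu/ICF.5 arrakis/temp would stand in for the mv and grep steps. The copy into the alternate boot environment is still needed afterwards.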
At this point we should be good to go. Let's try the luupgrade again.
# luupgrade -t -n test -O "-d" -s /export/patches/10_x86_Recommended-2009-05-14
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
Validating the contents of the media </export/patches/10_x86_Recommended-2009-05-14>.
The media contains 143 software patches that can be added.
All 143 patches will be added because you did not specify any specific patches to add.
Mounting the BE <test>.
Adding patches to the BE <test>.
Validating patches...

Loading patches installed on the system...

Done!

Loading patches requested to install.

Approved patches will be installed in this order:

118668-19 118669-19 119214-19 123591-10 123896-10 125556-03 139100-02


Checking installed patches...
Verifying sufficient filesystem capacity (dry run method)...
Installing patch packages...

Patch 118668-19 has been successfully installed.
Patch 118669-19 has been successfully installed.
Patch 119214-19 has been successfully installed.
Patch 123591-10 has been successfully installed.
Patch 123896-10 has been successfully installed.
Patch 125556-03 has been successfully installed.
Patch 139100-02 has been successfully installed.

Unmounting the BE <test>.
The patch add to the BE <test> completed.
Now that the alternate boot environment has been patched, we can activate it at our convenience.

I keep deleting and deleting and still can't get rid of those pesky boot environments

This is an interesting corner case where the Live Upgrade configuration files get so scrambled that even simple tasks like deleting a boot environment are not possible. Every time I have gotten myself into this situation I can trace it back to some ill-advised shortcut that seemed harmless at the time, but I won't rule out bugs and environment as possible causes.

Here is our simple test case: turn our boot environment from the previous example into a zombie - something that is neither alive nor dead but just takes up space and causes a mild annoyance.

Important note: Don't try this on a production system. This is for demonstration purposes only.
# dd if=/dev/random of=/etc/lu/ICF.5 bs=2048 count=2
0+2 records in
0+2 records out

# ludelete -f test
System has findroot enabled GRUB
No entry for BE <test> in GRUB menu
ERROR: The mount point </.alt.tmp.b-fxc.mnt> is not a valid ABE mount point (no /etc directory found).
ERROR: The mount point </.alt.tmp.b-fxc.mnt> provided by the <-m> option is not a valid ABE mount point.
Usage: lurootspec [-l error_log] [-o outfile] [-m mntpt]
ERROR: Cannot determine root specification for BE <test>.
ERROR: boot environment <test> is not mounted
Unable to delete boot environment.
Our first task is to make sure that any partially mounted boot environment is cleaned up. A df should help us here.
# df -k | tail -5
arrakis/xvm/opensolaris 350945280      19 17448377     1%    /xvm/opensolaris
arrakis/xvm/s10u5    350945280      19 17448377     1%    /xvm/s10u5
arrakis/xvm/ub710    350945280      19 17448377     1%    /xvm/ub710
swap                 4549680       0 4549680     0%    /.alt.tmp.b-fxc.mnt/var/run
swap                 4549680       0 4549680     0%    /.alt.tmp.b-fxc.mnt/tmp


# umount /.alt.tmp.b-fxc.mnt/tmp
# umount /.alt.tmp.b-fxc.mnt/var/run
Ordinarily you would use lufslist(1M) to try to determine which file systems are in use by the boot environment you are trying to delete. In this worst-case scenario that is not possible. A bit of forensic investigation and a bit more courage will help us figure this out.

The first place we will look is /etc/lutab. This is the configuration file that lists all boot environments known to Live Upgrade. There is a man page for this in section 4, so it is somewhat of a public interface, but please take note of the warning:
 
        The lutab file must not be edited by hand. Any user  modifi-
        cation  to  this file will result in the incorrect operation
        of the Live Upgrade feature.
This is very good advice and failing to follow it has led to some of my most spectacular Live Upgrade meltdowns. But in this case Live Upgrade is already broken and it may be possible to undo the damage and restore proper operation. So let's see what we can find out.
# cat /etc/lutab
# DO NOT EDIT THIS FILE BY HAND. This file is not a public interface.
# The format and contents of this file are subject to change.
# Any user modification to this file may result in the incorrect
# operation of Live Upgrade.
3:s10u5_baseline:C:0
3:/:/dev/dsk/c2d0s0:1
3:boot-device:/dev/dsk/c2d0s0:2
1:s10u5_lu:C:0
1:/:/dev/dsk/c5d0s0:1
1:boot-device:/dev/dsk/c5d0s0:2
2:s10u6_ufs:C:0
2:/:/dev/dsk/c4d0s0:1
2:boot-device:/dev/dsk/c4d0s0:2
4:s10u6_baseline:C:0
4:/:rpool/ROOT/s10u6_baseline:1
4:boot-device:/dev/dsk/c4d0s3:2
10:route66:C:0
10:/:rpool/ROOT/route66:1
10:boot-device:/dev/dsk/c4d0s3:2
11:nv95:C:0
11:/:rpool/ROOT/nv95:1
11:boot-device:/dev/dsk/c4d0s3:2
6:s10u7-baseline:C:0
6:/:rpool/ROOT/s10u7-baseline:1
6:boot-device:/dev/dsk/c4d0s3:2
7:nv114:C:0
7:/:rpool/ROOT/nv114:1
7:boot-device:/dev/dsk/c4d0s3:2
5:test:C:0
5:/:rpool/ROOT/test:1
5:boot-device:/dev/dsk/c4d0s3:2
We can see that the boot environment named test is (still) BE #5 and has its root file system at rpool/ROOT/test. This is the default dataset name and indicates that the boot environment has not been renamed. Consider the following example for a more complicated configuration.
# lucreate -n scooby
# lufslist scooby | grep ROOT
rpool/ROOT/scooby       zfs            241152 /                   -
rpool/ROOT              zfs       39284664832 /rpool/ROOT         -

# lurename -e scooby -n doo
# lufslist doo | grep ROOT
rpool/ROOT/scooby       zfs            241152 /                   -
rpool/ROOT              zfs       39284664832 /rpool/ROOT         -
The point is that we have to trust the contents of /etc/lutab, but it does not hurt to do a bit of sanity checking before we start deleting ZFS datasets. To remove boot environment test from the view of Live Upgrade, delete the three lines in /etc/lutab starting with 5 (in this example). We should also remove its Internal Configuration File (ICF), /etc/lu/ICF.5.
# mv -f /etc/lutab /etc/lutab.old
# grep -v '^5:' /etc/lutab.old > /etc/lutab
# rm -f /etc/lu/ICF.5

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
s10u5_baseline             yes      no     no        yes    -         
s10u5_lu                   yes      no     no        yes    -         
s10u6_ufs                  yes      no     no        yes    -         
s10u6_baseline             yes      no     no        yes    -         
route66                    yes      no     no        yes    -         
nv95                       yes      yes    yes       no     -         
s10u7-baseline             yes      no     no        yes    -         
nv114                      yes      no     no        yes    -         
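For the sanity checking mentioned above, it can help to look up the boot environment number and root dataset by name instead of reading them off the listing. Here is a sketch. The three function names are made up, and the lutab layout (a header line id:name:C:0 followed by id:/:root:1 and id:boot-device:device:2) is inferred from the cat output, not taken from the man page. Nothing here modifies /etc/lutab. The last function only prints a filtered copy.

```shell
# lutab_be_id LUTAB NAME
# Print the numeric id of boot environment NAME (header line: id:name:C:0).
lutab_be_id() {
    awk -F: -v name="$2" '!/^#/ && $2 == name { print $1; exit }' "$1"
}

# lutab_be_root LUTAB ID
# Print the root device or dataset recorded for boot environment ID.
lutab_be_root() {
    awk -F: -v id="$2" '!/^#/ && $1 == id && $2 == "/" { print $3; exit }' "$1"
}

# lutab_without_be LUTAB ID
# Print LUTAB with every line belonging to boot environment ID left out.
# Comment lines are passed through untouched.
lutab_without_be() {
    awk -F: -v id="$2" '/^#/ || $1 != id' "$1"
}
```

In this example lutab_be_id /etc/lutab test should report 5 and lutab_be_root /etc/lutab 5 should report rpool/ROOT/test. Comparing the output of lutab_without_be against the original with diff is a cheap check before replacing anything.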
If the boot environment being deleted is in UFS then we are done. Well, not exactly - but pretty close. We still need to propagate the updated configuration files to the remaining boot environments. This will be done during the next Live Upgrade operation (lucreate, lumake, ludelete, luactivate) and I would recommend that you let Live Upgrade handle this part. The exception to this will be if you boot directly into another boot environment without activating it first. This isn't a recommended practice and has been the source of some of my most frustrating mistakes.

If the exorcised boot environment is in ZFS then we still have a little bit of work to do. We need to delete the old root datasets and any snapshots that they may have been cloned from. In our example the root dataset was rpool/ROOT/test. We need to look for any children as well as the originating snapshot, if present.
# zfs list -r rpool/ROOT/test
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/test       234K  6.47G  8.79G  /.alt.test
rpool/ROOT/test/var    18K  6.47G    18K  /.alt.test/var

# zfs get -r origin rpool/ROOT/test
NAME             PROPERTY  VALUE                 SOURCE
rpool/ROOT/test  origin    rpool/ROOT/nv95@test  -
rpool/ROOT/test/var  origin    rpool/ROOT/nv95/var@test  -

# zfs destroy rpool/ROOT/test/var
# zfs destroy rpool/ROOT/nv95/var@test
# zfs destroy rpool/ROOT/test
# zfs destroy rpool/ROOT/nv95@test
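The order of the destroys above matters: children go before parents, and each clone goes before the snapshot it came from. Here is a sketch that works out that order from the origin properties and prints the commands without running them. The function name destroy_plan is hypothetical. It expects the tab-separated output of zfs get -rH -o name,value origin on stdin and ignores any snapshots of the boot environment itself.

```shell
# destroy_plan
# Read "dataset<TAB>origin" lines on stdin, as printed by
#   zfs get -rH -o name,value origin rpool/ROOT/test
# and print (not run) zfs destroy commands: deepest dataset first, each
# followed by its origin snapshot when it has one ("-" means no origin).
destroy_plan() {
    awk -F'\t' '
        index($1, "@") == 0 {                  # ignore snapshot lines
            depth = split($1, parts, "/")      # number of path components
            print depth "\t" $1 "\t" $2
        }
    ' | sort -t "$(printf '\t')" -k1,1nr | awk -F'\t' '
        { print "zfs destroy " $2 }
        $3 != "-" && $3 != "" { print "zfs destroy " $3 }
    '
}
```

For the datasets in this example it should print the same four commands shown above, in the same order. If an origin snapshot still has other clones, zfs destroy will refuse to remove it, which is the behavior you want.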
Important note: luactivate will promote the newly activated root dataset so that snapshots used to create alternate boot environments should be easy to delete. If you are switching between boot environments without activating them first (which I have already warned you about doing), you may have to manually promote a different dataset so that the snapshots can be deleted.

To BE or not to BE - how about no BE ?

You may find yourself in a situation where you have things so scrambled up that you want to start all over again. We can use what we have just learned to unwind Live Upgrade and start from a clean configuration. Specifically we want to delete /etc/lutab, the ICF and related files, all of the temporary files in /etc/lu/tmp and a few files that hold environment variables for some of the lu scripts. And if using ZFS we will also have to delete any datasets and snapshots that are no longer needed.
 
# rm -f /etc/lutab 
# rm -f /etc/lu/ICF.* /etc/lu/INODE.* /etc/lu/vtoc.*
# rm -f /etc/lu/.??*
# rm -f /etc/lu/tmp/* 

# lustatus
ERROR: No boot environments are configured on this system
ERROR: cannot determine list of all boot environment names

# lucreate -c scooby -n doo
Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <scooby>.
Creating initial configuration for primary boot environment <scooby>.
The device </dev/dsk/c4d0s3> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <scooby> PBE Boot Device </dev/dsk/c4d0s3>.
Comparing source boot environment <scooby> file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <doo>.
Source boot environment is <scooby>.
Creating boot environment <doo>.
Cloning file systems from boot environment <scooby> to create boot environment <doo>.
Creating snapshot for <rpool/ROOT/scooby> on <rpool/ROOT/scooby@doo>.
Creating clone for <rpool/ROOT/scooby@doo> on <rpool/ROOT/doo>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/doo>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <doo> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <doo> in GRUB menu
Population of boot environment <doo> successful.
Creation of boot environment <doo> successful.

# luactivate doo
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <scooby>

File  deletion successful
File  deletion successful
File  deletion successful
Activation of boot environment <doo> successful.

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
scooby                     yes      yes    no        no     -         
doo                        yes      no     yes       no     -        
Pretty cool, eh ?
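Since the rm commands in this reset are not reversible, it may be worth listing (and perhaps archiving) the files first. Here is a sketch with a made-up function name, lu_state_files. It expands the same patterns used above and prints only the regular files that actually exist. The optional argument is a directory prefix, which I added so it can be tried against a scratch copy instead of the live /etc.

```shell
# lu_state_files [ROOT]
# List the Live Upgrade state files matched by the rm patterns above.
# ROOT is an optional directory prefix (empty for the running system).
lu_state_files() {
    root="${1:-}"
    for f in "$root"/etc/lutab \
             "$root"/etc/lu/ICF.* "$root"/etc/lu/INODE.* "$root"/etc/lu/vtoc.* \
             "$root"/etc/lu/.??* \
             "$root"/etc/lu/tmp/*
    do
        # Unmatched patterns stay literal, so keep only files that exist.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}
```

One way to use it would be tar cf /var/tmp/lu-state.tar `lu_state_files` before running the rm commands. The archive path is only an example.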

There are still a few more interesting corner cases, but we will deal with those in one of the next articles. In the meantime, please remember to

May 21 2009, 11:58:42 AM CDT