Wednesday Sep 09, 2009

I am delighted to announce that Casper Dik's "Turbo-Charging SVR4 Package Install" feature enhancement is now available by downloading and installing the following patches:

119254-70 (SPARC) / 119255-70 (x86): Patch utilites patch
121428-13 (SPARC) / 121429-13 (x86): Live Upgrade Zones Support Patch
121430-40 (SPARC) / 121431-41 (x86): Live Upgrade Patch
124630-28 (SPARC) / 124631-29 (x86): System Administration Applications, Network, and Core Libraries Patch

It is important to apply 119254-70 (SPARC) / 119255-70 (x86) and 121428-13 (SPARC) / 121428-13 (x86) if the system is running non-global zones.  Otherwise booting newly installed zones will fail until the pkgserv daemon exits, about 5 minutes after zoneadm install finished.  Zones which were already installed can be booted as expected.

"Turbo Packaging" is designed to dramatically improve the performance of install, upgrade, Live Upgrade, and zone creation on Solaris 10.  It delivers only a small improvement for patching performance.   (See my Zones Parallel Patching blog entry for information on dramatic patching performance improvements.)

For background reading on the "Turbo Packaging" feature, please see
http://www.opensolaris.org/jive/thread.jspa?messageID=358081 and http://arc.opensolaris.org/caselog/PSARC/2009/173/

The "Turbo-charging SVR4 Package Install" feature will also be included in the forthcoming Solaris 10 Update 8 release and will be documented in the Release Notes for that release.

Great work, Casper - well done!

Friday Aug 14, 2009

My colleague, Ed Clark, has made significant improvements to the Solaris 10 Recommended and Sun Alert patch clusters.  These improvements have just been released and are in the current clusters available to contract customers from the Patch Cluster & Patch Bundle Downloads on SunSolve.

Ed's improvements include:

  • Filtering out "false negatives" from the patch utility return codes, so that if the cluster install script returns "1", you know you've got a real problem which needs investigating.   As you may know, the Solaris patch utility, 'patchadd', can return errors for some acceptable situations - for example, if the patch is already applied to the system, or a later revision of the patch or a patch which obsoletes it is already applied to the system, or none of the packages in the patch are on the target system (e.g. because a reduced Install Metacluster was used to install it or the system has been security hardened by package removal), etc.   Such conditions are acceptable "errors" which do not usually require further investigation by the user.  By filtering these conditions out, if the 'installcluster' script returns "1", you know it isn't because of one of these acceptable "errors", and therefore you need to look at the logfiles to find out what's gone wrong.  For further information, please see the cluster README and Analyzing a patchadd or patchrm Failure in the Solaris OS.
  • The new 'installcluster' script will exit as soon as it encounters an unexpected failure - i.e. not one of the acceptable "errors" mentioned above.  This prevents potentially compounding issues by attempting to apply further patches.
  • The new 'installcluster' script includes context intelligence for patching operations.   It informs the user when zones need to be halted, and it provides phased installation to handle patches which absolutely require an immediate reboot before further patches can be applied.  Such interim reboots are only needed when patching a live boot environment on a system below Kernel patch 118833-36 (SPARC) / 118855-36 (x86) and well as the earlier interim reboot required on x86 related to 'libc.so' patches and Kernel patch 118844-14.  On systems below these patch levels, the 'installcluster' will stop at the appropriate point when patching the live boot environment, and inform the user to reboot and re-invoke the 'installcluster' script.  (In the old cluster install script, it simply tried to carry on blindly past such interim reboots, spewing out error messages, although code in the relevant patches prevented any harm from being done).  These interim reboots, when required, are dealt with relatively early in the cluster install sequence so that once completed, the Sys Admin can leave the rest of the installation to finish unattended and move onto other systems.
  • The new 'installcluster' script provides better integration with Solaris Live Upgrade as the user can now specify the Live Upgrade alternate boot environment to patch by name.
  • The new 'installcluster' script performs space checking prior to installing each patch, and will halt if it believes there is insufficient space to complete the installation successfully.  For example, this helps avoid non-global zones getting out of sync regarding patch levels with respect to the global zone.  This is an important enhancement as running out of space during patching can potentially leave the system in an inconsistent state and is to be avoided.  Even removing a patch requires space, so immediate removal of a patch which has failed to apply correctly due to space issues should be avoided until sufficient space is freed up and potential issues caused by its partial installation investigated - for example, was the undo.Z file successfully created to enable backout ? (Tip: It may be better to retry the patch installation once space has been freed up rather than patch removal in such circumstances.  Contact Sun Support for instructions if you encounter such issues.).   The space checking enhancements in the 'installcluster' script are designed to prevent such problems occurring.
  • The messages and log files produced by the 'installcluster' script are clear and well structured.  For example, a "failed" log is created if a patch fails to apply.  See the Cluster README for further information.
  • The 'patch_order' places patches in an optimal order for installation to avoid known issues - for example, the patch utilities patches are installed as early in the sequence as possible to avoid hitting patch installation bugs which are fixed in the patch utility patches, and the Kernel patch procedural script override patch, 125555 (SPARC) / 125556 (x86), is ordered prior to 137137-09 (SPARC) / 137138-09 (x86) to resolve some known issues.  When patching an alternate boot environment (which is recommended), a small sub-set of pre-requisite patches, primarily the patch utility patches, need to be applied to the live boot environment to ensure correct patching operation.  The 'installcluster' script will check for these pre-requisite patches are halt installation if they are not present, advising the user of the 'installcluster' script option to use to install these pre-requisite patches.   Further patches may need to be installed on the live boot environment to support Live Upgrade.  See the cluster README for further information.
  • The patches have been moved to a 'patches' sub-directory, to de-clutter the top level directory of the unzipped cluster.
  • Please see the cluster README file for further information.  Customers should read the cluster README file and look at the Special Install Instructions in the patches within the cluster prior to installation.

I really want to thank Ed Clark for the enormous amount of thought and effort he has put into improving the cluster installation experience.   The work he's done on the Solaris 10 Recommended and Sun Alert patch cluster is a continuation of his previous work on the Solaris Update Patch Bundles and the Solaris 10 Live Upgrade Zones Starter Patch Bundle.  Nice work, Ed!

While the 'installcluster' script is copyrighted, I am happy for customers to use it, and the 'patch_order' file, as a starting point for their own customized patch bundles, so long as it is for their own use and is not to be given to a 3rd party or used for commercial gain (e.g. by a 3rd party maintainer or 3rd party commercial automation tool).

We have also made significant improvements to the back end processes to ensure higher and more consistent cluster quality. 

Originally, the clusters were created by the Patch Operations and Distribution (POD) team after patch release.  The POD Cluster QA process left a lot to be desired, resulting in inconsistent cluster quality.   To plug this gap, my Patch System Test team have been testing the clusters for several years, but the old process only allowed us to test them in parallel with their release, which meant that we found issues at the same time that early downloaders of the cluster encountered them.  Although we ensured such issues were fixed as quickly as possible, it still obviously compromised our customers' experience.

In the new process, the clusters are routed to Patch System Test (PST) prior to release.  PST run a transformation script on them to optimize the patch installation order, etc.  The clusters will only be released once they have passed PST testing.  This should ensure higher and more consistent quality for customers.  Work is continuing to move the entire patch cluster generation process to PST, although these future backend enhancements in this regard should be invisible to customers.

Thursday Jun 25, 2009

I'd like to give you a heads-up on a couple of Kernel patch installation issues:

1. There was a bug (since fixed) in the Deferred Activation Patching functionality in a ZFS Root environment on x86 only.  See Sun Alert 263928.  An error message to the effect that a Class Action Script has failed to complete and failure to set up environment for Deferred Activation Patching may be seen.   The relevant CR is 6850329: "KU 139556-08 fails to apply on x86 systems that have ZFS root filesystems and corrupts the OS".    SPARC systems are similarly affected.  The following error message is returned:
mv: cannot rename /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to /lib/libc.so.1: Device busy
ERROR: Move of /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to dstActual failed
usage: puttext [-r rmarg] [-l lmarg] string
pkgadd: ERROR: class action script did not complete successfully

Installation of <SUNWcslr> failed.

This issue is fixed in patch in the Patch Utilities patch 119255-70 or later revision.

BTW: The principal reason ZFS Root support was implemented in Live Upgrade is so that patch application like this to the live boot environment would not be necessary.   With ZFS Root, creating a clone Boot Environment is so easy that there's no good reason not to.   This avoids the need to use technologies such as Deferred Activation Patching which attempt to make it safer to apply arbitrary change to a live boot environment, which is an inherently risky process.

2. There are reproducible issues using jumpstart finish scripts and other scenarios to install Kernel patch 137137-09 followed by Kernel patch 139555-08.   Here's the gist of the issue which I've pulled from an engineering email thread on the subject:

Issue 1: I have a customer whose system is not booting after applying the patch cluster with Live Upgrade (LU).

Solution 1: If using 'luupgrade -t', then you must ensure that latest version of LU patch is installed first, currently 121430-36 is currently the latest revision on SPARC, 121431-37 on x86. Once these patches are installed, LU will automatically handle the build of the boot archive when 'luactivate' is called, thus avoiding the problem.

Issue 2: There are other ways to get oneself into situations where a boot archive is out of sync: e.g. If using jumpstart finish scripts to apply patches that include 137137-09.  Basically any operation that involves patching to an ABE outside of 'luupgrade' will involve a manual build of boot-archive.

Solution 2: One must manually rebuild the boot-archive on the /a partition after applying the patches.  Otherwise once the system boots, the boot-archive will be out of sync.

Here's some more detail on the jumpstart finish script version of this: 

We've seen the same panic a few times when the latest patch cluster is applied via a finish script to a boot environment prior to  s10u6 via a jumpstart installation. It appears that the boot archive is out of sync with the kernel on the system. The boot archive was created from the 137137-09 patch and not updated after the 139555-08 kernel was applied, therefore the mismatch between the kernel and the boot archive.

In these instances updating the boot archive allows the system to boot successfully. Boot failsafe (ok boot -F failsafe) will detect an out of sync boot archive.  Execute the automated update then reboot.  This will now boot from the later kernel (139555-08) which successfully installed from the finish script.

I reproduced the problem in a jumpstart installation environment applying the latest 10_Recommended patch cluster from a finish script. The initial installation was S10U5 which is deployed from a miniroot that has no knowledge of a boot archive (my theory anyway).  This is similar to a live upgrade environment if the boot environment doing the patching is also boot archive unaware (meaning the kernel is pre 137137-09).

In the jumpstart scenario the immediate problem was solved by updating the boot archive by booting failsafe as previously described.  The Solution was to update the boot archive from the finish script after the patch cluster installation completed.  BTW, all patches in the patch cluster installed successfully per the /var/sadm/system/logs.finish.log.

In a standard jumpstart the boot device (install target) is mounted to /a, therefore adding the following entry to the finish script solved the problem:

/a/boot/solaris/bin/create_ramdisk -R /a

Depending on the finish script configuration, and variables the following would also work:

$ROOTDIR/boot/solaris/bin/create_ramdisk -R $ROOTDIR
Issue 3: This above issues are sometimes mis-diagnosed as CR 6850202: "bootadm fails to build bootarchive in certain configurations leading to unbootable system".

But CR 6850202 will only be encountered in very specific circumstances, all of which must occur in order to hit this specific bug, namely:

1. Install u6 SUNWCreq - there's  no mkisofs so we build ufs boot archive

2. Limit /tmp to 512M - thus forcing the ufs build to happen in /var/run

3. Have a separate /var - bootadm.c only lofs nosub mounts "/" when creating the alt root for DAP patching build of boot archive

4. Install 139555-08

You must have all 4 of above in order to hit this, i.e. step 4 must be installing a DAP patch such as a Kernel patch associated with a Solaris 10 Update such as 139555-08. 

Solution 3: Removing the 512MB limit (or whatever limit has been imposed) to /tmp in /etc/vfstab and/or adding SUNWmkcd (and probably SUNWmkcdS) so that mkisofs is available on the system is sufficient to avoid the code path that fails this way.

Booting failsafe and recreating the boot archive will successfully recreate the boot archive.

Here's further input from one of my senior engineers, Enda O'Connor:

If using Live Upgrade (LU), and LU on the live partition is up to date in terms of latest revision of the LU patch, 121430 (SPARC) and 121431 (x86), the boot-archive will be built automatically once users runs shutdown ( after luactivate to activate the new BE ).  This is done from a kill script in rcd.0.

If using a jumpstart finish script, or jumpstart profile to patch a pre-U6 image with latest kernel patches, then you need to run create_ramdisk from the finish script after all patching/packaging operations have been finished.  Alternatively, you can patch your pre-U6 miniroot to the U6 SPARC NewBoot level (137137-09), at which point the modified miniroot will handle the build of the boot_archive after the finish script has run.

If patching U6 and upwards from jumpstart, the boot_archive will get built automatically after finish script has run, so there's no issue in this scenario.

If using any home grown technology to patch or install/modify software on an Alternate Boot Environment ( ABE ), such as ufsrestore/cpio/tar for example, you must always run create_ramdisk manually before booting to said ABE.

Best Wishes,

Gerry.

Wednesday Jan 09, 2008

As mentioned previously, Solaris Live Upgrade can help minimize the downtime associated with patching, by enabling users to patch an inactive boot environment.  When all modifications have been made to the inactive boot environment, a single reboot is required to activate it.  Also, most Special Install Instructions specified in patch READMEs can be ignored when patching an inactive boot environment.

When patching a live boot environment, certain patches require system downtime in order to complete their installation.

Such requirements will be specified in the patch README file and is also specified by the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s).

As mentioned previously, the problem with patching a live boot environment (without Deferred Activation Patching) is that some objects delivered in a patch, such as shared objects, may be invoked immediately while other objects, such as genunix, will only be activated when the system is rebooted.  Also, processes already running in memory may be using the old versions of objects while processes started after the patch(es) were applied will be using the new versions of the same objects (when Deferred Activation Patching isn't specified).  In some cases this is harmless.  In other cases, the system may be in a potentially inconsistent state until it is rebooted.

Some patches require that they be installed in Single User Mode (run level S) when applied to a live boot environment.  This is to ensure that the system is in a quiesced state to avoid the above potential problems.  This will be specified in the patch README file.  Also, the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s) will contain the entry singleuser_required.

Some patches require that the system be rebooted at some convenient point after the patch is applied in order to activate its contents. Either a normal reboot or a reconfigure reboot may be required. This will be specified in the patch README file.  Also, the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s) will contain the entry rebootafter or reconfigafter.  For such patches, the system remains in a consistent state until the reboot takes place.  The reboot is simply to allow the changes supplied by the patch to be activated. If a reconfigure reboot is required, application of the patch will cause the creation of a /.reconfigure file which will result in a reconfigure reboot when the system is rebooted.

Some patches require that the system be rebooted immediately after the patch is applied to a live boot environment.  For such patches, the system is in a potentially inconsistent state until the reboot takes place.  Either a normal reboot or a reconfigure reboot may be required. This will be specified in the patch README file.  Also, the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s) will contain the entry rebootimmediate or reconfigimmediate.  If a reconfigure reboot is required, application of the patch will cause the creation of a /.reconfigure file which will result in a reconfigure reboot when the system is rebooted.  Normally, it is OK to apply further patches to the live boot environment before initiating the reboot.  However, normal operations must not be resumed until after the reboot is performed.  On the rare occasion where the reboot must be instigated before any further patches are applied, such as is the case with Solaris 10 Kernel patch 118833-36 (SPARC) / 118855-36 (x86), such patches will typically contain code to prevent further patches from being applied as an added safety mechanism.

This blog copyright 2009 by Gerry Haskins