Who loves to patch or upgrade a system ?
That's right, nobody. Or if
you do, perhaps we should start a local support group to help you come
to terms with this unusual fascination. Patching, and to a lesser extent
upgrades (which can be thought of as patches delivered more efficiently
through package replacement), is the most common complaint that I
hear when meeting with system administrators and their management.
Most of the difficulties seem to fit into one of the following
categories.
- Analysis: What patches need to be applied to my system ?
- Effort: What do I have to do to perform the required maintenance ?
- Outage: How long will the system be down to perform the maintenance ?
- Recovery: What happens when something goes wrong ?
And if a single system gives you a headache, adding a few containers into
the mix will bring on a full migraine. And without some relief you may be left
with the impression that containers aren't worth the effort. That's
unfortunate because containers don't have to be troublesome and patching doesn't
have to be hard. But it does take getting to know one of the most important
and sadly least used features in Solaris:
Live Upgrade
Before we look at Live Upgrade, let's start with a definition. A
boot environment is
the set of all file systems and devices that are unique to an instance of Solaris on a system.
If you have several boot environments then some data will be shared (applications not installed
as SVR4 packages, data, local home directories) and some will be
exclusive to one boot environment. Without making this more complicated than it needs to be,
a boot environment is generally your root (including /usr and /etc), /var (frequently split out on
a separate file system), and /opt. Swap may or may not be
a part of a boot environment - it is your choice. I prefer to share swap, but there are some
operational situations where this may not be feasible. There may be additional items, but
generally everything else is shared. Network mounted file systems and removable media are assumed
to be shared.
With this definition behind us, let's proceed.
Analysis: What patches need to be applied to my system ?
For all of the assistance that Live Upgrade offers,
it doesn't do anything to help with the analysis phase. Fortunately there
are plenty of tools that can help with this phase. Some of them work nicely
with Live Upgrade, others take a bit more effort.
smpatch(1M) has an analyze capability that can determine which patches need
to be applied to your system. It will get a list of patches from an update
server, most likely one at Sun, and match up the dependencies and requirements
with your system. smpatch can be used to download these patches for future
application or it can apply them for you. smpatch works nicely with Live Upgrade,
so from a single command you can upgrade an alternate boot environment. With containers!
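As a rough sketch of that single-command flow, the script below runs the analysis and then points smpatch at an alternate boot environment. The boot environment name s10-patched, the run helper and the DRY_RUN switch are placeholders of mine, and the alternate boot environment is assumed to exist already. Check smpatch(1M) on your release to confirm that the -b option is available before relying on it.

```shell
#!/bin/sh
# Sketch only: analyze, then patch an alternate boot environment (ABE).
# "s10-patched" is a placeholder BE name, not something from a real system.
DRY_RUN=${DRY_RUN:-1}          # default: print the commands, do not run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

patch_abe() {
    abe=$1
    run smpatch analyze             # list the patches this system is missing
    run smpatch update -b "$abe"    # download and apply them to the ABE
}

patch_abe s10-patched
```

With DRY_RUN left at its default the script only prints the two commands, which makes it a cheap way to review the plan before setting DRY_RUN=0.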
The
Sun Update Manager is a simple to use graphical front end for smpatch. It
gives you a little more flexibility during the inspection phase by allowing
you to look at individual patch README files. It is also much easier to see
what collection a patch belongs to (recommended, security, none) and if the
application of that patch will require a reboot. For all of that additional
flexibility you lose the integration with Live Upgrade. Not for lack of
trying, but I have not found a good way to make Update Manager and Live
Upgrade play together.
Sun xVM Ops Center has a much more sophisticated patch analysis system that
uses additional knowledge engines beyond those used by smpatch and Update
Manager. The result is a higher quality patch bundle tailored for each
individual system, automated deployment of the patch bundle, detailed auditing
of what was done and simple backout should problems occur. And it basically
does the same for Windows and Linux. It is this last feature that makes things
interesting. Neither Windows nor Linux has anything like Live Upgrade, and
the least common denominator approach of Ops Center in its current state means
that it doesn't work with Live Upgrade. Fortunately this will change in the
not too distant future, and when it does I will be shouting about this feature
from rooftops (OK, what I really mean is I'll post a blog and a tweet about
it). If I can coax Ops Center into doing the analysis and download pieces then
I can manually bolt it onto Live Upgrade for a best of both worlds solution.
These are our offerings and there are others. Some of them are quite good and in use
in many places.
Patch Check Advanced (PCA) is one of the more common tools in
use. It operates on a patch dependency cross reference file and does a good job with the
dependency analysis (this is obsoleted by that, etc). It can be used to maintain an
alternate boot environment and in simple cases that would be fine. If the alternate
boot environment contains any containers then I would use Live Upgrade's luupgrade
instead of PCA's patchadd -R approach. If I were already familiar with PCA then I would still
use it for the analysis and download features, and just let luupgrade apply the patches.
You might have to uncompress the patches downloaded by PCA before handing them over
to luupgrade, but that is a minor implementation detail.
In summary, use an analysis tool appropriate to the task (based on familiarity, budget
and complexity) to figure out what patches are needed. Then use Live Upgrade (luupgrade)
to deploy the desired patches.
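Here is one way the hand-off to luupgrade might look, assuming your analysis tool has already dropped the patches (as zip files) into a directory. The function name, the run helper, the DRY_RUN switch and the directory layout are all illustrative assumptions. Only the luupgrade -t -n ... -s ... invocation is the real Live Upgrade interface.

```shell
#!/bin/sh
# Sketch only: unpack downloaded patches and apply them to an ABE.
DRY_RUN=${DRY_RUN:-1}          # default: print the commands, do not run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_patches() {
    abe=$1
    dir=$2
    # Unpack any zip files that were downloaded into the patch directory.
    for z in "$dir"/*.zip; do
        if [ -f "$z" ]; then
            run unzip -q -o "$z" -d "$dir"
        fi
    done
    # Unpacked patches are directories named like NNNNNN-NN.
    ids=$(cd "$dir" 2>/dev/null && ls -d [0-9]*-[0-9][0-9] 2>/dev/null | tr '\n' ' ')
    if [ -z "$ids" ]; then
        echo "no unpacked patches found in $dir" >&2
        return 1
    fi
    # $ids is left unquoted on purpose so each patch ID is its own argument.
    run luupgrade -t -n "$abe" -s "$dir" $ids
}

# Example: apply_patches s10-patched /var/tmp/patches
```

In dry-run mode the unzip step is only printed, so the patch list reflects what is already unpacked. Run it once with DRY_RUN=0 when you are happy with the plan.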
Effort: What does it take to perform the required maintenance ?
This is a big topic and I could write pages on the subject. Even if I use an analysis
tool like smpatch or pca to save me hours of trolling through READMEs drawing dependency
graphs, there is still a lot of work to do in order to survive the ordeal of applying
patches. Some of the more common techniques include the following.
Backing up your boot environment.
I should not have to mention this, but there are some
operational considerations unique to system maintenance. Though the chance is tiny, you are
more likely to render your system non-bootable during system maintenance
than during any other operational task. Even with mature processes, human factors can come into
play and bad things can happen (oops - that was my fallback boot environment that I just
ran newfs(1M) on).
This is why automation and time-tested scripting become so important.
Should you do the unthinkable and render a system nonfunctional, rapid restoration of the
boot environment is important. And getting it back to the last known good state is just
as important. A fresh backup that can be restored by utilities from install media or
a JumpStart miniroot is a very good idea. A flash archive (see
flarcreate(1M)) is even better, although
complications with containers make this less interesting now than in previous
releases of Solaris. How many of you take a backup before applying patches ? Probably
about the same number as replace batteries in your RAID controllers or change out
your UPS systems after their expiration date.
Split Mirrors
One interesting technique is to split mirrors instead of taking backups. Of course this only works
if you mirror your boot environment (a recommended practice for those systems with adequate
disk space). Break your mirror, apply patches to the non-running half, cut over the
updated boot environment during the next maintenance window and see how this goes. At
first glance this seems like a good idea, but there are two catches.
- Do you synchronize dynamic boot environment elements ? Things like /etc/passwd,
/etc/shadow, /var/adm/messages, print and mail queues are constantly changing. It is
possible that these have changed between the mirror split and subsequent activation.
- How long are you willing to run without your boot environment being mirrored ?
This may cause you to certify the new boot environment too quickly. You want to
reestablish your mirror, but if that is your fallback in case of trouble you have
a conundrum. And if you are the sort that seems to have a black cloud following you
through life, you will discover a problem shortly after you start the mirror resync.
Pez disks ?
OK, the mirror split thing can be solved by swinging in another disk. Operationally a
bit more complex and you have at least one disk that you can't use for other purposes
(like hosting a few containers), but it can be done. I wouldn't do it (mainly because I
know where this story is heading) but many of you do.
Better living through Live Upgrade
Everything we do to try to make it better adds complexity, or another hundred lines of
scripting. It doesn't need to be this way, and if you become one with the LU commands
it won't be for you either. Live Upgrade will take care of building and updating multiple
boot environments. It will check to make sure the disks being used are bootable
and not part of another boot environment. It works with the Solaris Volume
Manager, Veritas encapsulated root devices, and starting with Solaris 10 10/08
(update 6) ZFS. It also takes care of the synchronization
problem. Starting with Solaris 10 8/07 (update 4), Live Upgrade also works
with containers, both native and branded (and with Solaris 10 10/08 your zoneroots
can be in a ZFS pool).
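To make that concrete, here is a minimal sketch of creating an alternate boot environment. With a ZFS root, lucreate only needs a name. With UFS you also tell it where the new root should live. The boot environment name and device are placeholders, and the run helper and DRY_RUN switch are my own scaffolding rather than part of Live Upgrade.

```shell
#!/bin/sh
# Sketch only: create an alternate boot environment and check its status.
DRY_RUN=${DRY_RUN:-1}          # default: print the commands, do not run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_be() {
    name=$1
    dev=${2:-}
    if [ -n "$dev" ]; then
        # UFS root: copy / to the given slice (mountpoint:device:fstype).
        run lucreate -n "$name" -m "/:$dev:ufs"
    else
        # ZFS root: the new BE is created in the existing root pool.
        run lucreate -n "$name"
    fi
    run lustatus                # the new BE should be listed as complete
}

create_be s10-patched
```

For a UFS root the call might look like create_be s10-patched /dev/dsk/c0t1d0s0, where the slice is whatever spare disk you have set aside.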
Outage: How long will my system be down for the maintenance?
Or perhaps more to the point, how long will my applications be unavailable ? The
proper reply is it depends on how big the patch bundle is and how many containers
you have. And if a kernel patch is involved, double or triple your estimate.
This can be a big problem and cause you to take shortcuts, like installing only some
patches now and others later when it is more convenient. Our good friend
Bart Smaalders has a nice discussion on the implications of this approach and
what we are doing in OpenSolaris to solve this. That solution will eventually work its
way into the Next Solaris, but in the meantime we have a problem to solve.
There is a large set (not really large, but more than one) of patches that require
a quiescent system to be properly applied. An example would be a kernel patch
that causes a change to libc. It is sort of hard to rip out libc on a running
system (new processes get the new libc but may have issues with the running
kernel, old processes get the old libc and tend to be fine, until they do a
fork(2) and exec(2)). So we developed a brilliant solution to this problem -
deferred activation patching. If you apply one of these troublesome patches
then we will throw it in a queue to be applied the next time the system is
quiesced (a fancy term for the next time we're in single user mode). This solves the
current system stability concerns but may make the next reboot take a bit longer.
And if you forgot you have deferred patches in your queue, don't get anxious and
interrupt the shutdown or next boot. Grab a noncaffeinated beverage and
put some Bobby McFerrin on your iPod. Don't Worry, Be Happy.
So deferred activation patching seems like a good way to deal with the situation
where everything goes well. And some brilliant engineers are working on applying
patches in parallel (where applicable) which will make this even better. But what happens when things go wrong ? This is
when you realize that
patchrm(1M) is not your friend. It has never been your
friend, nor will it ever be. I have an almost paralyzing fear of dentists, but would rather visit one than start
down a path where patchrm is involved. Well tested tools and some automation can reduce this to simple
anxiety, but if I could eliminate patchrm altogether I would be much happier.
For all that Live Upgrade can do to ease system maintenance, it is in the areas of outage and
recovery that it is truly special. And when speaking about Solaris, either in
training or evangelism events, this is why I urge attendees to drop whatever they are doing
and adopt Live Upgrade immediately.
Since Live Upgrade (lucreate, lumake, luupgrade) operates on an alternate boot environment,
the currently running set of applications is not affected. The system stays up, applications
stay running and nothing is changing underneath them so there is no cause for concern. The only
impact is some additional load from the Live Upgrade operations. If that is a concern then run
Live Upgrade in a project and cap resource consumption for that project.
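A small sketch of the project idea: newtask(1) starts a command inside a named project, so any resource controls defined on that project apply to the Live Upgrade work. The project name lu-maint is hypothetical and is assumed to have been set up beforehand with whatever caps suit your system. The run helper and DRY_RUN switch are my own scaffolding.

```shell
#!/bin/sh
# Sketch only: run a Live Upgrade command inside a resource-capped project.
# "lu-maint" is a hypothetical project that must already exist.
DRY_RUN=${DRY_RUN:-1}          # default: print the commands, do not run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run_in_project() {
    project=$1
    shift
    # Everything after the project name is the command to run in it.
    run newtask -p "$project" "$@"
}

run_in_project lu-maint lucreate -n s10-patched
```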
An interesting implication of Live Upgrade is that the operational sanity of each step is no
longer required. All that matters is the end state. This gives us more freedom to apply patches
in a more efficient fashion than would be possible on a running boot environment. This is
especially noticeable on a system with containers. The time that the upgrade runs is significantly
reduced, and all the while applications are running. No more deferred activation patches, no more
single user mode patching. And if all goes poorly after activating the new boot environment you
still have your old one to fall back on. Cue Bobby McFerrin for another round of "Don't Worry, Be Happy".
This brings up another feature of Live Upgrade - the synchronization of system files in flight
between boot environments. After a boot environment is activated, a synchronization process
is queued as a K0 script to be run during shutdown. Live Upgrade will catch a lot of private
files that we know about and the obvious public ones (/etc/passwd, /etc/shadow, /var/adm/messages,
mail queues). It also provides a place (/etc/lu/synclist) for you to include things we might not
have thought about or are unique to your applications.
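Entries in the synclist are a path followed by an action keyword (OVERWRITE, APPEND or PREPEND). The helper below is a hedged sketch for adding an application file to the list without creating duplicates. The function name and the example path are made up for illustration, and it is worth trying against a scratch copy before touching the real /etc/lu/synclist.

```shell
#!/bin/sh
# Sketch only: add a "path action" entry to a synclist file if it is missing.
add_sync_entry() {
    path=$1
    action=$2
    file=${3:-/etc/lu/synclist}
    case $action in
        OVERWRITE|APPEND|PREPEND) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    # Skip the append when the path already appears in the first column.
    if [ -f "$file" ] && awk -v p="$path" '$1 == p { found = 1 } END { exit !found }' "$file"; then
        echo "already listed: $path"
        return 0
    fi
    printf '%s\t%s\n' "$path" "$action" >> "$file"
}

# Example (hypothetical application file, scratch copy of the list):
#   add_sync_entry /opt/myapp/etc/app.conf OVERWRITE /var/tmp/synclist.test
```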
When using Live Upgrade, applications are only unavailable for the amount of time it takes
to shut down the system (the synchronization process) and boot the new boot environment.
This may include some minor SMF manifest importing but that should not add much
to the new boot time. You only have to complete the restart during a maintenance window,
not the entire upgrade. While vampires are all the rage for teenagers these days,
system administrators can now come out into the light and work regular hours.
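The cutover itself comes down to a couple of commands, sketched below. One detail matters here: restart with init or shutdown rather than reboot or halt, so that the shutdown scripts (including the synchronization step) actually run. The boot environment name, the run helper and the DRY_RUN switch are placeholders of mine.

```shell
#!/bin/sh
# Sketch only: activate a patched boot environment and restart into it.
DRY_RUN=${DRY_RUN:-1}          # default: print the commands, do not run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cut_over() {
    be=$1
    run lustatus            # confirm the BE is complete before switching
    run luactivate "$be"    # mark it as the BE to boot next
    run init 6              # clean restart so the shutdown scripts run
}

cut_over s10-patched
```

Only the last two steps need to happen inside the maintenance window. Everything before them was done while the applications were still running.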
Recovery: What happens when something goes wrong?
This is when you will fully appreciate Live Upgrade. After activation of a new
boot environment, now called the Primary Boot Environment (PBE), your old boot
environment, now called an Alternate Boot Environment (ABE) can still be called upon in case of trouble.
Just activate it and shut down the system. Applications will be down for a short period
(the K0 sync and subsequent startup), but there will be no more wringing of the hands,
reaching for beverages with too much caffeine and vitamin B12, or trying to remember where you kept your bottle
of Tums. Cue Bobby McFerrin one more time and "Don't Worry, Be Happy". You will be back to your previous operational
state in a matter of a few minutes (longer if you have a large server with many disks).
Then you can mount up your ABE and troll through the logs trying to determine what went wrong.
If you have a service contract then we will troll through the logs with you.
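Once you are running on the old boot environment again, the post-mortem might go something like the sketch below: mount the failed boot environment, look at its logs, and unmount it. The boot environment name, the mount point and the choice of log file are illustrative, and the run helper and DRY_RUN switch are my own scaffolding.

```shell
#!/bin/sh
# Sketch only: mount an inactive boot environment and look at its logs.
DRY_RUN=${DRY_RUN:-1}          # default: print the commands, do not run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

inspect_be() {
    be=$1
    mnt=${2:-/mnt}
    run lumount "$be" "$mnt"                  # mount the inactive BE
    run tail -50 "$mnt/var/adm/messages"      # start with the system log
    run luumount "$be"                        # unmount when finished
}

inspect_be s10-patched
```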
I neglected to mention earlier that the disks that comprise boot environments can be mirrored,
so there is no rush to certification. Everything can be mirrored, at all times. Which is a
very good thing. You still need to back up your boot environments, but you will find yourself
reaching for the backup media much less often when using Live Upgrade.
All that is left are a few simple examples of how to use Live Upgrade. I'll save that for next
time.