ZFS Automatic Snapshots 0.11 Early Access
I'm just about done with this release of the ZFS Automatic Snapshot SMF service and have just pushed some changes to the mercurial repository on opensolaris.org.
This is a pretty important release, in terms of fixing stuff that's been bugging me about the service since I initially released it. But inevitably, with lots of change comes the possibility of lots of bugs - so, I was hoping to get some feedback on how it's looking before it gets officially released.
So, if you're feeling brave (that means don't use this in production yet!) fire up your favourite source code management system (which is hg, right?) and access the mostly-untested ZFS Automatic Snapshot 0.11 Early Access release via:
hg clone ssh://anon@hg.opensolaris.org/hg/jds/zfs-snapshot
I'm working with Niall & Erwann in the Desktop group here, who have been tasked with DSK-5, to get ZFS Automatic Snapshots on the desktop, and so far, it looks like my code will be providing the back-end service (obviously as well as ZFS :-) so some of the changes are things that make most sense when running this on a desktop or laptop machine.
With that in mind, I've not made any changes to my bundled GUI, since it'll be going away real soon now. However, I've done my best to ensure that there are always ways of turning off the small-system-focused bits, and the service remains backwards-compatible with earlier manifests.
So what's going to be new in 0.11? Well, having seen Nils write up his changes in that form (more on that later), I thought I'd have a go at writing a Changelog too - so here's the annotated Changelog entry so far for 0.11:
0.11
Add RBAC support
- the service now runs under a zfssnap role
- service start/stop logs stay under /var/svc/log
- other logs saved to /export/home/zfssnap (and syslog) [ yes, this sucks a bit - better solutions welcome? ]
Add a 'zfs/interval' property value 'none' which doesn't use cron
Add a cache of svcprops to the method script (good idea Nils!)
Add a com.sun:auto-snapshot user property used by all instances
- com.sun:auto-snapshot:$LABEL takes precedence
Remove the seconds field of the snapshot name - it's not needed (good idea Håkan!)
Changed the way // works with recursive snapshots
- ignore snapshot-children, and instead automatically determine when we can take recursive snapshots based on which datasets have the zfs user properties
Set avoidscrub to false by default (6343667 was fixed in nv_94)
Bugfix from Dan (thanks!) - Volumes are datasets too
Automatically snapshot everything by setting com.sun:auto-snapshot=true on startup
- this gets done on all top level datasets - an existing property set to false on the top level dataset overrides this
Check for missed snapshots on startup
Clean up shell style a bit
Clean up preremove script (I need to make these scripts redundant before we move to IPS, I know)
Write this Changelog
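For reference, a snapshot name without the seconds field can be built roughly as follows. The "zfs-auto-snap:" prefix and the date layout match the names the service writes to syslog; the function itself is only an illustrative sketch with a made-up name, not the code from the method script.

```shell
# Hypothetical helper: build a snapshot name with no seconds field.
#   $1 = dataset, $2 = label (the SMF instance name, e.g. "daily")
snapshot_name() {
    printf '%s@zfs-auto-snap:%s-%s\n' "$1" "$2" "$(date +%Y-%m-%d-%H:%M)"
}

# Example (hypothetical dataset):
#   snapshot_name tank/home daily
```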
In terms of user-visibility, the most obvious changes are running under RBAC, and taking snapshots of all filesystems by default - I realise the latter could be controversial, but you can turn it off if you don't like it. I'm also pretty happy with the changes to the "//" schedule - we now ignore "zfs/snapshot-children" for this particular case, and instead use the list of filesystems marked as "com.sun:auto-snapshot=true" to work out which filesystems we can take recursive snapshots of, and which we have to take individual snapshots of. This makes a big difference on large systems.
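To make the opt-out concrete, here is a minimal sketch of setting the user properties by hand. The dataset names (tank/scratch, tank/home) and the "daily" label are made up for illustration, and the helper function names are hypothetical rather than anything shipped with the service. The ZFS variable just lets you point at the binary explicitly instead of relying on PATH.

```shell
# Path to the zfs binary; override ZFS if yours lives elsewhere.
: "${ZFS:=/usr/sbin/zfs}"

# Exclude a dataset from every auto-snapshot schedule.
auto_snapshot_off() {
    "$ZFS" set com.sun:auto-snapshot=false "$1"
}

# Include a dataset in one schedule only; per the Changelog, the
# label-specific property takes precedence over the general one.
auto_snapshot_label_on() {
    "$ZFS" set "com.sun:auto-snapshot:$2=true" "$1"
}

# Show what is currently set, recursively, below a dataset.
auto_snapshot_show() {
    "$ZFS" get -r com.sun:auto-snapshot "$1"
}

# Example (hypothetical datasets):
#   auto_snapshot_off tank/scratch
#   auto_snapshot_label_on tank/home daily
```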
One thing that's missing from this release, is Nils Goroll's suggested changes about improving the way the system performs scheduling - more details here. I feel that moving away from cron would result in less familiarity in what the service does: if cron is the problem, certainly one solution is running away from it, but wouldn't it be cool to get cron's shortcomings fixed instead? Yeah, one of those "ample-free time" problems.
So, without further ado, there's full documentation in the README - enjoy, and please let me know if you see anything weird - there's still time to fix it before 2008.11 (and yes, all this despite my day-job being super hectic right now! xVM Server is getting 99% of my time at the moment, so I definitely expect bugs in this early access release of zfs-auto-snapshots!)
I try to access source code, but I receive this error message (sorry but I'm new to hg):
$ hg -v clone ssh://anon@hg.opensolaris.org/hg/jds/zfs-snapshot
running ssh anon@hg.opensolaris.org "hg -R hg/jds/zfs-snapshot serve --stdio"
remote: Permission denied (publickey,keyboard-interactive).
abort: no suitable response from remote hg!
Posted by Luca Morettoni on August 26, 2008 at 02:09 PM IST #
That's weird Luca, not sure what's going on there. Works for me (as anon, and from a non-Sun IP)
In the meantime, there's a binary drop of the SVR4 package at:
http://blogs.sun.com/timf/resource/zfs-auto-snapshot-11-EA.tar.gz
Posted by Tim Foster on August 26, 2008 at 02:26 PM IST #
Great news Tim, we're just rolling out our first ZFS servers for testing so we'll be making use of this shortly.
Do you know if there are any plans to include a single file restore GUI in Solaris at any point? Something like Apple's Time Machine, or Microsoft's Shadow Copy Client?
Posted by Ross Smith on August 26, 2008 at 07:15 PM IST #
Yes - there's an accompanying project to this that allows you to browse snapshot history in Nautilus and drag/drop any files that you want to restore, and is part of the push towards the DSK-5 2008.11 requirement. A bit like:
http://blogs.sun.com/timf/entry/zfs_on_your_desktop
which pre-dates TimeMachine :-) Hopefully, it'll be a lot more polished than my hacks at nautilus integration.
Posted by Tim Foster on August 27, 2008 at 11:33 AM IST #
*does the happy dance*
I really can't wait for 2008.11, I had a quick play with it earlier and if it's going to have snapshot integration like that it really could be the OS I've been waiting nearly a year for.
PS. Your blogs cause all kinds of problems in IE, it seems there's a script of some kind that redirects me to an error page once it all finishes loading. I generally have to use firefox to view them.
Posted by Ross on August 27, 2008 at 11:41 AM IST #
:-) Will try to fix the blog rendering at some stage, sorry about that. Glad to hear you're looking forward to 2008.11!
Posted by Tim Foster on August 27, 2008 at 11:51 AM IST #
Brett mailed to say he's seeing postinstall failures when pkgadding this to an SXCE nv_83 box. I'll dig into this and push a changeset as soon as I can.
Btw. if anyone wants to watch for commit notifications, feel free to subscribe to zfs-auto-snapshot at opensolaris dot org.
Posted by Tim Foster on August 27, 2008 at 11:21 PM IST #
Hi Tim-
Excited to start using your app :)
I'm running into the following error in opensolaris snv_91 (indiana)
-bash-3.2$ pfexec zfs-auto-snapshot-admin.sh simple
/usr/bin/zfs-auto-snapshot-admin.sh[359]: run_gui: line 292: zfs: not found
Sometimes in your script you say /usr/sbin/zfs and sometimes you just say zfs. As I'm sure you know, scripts should never assume how a user's path statement is configured.
After I fixed this up, I was able to get everything working correctly
Great work!
Posted by rob t on August 29, 2008 at 05:17 PM IST #
Good point rob t - as I say though, I'm not too concerned with the breakage of the bundled GUI, as it'll be going away real soon now. (how can you live without usr/sbin in your path ? :-)
Posted by Tim Foster on September 01, 2008 at 03:54 PM IST #
I am new to this tool--though it looks very promising, and after downloading
http://blogs.sun.com/timf/resource/zfs-auto-snapshot-11-EA.tar.gz
to a V880 test machine running S10U4, I got the following during installation:
pkgadd -d zfs-auto-snapshot-11-EA
...
Installing ZFS Automatic Snapshot Service as <TIMFauto-snapshot>
## Installing part 1 of 1.
/lib/svc/method/zfs-auto-snapshot
/usr/bin/zfs-auto-snapshot-admin.sh
/usr/share/applications/automatic-snapshot.desktop
[ verifying class <none> ]
couldn't set locale correctly
couldn't set locale correctly
[ verifying class <manifest> ]
## Executing postinstall script.
couldn't set locale correctly
couldn't set locale correctly
couldn't set locale correctly
couldn't set locale correctly
64 blocks
couldn't set locale correctly
couldn't set locale correctly
couldn't set locale correctly
couldn't set locale correctly
passwd: password information changed for zfssnap
Installation of <TIMFauto-snapshot> was successful.
The locale on this machine is set to,
# locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
Is this anything to worry about?
Thanks
Posted by Stuart Anderson on September 02, 2008 at 05:18 AM IST #
Nope that's fine Stuart - I see those messages too (and never quite worked out what the problem is, C is a perfectly valid locale) I'll dig into it, and try to work out what's going on.
Btw. there's a bugfix in the works for the avoidscrub functionality (turned off by default now) in my code - which might be important to you if you're taking snapshots and scrubbing the pool at the same time (esp. on S10U4). Will update the .tar.gz once I get the fix pushed to the mercurial workspace. Subscribe to the alias to keep up to date. (it's low traffic, don't worry)
Posted by Tim Foster on September 02, 2008 at 07:41 AM IST #
Just posted new tarball with the latest pushes -
http://blogs.sun.com/timf/resource/zfs-auto-snapshot-11ea.tar.gz
Posted by Tim Foster on September 17, 2008 at 03:15 PM IST #
I get a 404 when I try to download the latest tarball.
Posted by TG on September 18, 2008 at 03:33 PM IST #
Sorry TG - that should be:
http://blogs.sun.com/timf/resource/zfs-auto-snapshot-0.11ea.tar.gz
Posted by Tim Foster on September 18, 2008 at 03:41 PM IST #
Thanks for the link, got the software installed and working. Is there a way to configure it so that the service won't go into maintenance mode if a lock is detected because a previous run hasn't completed? Once that failure occurs, I have to clear the maint state and remove the lock to get it working again.
Posted by tg on September 22, 2008 at 03:53 PM IST #
Nope - if you're using the backup command, it's important to ensure the intervals of your backups are sufficiently far apart that one backup completes before the next one's started. That's there for safety really: I don't control the command users run on the far end to retrieve the backup stream, so I'd rather have it this way than possibly have hundreds of backup jobs pile up on top of each other in the case of errors, or end up corrupting backups on the far end.
Posted by 192.18.1.36 on September 22, 2008 at 03:57 PM IST #
OK, I understand your reasoning for that.
Shortly after I posted that comment, I found my system in a state where clearing the maintenance state on the server did not get the service back online. I see this in /var/adm/messages:
Sep 22 11:16:21 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Last snapshot for svc:/system/filesystem/zfs/auto-snapshot:archive-replicate taken on Mon Sep 22 10:59 2008
Sep 22 11:16:21 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] which was greater than the 5 minutes schedule. Taking snapshot now.
Sep 22 11:16:21 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] tank/archive@zfs-auto-snap:archive-replicate-2008-09-22-10:30 being destroyed -r as per retention policy.
Sep 22 11:16:22 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Taking recursive snapshot tank/archive@zfs-auto-snap:archive-replicate-2008-09-22-11:16
Sep 22 11:16:22 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Starting incr. ZFS send of differences between tank/archive@zfs-auto-snap:archive-replicate-2008-09-22-10:59 and tank/archive@zfs-auto-snap:archive-replicate-2008-09-22-11:16.
Sep 22 11:16:23 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Error: Error performing incremental backup of tank/archive.
Sep 22 11:16:23 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Moving service svc:/system/filesystem/zfs/auto-snapshot:archive-replicate to maintenance mode.
Sep 22 11:16:23 sutl000d last message repeated 1 time
Sep 22 11:16:23 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Backups completed for tank/archive.
Sep 22 11:16:23 sutl000d svc.startd[7]: [ID 748625 daemon.error] system/filesystem/zfs/auto-snapshot:archive-replicate transitioned to maintenance by request (see 'svcs -xv' for details)
The receiving side has snapshots that the sending side does not. Is this a bug or a misconfig on my end?
Posted by TG on September 22, 2008 at 04:23 PM IST #
Hard to say TG: make the backup command as simple as it could possibly be, something like "cat > /dev/null" then see whether it succeeds or not.
Posted by 192.18.1.36 on September 22, 2008 at 04:41 PM IST #
I think I found the problem--the lock isn't being removed.
Sep 23 09:55:55 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Last snapshot for svc:/system/filesystem/zfs/auto-snapshot:archive-replicate taken on Tue Sep 23 9:30 2008
Sep 23 09:55:55 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] which was greater than the 5 minutes schedule. Taking snapshot now.
Sep 23 09:55:55 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Taking recursive snapshot tank/archive@zfs-auto-snap:archive-replicate-2008-09-23-09:55
Sep 23 09:55:56 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Unable to backup tank/archive: incremental\ backup\ in\ progress\ by\ PID\ 5127.
Sep 23 09:55:56 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] A lock prevented us from performing an incremental backup.
Sep 23 09:55:56 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Error: Unable to backup filesystem tank/archive using incremental backup strategy.
Sep 23 09:55:56 sutl000d zfs-auto-snap: [ID 702911 daemon.notice] Moving service svc:/system/filesystem/zfs/auto-snapshot:archive-replicate to maintenance mode.
Here's what happens when I try to remove the lock manually (grabbed the commands from /lib/svc/method/zfs-auto-snapshot, so let me know if I'm doing something wrong.)
-bash-3.2# svcprop -p zfs/backup-lock svc:/system/filesystem/zfs/auto-snapshot:archive-replicate
incremental\ backup\ in\ progress\ by\ PID\ 5127
-bash-3.2# svccfg -s svc:/system/filesystem/zfs/auto-snapshot:archive-replicate setprop zfs/backup-lock = astring: "unlocked"
-bash-3.2# svcprop -p zfs/backup-lock svc:/system/filesystem/zfs/auto-snapshot:archive-replicate
incremental\ backup\ in\ progress\ by\ PID\ 5127
-bash-3.2#
I'm using OpenSolaris 2008.11 snv_97 X86.
Posted by TG on September 23, 2008 at 03:41 PM IST #
Nope, you need to do a svcadm refresh anytime you change a property (or a group of them). Setting it to unlocked doesn't impact the live state of the service until you refresh it. In the code we do:
svccfg -s $FMRI setprop zfs/backup-lock = astring: "unlocked"
svcadm refresh $FMRI
For example,
$ svccfg -s svc:/system/filesystem/zfs/auto-snapshot:daily setprop zfs/backup-lock = astring: "BAR"
$ svcprop -p zfs/backup-lock svc:/system/filesystem/zfs/auto-snapshot:daily
unlocked
$ svcadm refresh svc:/system/filesystem/zfs/auto-snapshot:daily
$ svcprop -p zfs/backup-lock svc:/system/filesystem/zfs/auto-snapshot:daily
BAR
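Putting those steps together, a small helper along these lines should get a stuck instance going again. It's only a sketch: clear_backup_lock is a made-up name, it assumes the previous backup really has finished, and the final svcadm clear only matters if the instance is already in maintenance (svcadm will complain otherwise).

```shell
# Hypothetical helper: reset the backup lock on one instance and bring it
# back from maintenance. Only use it once the earlier backup has finished.
clear_backup_lock() {
    fmri=$1
    svccfg -s "$fmri" setprop zfs/backup-lock = astring: "unlocked" || return 1
    # Without a refresh, svcprop keeps reporting the old value.
    svcadm refresh "$fmri" || return 1
    # Leave the maintenance state so the next scheduled run can start.
    svcadm clear "$fmri"
}

# Example:
#   clear_backup_lock svc:/system/filesystem/zfs/auto-snapshot:archive-replicate
```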
Posted by Tim Foster on September 23, 2008 at 03:50 PM IST #
FYI
The install fails unless /export/home exists.
## Installing part 1 of 1.
/lib/svc/method/zfs-auto-snapshot
/usr/bin/zfs-auto-snapshot-admin.sh
/usr/share/applications/automatic-snapshot.desktop
[ verifying class <none> ]
couldn't set locale correctly
couldn't set locale correctly
[ verifying class <manifest> ]
## Executing postinstall script.
couldn't set locale correctly
couldn't set locale correctly
UX: /usr/sbin/roleadd: ERROR: Unable to create the home directory: No such file or directory.
ERROR: Unable to create zfssnap role!
passwd: User unknown: zfssnap
Permission denied
ERROR: Unable to make zfssnap a no-password account
/var/sadm/pkg/SUNWzfs-auto-snapshot/install/postinstall: /export/home/zfssnap/.profile: cannot create
pkgadd: ERROR: postinstall script did not complete successfully
Installation of <SUNWzfs-auto-snapshot> failed.
Posted by DP on September 23, 2008 at 04:00 PM IST #
Thanks DP - yep, that was one of the things we needed to change for the upcoming push to nv_100. See
timf@haiiro[922] hg log -r 24 --style /home/timf/.hgstyle
changeset: 24:b87c7b4c0350
user: Tim Foster <tim.foster@sun.com>
date: Mon Sep 22 13:37:37 2008 +0100
description:
Remove bundled GUI and fix up pkginfo files
Remove zfssnap user home directory, and fix logging accordingly
I really should update the tarball in the link above.
Posted by Tim Foster on September 23, 2008 at 04:06 PM IST #
Is it possible to configure the maximum number of snapshots? I had a thumper with about six thousand snapshots and booting was no fun...
Posted by Lars Timmann on November 04, 2008 at 11:20 AM GMT #
A maximum number of snapshots taken by this service, or in general? For this service, you can set the "zfs/keep" property on each instance to specify how many snapshots you want retained.
In general though, no, zfs doesn't limit the number of snapshots you have.
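As a rough sketch of both halves: adjusting the retention on one instance, and checking the overall total yourself, since nothing enforces one. The function names are invented here, and the astring type for zfs/keep is an assumption; check the current type with "svccfg -s <fmri> listprop zfs/keep" before setting it.

```shell
# Hypothetical helper: keep at most $2 snapshots for instance $1.
# ASSUMPTION: zfs/keep is an astring property; verify with listprop.
set_keep() {
    svccfg -s "$1" setprop zfs/keep = astring: "$2" || return 1
    # As with any property change, refresh so the service sees it.
    svcadm refresh "$1"
}

# Count every snapshot on the system, whoever created it.
count_snapshots() {
    zfs list -H -t snapshot -o name | wc -l
}

# Example (the value 31 is arbitrary):
#   set_keep svc:/system/filesystem/zfs/auto-snapshot:daily 31
#   count_snapshots
```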
Posted by Tim Foster on November 04, 2008 at 11:32 AM GMT #
The problem I had was that I wrote a snapshot script myself that just looked for filesystems it could snapshot and did it. Then we added many filesystems and didn't calculate how many snapshots we would get. We ended up with so many that booting took us over an hour. I know that I have no limits from zfs but it would be nice to have some limits from your tool.
Posted by Lars Timmann on November 04, 2008 at 11:55 AM GMT #
Oh... and yes I saw the zfs/keep limit property :-).
It is great but a general maximum would be nice, too.
Posted by Lars Timmann on November 04, 2008 at 11:57 AM GMT #
No worries Lars - it'd be interesting to file a bug on the behaviour you're seeing: afaik, snapshots shouldn't affect boot time.
As I say, the Automatic Snapshot service already limits the number of snapshots it keeps per filesystem (but not the total number of snapshots across all filesystems).
Posted by Tim Foster on November 04, 2008 at 11:59 AM GMT #
Tim, maybe I'm missing something really simple here, but when I try to 'pkgadd -d zfs-auto-snapshot-0.11ea', I get this output:
'pkgadd: ERROR: unable to open package <src> pkgmap file </root/packages/zfs-auto-snapshot-0.11ea/src/pkgmap>: No such file or directory'
I've already done a 'make' in the source directory - what else am I missing?
thanks,
Blake
Posted by Blake Irvin on November 10, 2008 at 07:18 PM GMT #
The EA package is a bit old at this stage, Blake - you're probably better off cloning from the hg repository if you can. But, the steps to build and install the package are the same:
$ make ; su root -c "pkgadd -d proto SUNWzfs-auto-snapshot"
Make builds a package inside the ./proto directory of the distribution area, so the -d option to pkgadd tells it to look there for packages to install.
Posted by Tim Foster on November 10, 2008 at 08:05 PM GMT #
Thanks for all your work on zfs-auto-snapshot. It's coming in real handy for me now.
I'd like to add a comment to 6766696, but even after logging-in, can't find an "add comment" button. Is this even possible?
Anyways...
For those of us who'd like to use this on Solaris where ksh is ksh88, the substitution (line 791 of zfs-auto-snapshot) is:
echo ${LIST%%//}
Posted by Dimitri on November 13, 2008 at 03:27 PM GMT #
Thanks Dimitri - alas, ksh88 doesn't have a proper replacement method from what I can see. For example:
timf@xenbld[2946] export LIST="a// b// c// d"
timf@xenbld[2947] echo ${LIST%%//}
a// b// c// d
timf@xenbld[2948] ksh93
timf@xenbld[2784] echo ${LIST//\/\//}
a b c d
timf@xenbld[2785]
The method you suggested only strips a // from the very end of the string (and there isn't one here), rather than replacing all of them. The only workaround is to go back to using sed on systems without ksh93.
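The sed fallback might look something like this. It's a sketch rather than the exact line from the method script, and strip_slashes is just an illustrative name:

```shell
# Remove every "//" from the list; works in ksh88, ksh93 and bash alike.
strip_slashes() {
    printf '%s\n' "$1" | sed 's,//,,g'
}

LIST="a// b// c// d"
strip_slashes "$LIST"    # prints: a b c d
```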
Posted by Tim Foster on November 13, 2008 at 03:37 PM GMT #
You're right, of course! I tested the simple case of only one filesystem to backup...
Posted by Dimitri on November 14, 2008 at 07:15 AM GMT #