Poor Man's Cluster - end the corruption
The putback of 6282725 "hostname/hostid should be stored in the label" introduces hostid checking when importing a pool.
If the pool was last accessed by another system, the import is denied (this can, of course, be overridden with the '-f' flag).
This is especially important to people rolling their own clusters - the so-called poor man's cluster. What people were finding is:
1) clientA creates the pool (using shared storage)
2) clientA reboots/panics
3) clientB forcibly imports the pool
4) clientA comes back up
5) clientA automatically imports the pool via /etc/zfs/zpool.cache
At this point, both clientA and clientB have the same pool imported and both can write to it. However, ZFS is not designed
to have multiple writers (yet), so the two clients will quickly corrupt the pool, as each has a different view of the pool's state.
Now that we store the hostid in the label and verify that the system importing the pool was the last one that accessed it, the
poor man's cluster corruption scenario described above can no longer happen. Below is an example using shared storage over iSCSI.
In the example, clientA is 'fsh-weakfish', clientB is 'fsh-mullet'.
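The check described above boils down to comparing two hostids. Here is a minimal sketch of that decision in shell. The function name import_allowed and the "force" argument are made up for illustration, and it models only the comparison described in this post, not every state the real import code handles:

```shell
# Hypothetical helper (not part of ZFS): decide whether an import should proceed.
#   $1 = hostid recorded in the pool label
#   $2 = hostid of the system attempting the import
#   $3 = "force" to mimic 'zpool import -f' (optional)
import_allowed() {
    label_hostid=$1
    my_hostid=$2
    force=${3:-}
    # Same system as the last accessor: no '-f' needed.
    if [ "$label_hostid" = "$my_hostid" ]; then
        return 0
    fi
    # Different system: proceed only when explicitly forced.
    [ "$force" = "force" ]
}
```

With the hostids from the example below, import_allowed 0x4ab08c2 0x8373b35b fails (fsh-mullet trying to import a pool last accessed by fsh-weakfish), while adding "force" as the third argument succeeds.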
First, let's create the pool on clientA (assume both clients are already set up for iSCSI):
fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool status
pool: i
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
i ONLINE 0 0 0
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE 0 0 0
errors: No known data errors
fsh-weakfish# zfs create i/wombat
fsh-weakfish# zfs create i/hulio
fsh-weakfish# zfs list
NAME USED AVAIL REFER MOUNTPOINT
i 154K 9.78G 19K /i
i/hulio 18K 9.78G 18K /i/hulio
i/wombat 18K 9.78G 18K /i/wombat
fsh-weakfish#
Note the enhanced information 'zpool import' reports on clientB:
fsh-mullet# zpool import
pool: i
id: 8574825092618243264
state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
the '-f' flag.
see: http://www.sun.com/msg/ZFS-8000-EY
config:
i ONLINE
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#
Ok, we don't want to forcibly import the pool until clientA is down. So once clientA (fsh-weakfish) has gone down for its reboot,
forcibly import the pool on clientB (fsh-mullet):
fsh-weakfish# reboot
....
fsh-mullet# zpool import -f i
fsh-mullet# zpool status
pool: i
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
i ONLINE 0 0 0
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE 0 0 0
errors: No known data errors
fsh-mullet#
After clientA comes back up, we'll see this message via syslog:
WARNING: pool 'i' could not be loaded as it was last accessed by another system
(host: fsh-mullet hostid: 0x8373b35b). See: http://www.sun.com/msg/ZFS-8000-EY
And just to double-check that pool 'i' is in fact not loaded:
fsh-weakfish# zpool list
no pools available
fsh-weakfish#
And to verify the pool has not been corrupted from clientB's view of the world, we see:
fsh-mullet# zpool scrub i
fsh-mullet# zpool status
pool: i
state: ONLINE
scrub: scrub completed with 0 errors on Tue Apr 10 10:28:03 2007
config:
NAME STATE READ WRITE CKSUM
i ONLINE 0 0 0
c2t01000003BAAAE84F00002A0045F86E49d0 ONLINE 0 0 0
errors: No known data errors
fsh-mullet# zfs list
NAME USED AVAIL REFER MOUNTPOINT
i 156K 9.78G 21K /i
i/hulio 18K 9.78G 18K /i/hulio
i/wombat 18K 9.78G 18K /i/wombat
fsh-mullet#
See you never again, poor man's cluster corruption.
One detail I'd like to point out is that you have to be careful about *when* you forcibly import a pool. For instance,
if you forcibly import the pool on clientB *before* you reboot clientA, then corruption can still happen. This is because
reboot(1M) takes the machine down cleanly, which means it unmounts all filesystems, and unmounting a
filesystem writes a bit of data to the pool, which clientB would by then already have imported.
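If you script the takeover, one way to respect that ordering is to refuse the forced import unless some probe says the old owner is really down. The wrapper below is only a sketch: takeover_pool and the ZPOOL override are hypothetical names, and the probe is whatever command you trust at your site. Keep in mind that a plain network check cannot prove the other host has stopped writing to shared storage.

```shell
# Hypothetical wrapper (not part of ZFS): force-import a pool only after a
# caller-supplied probe command reports that the previous owner is down.
# ZPOOL may be overridden (e.g. ZPOOL=echo) for a dry run.
ZPOOL=${ZPOOL:-zpool}

takeover_pool() {
    pool=$1
    shift
    # "$@" is the probe; it must exit 0 only if the old owner is really down.
    if "$@"; then
        $ZPOOL import -f "$pool"
    else
        echo "refusing to import '$pool': previous owner may still be up" >&2
        return 1
    fi
}
```

For a dry run, ZPOOL=echo followed by takeover_pool i true prints the command that would be run instead of running it.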
To see the new information on the label, you can use zdb(1M):
fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
version=6
name='i'
state=0
txg=665
pool_guid=8574825092618243264
hostid=2205397851
hostname='fsh-mullet'
top_guid=5676430250453749577
guid=5676430250453749577
vdev_tree
type='disk'
id=0
guid=5676430250453749577
path='/dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0'
devid='id1,ssd@x01000003baaae84f00002a0045f86e49/a'
whole_disk=1
metaslab_array=14
metaslab_shift=26
ashift=9
asize=10724048896
DTL=30
--------------------------------------------
LABEL 1
--------------------------------------------
...
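Note that zdb prints the hostid in decimal, while the import and syslog messages print it in hex. They are the same value: 2205397851 is 0x8373b35b, the hostid reported for fsh-mullet in the syslog warning above. A small sketch to pull the hostid out of the label and convert it (the function name label_hostid_hex is made up for illustration):

```shell
# Hypothetical helper: read 'zdb -l' output on stdin and print the first
# label's hostid in hex, the form used by the import and syslog messages.
label_hostid_hex() {
    # Pick the first "hostid=<decimal>" line, ignoring leading whitespace.
    dec=$(awk -F= '$1 ~ /^[[:space:]]*hostid$/ { print $2; exit }')
    [ -n "$dec" ] || return 1   # no hostid recorded in the label
    printf '0x%x\n' "$dec"
}
```

Piping the zdb -l output shown above into label_hostid_hex prints 0x8373b35b.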
(2007-04-18 05:05:43.0/2007-04-10 10:57:47.0)
fsh-mullet# zpool import
pool: i
id: 8574825092618243264
state: ONLINE
status: The pool was last accessed by another system.
< .....>
Thanks for adding this cool feature!
- Ryan
Posted by Matty on April 10, 2007 at 01:05 PM PDT #
Posted by 192.18.43.225 on April 10, 2007 at 01:56 PM PDT #
Love, the blog fashionista.
Posted by What not to wear on your blog on April 10, 2007 at 02:00 PM PDT #
It should be possible to add the hostname to 'zpool import' (without specifying a specific pool). I decided against it as it seemed to clutter up the output and was inconsistent with how specific the other 'zpool import' error messages are.
You can get that information by trying to import the specific pool. If this isn't sufficient, let me know and I can see about adding it.
Posted by eric kustarz on April 10, 2007 at 05:25 PM PDT #
With regards to having to or not having to use the '-f' flag, yep, we've made it easier on you. If you were the last one to access the pool, then '-f' is no longer needed. Try destroying a pool and importing it via 'zpool import -D <pool>' - no more '-f'!
Posted by eric kustarz on April 10, 2007 at 05:27 PM PDT #
Fashion police - you're too funny!
when i have spare time, i'll check out the new themes...
Posted by eric kustarz on April 10, 2007 at 05:30 PM PDT #
Posted by David Smith on April 17, 2007 at 10:00 AM PDT #
Hey David,
It won't make s10u4 and the schedules for future updates haven't been settled yet. So i don't know yet.
Posted by eric kustarz on April 17, 2007 at 01:53 PM PDT #
I'm intrigued by the "yet". Are there any concrete plans to support that?
Posted by David Hopwood on May 27, 2007 at 06:33 AM PDT #
Nothing concrete. pNFS is going to be one solution and that is being actively worked on (prototype works and the NFSv4 wg is going to settle on the spec this summer).
Posted by eric kustarz on May 29, 2007 at 09:18 AM PDT #
Posted by Jeff on June 05, 2007 at 03:36 PM PDT #