|
Poor Man's Cluster - end the corruption
The putback of 6282725 "hostname/hostid should be stored in the label" introduces hostid checking when importing a pool.
If the pool was last accessed by another system, then the import is denied (this can, of course, be overridden with the '-f' flag).
This is especially important to people rolling their own clusters - the so-called poor man's cluster. What people were finding is:
1) clientA creates the pool (using shared storage)
2) clientA reboots/panics
3) clientB forcibly imports the pool
4) clientA comes back up
5) clientA automatically imports the pool via /etc/zfs/zpool.cache
At this point, both clientA and clientB have the same pool imported and both can write to it - however, ZFS is not designed
to have multiple writers (yet), so both clients will quickly corrupt the pool, as each has a different view of the pool's state.
Now that we store the hostid in the label and verify the system importing the pool was the last one that accessed the pool, the
poor man's cluster corruption scenario mentioned above can no longer happen. Below is an example using shared storage over iSCSI.
In the example, clientA is 'fsh-weakfish', clientB is 'fsh-mullet'.
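Before walking through the real commands, the check can be sketched as a toy shell model. This is only an illustration of the decision described above, not the ZFS implementation: the function can_import and the variable names are hypothetical, and the two hostids are the ones that appear in the messages later in this post.

```shell
#!/bin/sh
# Toy model of the import-time hostid check. Hypothetical helper, not ZFS code.
# Usage: can_import LABEL_HOSTID MY_HOSTID [force]
# Returns 0 if the import would be allowed, 1 if it would be denied.
can_import() {
    if [ "$1" = "$2" ]; then
        return 0        # we were the last system to access the pool
    fi
    if [ "${3:-}" = "force" ]; then
        return 0        # the administrator overrode the check ('-f')
    fi
    return 1            # last accessed by another system: deny
}

WEAKFISH=0x4ab08c2      # clientA hostid, as reported in this post
MULLET=0x8373b35b       # clientB hostid, as reported in this post

label=$WEAKFISH         # clientA creates the pool

# clientB tries a plain import while the label still names clientA
can_import "$label" "$MULLET" && echo "mullet: imported" || echo "mullet: denied"

# clientA goes down; clientB forces the import and the label now names clientB
can_import "$label" "$MULLET" force && label=$MULLET

# clientA comes back up and tries its automatic (non-forced) import
can_import "$label" "$WEAKFISH" && echo "weakfish: imported" || echo "weakfish: denied"
```

In this model the automatic import on clientA is denied in the last step because the label names clientB, which is the behavior the rest of the example demonstrates with the real tools.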
First, let's create the pool on clientA (assume both clients are already set up for iSCSI):
fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-weakfish# zfs create i/wombat
fsh-weakfish# zfs create i/hulio
fsh-weakfish# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          154K  9.78G    19K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-weakfish#
Note the enhanced information 'zpool import' reports on clientB:
fsh-mullet# zpool import
  pool: i
    id: 8574825092618243264
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        i                                        ONLINE
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#
Ok, we don't want to forcibly import the pool until clientA is down. So once clientA (fsh-weakfish) has gone down for its reboot,
forcibly import the pool on clientB (fsh-mullet):
fsh-weakfish# reboot
....
fsh-mullet# zpool import -f i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet#
After clientA comes back up, we'll see this message via syslog:
WARNING: pool 'i' could not be loaded as it was last accessed by another system
(host: fsh-mullet hostid: 0x8373b35b). See: http://www.sun.com/msg/ZFS-8000-EY
And just to double-check that pool 'i' is in fact not loaded:
fsh-weakfish# zpool list
no pools available
fsh-weakfish#
And to verify the pool has not been corrupted from clientB's view of the world, we see:
fsh-mullet# zpool scrub i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Apr 10 10:28:03 2007
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          156K  9.78G    21K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-mullet#
See you never again, poor man's cluster corruption.
One detail I'd like to point out is that you have to be careful about *when* you forcibly import a pool. For instance,
if you forcibly import the pool on clientB *before* you reboot clientA, then corruption can still happen. This is because
the command reboot(1M) cleanly takes down the machine, which means it unmounts all filesystems, and unmounting a
filesystem will write a bit of data to the pool.
To see the new information on the label, you can use zdb(1M):
fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=6
    name='i'
    state=0
    txg=665
    pool_guid=8574825092618243264
    hostid=2205397851
    hostname='fsh-mullet'
    top_guid=5676430250453749577
    guid=5676430250453749577
    vdev_tree
        type='disk'
        id=0
        guid=5676430250453749577
        path='/dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0'
        devid='id1,ssd@x01000003baaae84f00002a0045f86e49/a'
        whole_disk=1
        metaslab_array=14
        metaslab_shift=26
        ashift=9
        asize=10724048896
        DTL=30
--------------------------------------------
LABEL 1
--------------------------------------------
...
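Note that zdb prints the hostid in decimal, while the syslog warning and the 'zpool import' error print it in hex. A minimal sketch for lining the two up, assuming a POSIX shell with printf (the helper name hostid_to_hex is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: print a decimal hostid in the 0x... form used by the messages.
hostid_to_hex() {
    printf '0x%x\n' "$1"
}

hostid_to_hex 2205397851    # hostid from LABEL 0 above; prints 0x8373b35b
```

The result matches the hostid reported for fsh-mullet in the syslog warning on fsh-weakfish.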
(2007-04-18 05:05:43.0/2007-04-10 10:57:47.0)
|