Eric Kustarz's Weblog

e-street

Tuesday April 10, 2007

 Poor Man's Cluster - end the corruption

The putback of 6282725 ("hostname/hostid should be stored in the label") introduces hostid checking when importing a pool.
If the pool was last accessed by another system, the import is denied (this can, of course, be overridden with the '-f' flag).
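
The rule can be sketched in a few lines of shell. This is only an illustration of the decision described above, not the actual ZFS code: the function name import_allowed and its arguments are made up for this example, and it assumes the label always carries a hostid.

```shell
# Illustrative only: the real check is done by ZFS itself at import time.
# import_allowed and its arguments are hypothetical names.
#   $1 = hostid recorded in the pool label
#   $2 = hostid of the system attempting the import
#   $3 = "force" if the caller asked for the equivalent of '-f'
import_allowed() {
    label_hostid=$1
    local_hostid=$2
    force=${3:-}

    # Same system that last accessed the pool: nothing to override.
    if [ "$label_hostid" = "$local_hostid" ]; then
        return 0
    fi

    # Different system: denied unless explicitly forced.
    [ "$force" = "force" ]
}
```

For example, import_allowed 0x4ab08c2 0x8373b35b fails, while the same call with a third argument of "force" succeeds.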

This is especially important to people rolling their own clusters - the so-called poor man's cluster. What people were finding is:

1) clientA creates the pool (using shared storage)
2) clientA reboots/panics
3) clientB forcibly imports the pool
4) clientA comes back up
5) clientA automatically imports the pool via /etc/zfs/zpool.cache

At this point, both clientA and clientB have the same pool imported and both can write to it - however, ZFS is not designed
to have multiple writers (yet), so the two clients will quickly corrupt the pool, as each has a different view of the pool's state.

Now that we store the hostid in the label and verify the system importing the pool was the last one that accessed the pool, the
poor man's cluster corruption scenario mentioned above can no longer happen. Below is an example using shared storage over iSCSI.
In the example, clientA is 'fsh-weakfish', clientB is 'fsh-mullet'.

First, let's create the pool on clientA (assume both clients are already set up for iSCSI):

fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-weakfish# zfs create i/wombat
fsh-weakfish# zfs create i/hulio 
fsh-weakfish# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          154K  9.78G    19K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-weakfish#

Note the enhanced information 'zpool import' reports on clientB:

fsh-mullet# zpool import
  pool: i
    id: 8574825092618243264
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        i                                        ONLINE
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#

Ok, we don't want to forcibly import the pool until clientA is down. So after clientA (fsh-weakfish) has rebooted,
forcibly import the pool on clientB (fsh-mullet):

fsh-weakfish# reboot
....

fsh-mullet# zpool import -f i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet#

After clientA comes back up, we'll see this message via syslog:

WARNING: pool 'i' could not be loaded as it was last accessed by another system
(host: fsh-mullet hostid: 0x8373b35b).  See: http://www.sun.com/msg/ZFS-8000-EY

And just to double-check that pool 'i' is in fact not loaded:

fsh-weakfish# zpool list
no pools available
fsh-weakfish# 

And to verify the pool has not been corrupted from clientB's view of the world, we see:

fsh-mullet# zpool scrub i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Apr 10 10:28:03 2007
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          156K  9.78G    21K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-mullet# 

See you never again, poor man's cluster corruption.

One detail I'd like to point out is that you have to be careful about *when* you forcibly import a pool. For instance,
if you forcibly import the pool on clientB *before* you reboot clientA, then corruption can still happen. This is because
the command reboot(1M) cleanly takes down the machine, which means it unmounts all filesystems, and unmounting a
filesystem will write a bit of data to the pool - a pool that clientB has already imported and may be writing to as well.
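
One way to guard against forcing the import too early is to wrap it in a small check. The sketch below is an assumption on my part, not something ZFS provides: force_import_if_peer_down, the ZPOOL variable, and the probe command you pass in are all hypothetical. It refuses to run 'zpool import -f' while the probe says the other host is still reachable.

```shell
# Hypothetical wrapper; ZPOOL and the probe command are assumptions.
# ZPOOL can be pointed at another command (e.g. echo) for a dry run.
ZPOOL=${ZPOOL:-zpool}

#   $1 = pool name
#   remaining arguments = a probe command that exits 0 while the
#   other host is still reachable
force_import_if_peer_down() {
    pool=$1
    shift

    # If the probe succeeds, the other host may still have the pool open.
    if "$@" >/dev/null 2>&1; then
        echo "peer still reachable; refusing to force import of '$pool'" >&2
        return 1
    fi

    "$ZPOOL" import -f "$pool"
}
```

It could be called as: force_import_if_peer_down i ping fsh-weakfish. Keep in mind that a failed probe does not prove the other host is really down (the network between the two could simply be broken), so treat this as a seat belt rather than a guarantee.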

To see the new information on the label, you can use zdb(1M):

fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=6
    name='i'
    state=0
    txg=665
    pool_guid=8574825092618243264
    hostid=2205397851
    hostname='fsh-mullet'
    top_guid=5676430250453749577
    guid=5676430250453749577
    vdev_tree
        type='disk'
        id=0
        guid=5676430250453749577
        path='/dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0'
        devid='id1,ssd@x01000003baaae84f00002a0045f86e49/a'
        whole_disk=1
        metaslab_array=14
        metaslab_shift=26
        ashift=9
        asize=10724048896
        DTL=30
--------------------------------------------
LABEL 1
--------------------------------------------
...
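
Note that zdb prints the hostid in decimal, while the import and syslog messages print it in hex. A small helper can pull the owner out of the label and convert it. This is a minimal sketch: label_owner is a hypothetical name, and it assumes the 'hostid=' and "hostname='...'" lines look exactly like the zdb output above.

```shell
# Hypothetical helper: read `zdb -l` output on stdin and print the
# hostname and hostid (in hex) recorded in the first label found.
label_owner() {
    label=$(cat)

    # First "hostid=<decimal>" line.
    id=$(printf '%s\n' "$label" |
        sed -n 's/^[[:space:]]*hostid=//p' | head -n 1)

    # First "hostname='<name>'" line, with the quotes stripped.
    name=$(printf '%s\n' "$label" |
        sed -n "s/^[[:space:]]*hostname='\(.*\)'\$/\1/p" | head -n 1)

    [ -n "$id" ] && [ -n "$name" ] || return 1
    printf '%s 0x%x\n' "$name" "$id"
}
```

Piping the label shown above through it (zdb -l ... | label_owner) prints "fsh-mullet 0x8373b35b", which matches the hostid in the syslog warning on fsh-weakfish.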


(2007-04-18 05:05:43.0/2007-04-10 10:57:47.0)