A question was recently posted to zfs-discuss@opensolaris.org about AVS replication versus ZFS send/receive for odd-sized volume pairs: does the use of AVS make it all seamless? Yes, Availability Suite makes it all seamless, but only after AVS is initially configured.
Unlike ZFS, which was designed and developed to be very easy to configure, Availability Suite requires explicit and somewhat overly detailed configuration information to be set up, and set up correctly, for it to work seamlessly.
Recently I worked with one of Sun's customers on the configuration of two Sun Fire x4500 servers. The x4500 is a remarkably performing system: a four-way x64 server with the highest storage density available, 24 TB in 4U of rack space. The customer's desired configuration was simple: two servers in an active-active, high-availability configuration, deployed 2000 km apart, with each system acting as the disaster recovery system for the other. Replication needed to be CDP (Continuous Data Protection), running 24/7/365 in both directions. Once set up correctly, CDP would work seamlessly and be a lights-out operation.
Each x4500, or Thumper, comes with 48 disks, two of which will be used as the SVM-mirrored system disk (can't have a single point of failure), leaving 46 data disks. Since each system will also be the disaster recovery system for the other site, this leaves 23 disks available on each system for its own data. The choice of ZFS redundancy type, the number of volumes in each pool, and whether compression or encryption is enabled are of no concern to Availability Suite, since whatever vdevs are configured, the ZFS volume and file metadata will get replicated too.
For testing out this replicated ZFS on AVS scenario on my Thumper, here are the steps I followed:
1). Take one of the 46 disks that will eventually be placed in the ZFS storage pool. Use the ZFS zpool utility to correctly format this disk, an action which will create an EFI-labeled disk with all available blocks in slice 0. Then delete the pool.
# zpool create -f temp c4t2d0; zpool destroy temp
2). Next run the AVS 'dsbitmap' utility to determine the size of an SNDR bitmap to replicate this disk's slice 0, saving the results for later use.
# dsbitmap -r /dev/rdsk/c4t2d0s0 | tee /tmp/vol_size
Remote Mirror bitmap sizing
Data volume (/dev/rdsk/c4t2d0s0) size: 285196221 blocks
Required bitmap volume size:
Sync replication: 1089 blocks
Async replication with memory queue: 1089 blocks
Async replication with disk queue: 9793 blocks
Async replication with disk queue and 32 bit refcount: 35905 blocks
The selection here is either synchronous replication or asynchronous replication with a memory queue, both of which need the same 1089-block bitmap. Other replication types also work with ZFS, but synchronous replication is best if network latency is low.
3). To assure redundancy of the SNDR bitmap, each will be mirrored via SVM, hence we will need to double the number of blocks needed, rounded up to a multiple of 8 KB, or 16 blocks.
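As a minimal sketch of how the saved dsbitmap output can be read back, the snippet below pulls the bitmap size for a chosen replication mode out of text shaped like the output above. The helper name bitmap_blocks and the sample file are hypothetical, and the sample text is copied from the run shown here rather than produced by calling dsbitmap.

```shell
# Sample text copied from the dsbitmap run above; on a real system this
# would be the file written by "dsbitmap -r ... | tee /tmp/vol_size".
SAMPLE=${TMPDIR:-/tmp}/vol_size.sample.$$
cat > "$SAMPLE" <<'EOF'
Remote Mirror bitmap sizing
Data volume (/dev/rdsk/c4t2d0s0) size: 285196221 blocks
Required bitmap volume size:
Sync replication: 1089 blocks
Async replication with memory queue: 1089 blocks
Async replication with disk queue: 9793 blocks
Async replication with disk queue and 32 bit refcount: 35905 blocks
EOF

# bitmap_blocks FILE LABEL
# Print the block count from the line whose text before ": " is exactly
# LABEL. Hypothetical helper, not part of AVS.
bitmap_blocks() {
    awk -F': ' -v label="$2" '$1 == label { split($2, f, " "); print f[1] }' "$1"
}

SYNC_BLOCKS=$(bitmap_blocks "$SAMPLE" "Sync replication")
ASYNC_MEM_BLOCKS=$(bitmap_blocks "$SAMPLE" "Async replication with memory queue")
ASYNC_DISK_BLOCKS=$(bitmap_blocks "$SAMPLE" "Async replication with disk queue")
echo "sync: $SYNC_BLOCKS, async (memory queue): $ASYNC_MEM_BLOCKS, async (disk queue): $ASYNC_DISK_BLOCKS"
rm -f "$SAMPLE"
```

Matching on the full label avoids the ambiguity of a plain grep, since the word "sync" also appears inside every "Async" line.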
# VOL_SIZE="`cat /tmp/vol_size| grep 'size: [0-9]' | awk '{print $5}'`"
# BMP_SIZE="`cat /tmp/vol_size| grep 'Sync ' | awk '{print $3}'`"
# SVM_SIZE=$(( ((BMP_SIZE + (16 - 1)) / 16) * 16 * 2 ))
# ZFS_SIZE=$((VOL_SIZE-SVM_SIZE))
# SVM_OFFS=$(((34+ZFS_SIZE)))
# echo "Original volume size: $VOL_SIZE, Bitmap size: $BMP_SIZE"
# echo "SVM soft partition size: $SVM_SIZE, ZFS vdev size: $ZFS_SIZE"
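To make the arithmetic concrete, here is a worked sketch using the numbers from the dsbitmap output above. It assumes the intent is to round the bitmap up to a multiple of 16 blocks and then double it for the mirror; the variable names ALIGN and ROUNDED are introduced only for illustration.

```shell
# Values taken from the dsbitmap output shown earlier.
VOL_SIZE=285196221   # blocks in slice 0
BMP_SIZE=1089        # blocks for one sync-mode bitmap
ALIGN=16             # 8 KB expressed in 512-byte blocks (illustrative name)

# Round the bitmap up to a multiple of ALIGN, then double it for the mirror.
ROUNDED=$(( (BMP_SIZE + ALIGN - 1) / ALIGN * ALIGN ))
SVM_SIZE=$(( ROUNDED * 2 ))

# Whatever is not reserved for SVM stays with ZFS; slice 1 starts right
# after slice 0, which itself starts at sector 34.
ZFS_SIZE=$(( VOL_SIZE - SVM_SIZE ))
SVM_OFFS=$(( 34 + ZFS_SIZE ))

echo "rounded bitmap: $ROUNDED blocks"
echo "slice 0 (ZFS):  $ZFS_SIZE blocks starting at sector 34"
echo "slice 1 (SVM):  $SVM_SIZE blocks starting at sector $SVM_OFFS"
```

With these inputs the bitmap rounds up from 1089 to 1104 blocks, so each disk gives up 2208 blocks to SVM and keeps 285194013 blocks for ZFS.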
5). Use the 'find' utility below, adjusting its first parameter to produce the list of volumes that will be placed into the ZFS storage pool. Carefully examine this list, and adjust the first search parameter and/or use 'egrep -v "disk|disk"', naming one or more disks, to exclude from this list any volumes that are not to be part of this ZFS storage pool configuration.
The list produced by "find ..." is key to reformatting all of the LUNs that will be part of a replicated ZFS storage pool.
# find /dev/rdsk/c[45]*s0
or
# find /dev/rdsk/c[45]*s0 | egrep -v "c4t2d0s0|c4t3d0s0"
6). Re-use the corrected find command from above as the driver to change the format of all of those volumes.
# find /dev/rdsk/c[45]*s0 | xargs -n1 fmthard -d 0:4:0:34:$ZFS_SIZE
# find /dev/rdsk/c[45]*s0 | xargs -n1 fmthard -d 1:4:0:$SVM_OFFS:$SVM_SIZE
# find /dev/rdsk/c[45]*s0 | xargs -n1 prtvtoc |egrep "^ [01]|partition map"
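Since fmthard rewrites the label of every disk in the list, it may be worth previewing the generated command lines first. The sketch below is a dry run under stated assumptions: disk_list is a hypothetical stand-in for the find command, and the sizes are the ones computed from the sample dsbitmap output.

```shell
# Sizes derived from the sample dsbitmap output; substitute your own.
ZFS_SIZE=285194013
SVM_OFFS=285194047
SVM_SIZE=2208

# Hypothetical stand-in for the output of the find command, so the
# sketch runs on any system.
disk_list() {
    printf '%s\n' /dev/rdsk/c4t2d0s0 /dev/rdsk/c4t3d0s0 /dev/rdsk/c5t0d0s0
}

# Putting "echo" in front of the command makes xargs print each command
# line instead of executing it.
SLICE0_PREVIEW=$(disk_list | xargs -n1 echo fmthard -d "0:4:0:34:$ZFS_SIZE")
SLICE1_PREVIEW=$(disk_list | xargs -n1 echo fmthard -d "1:4:0:$SVM_OFFS:$SVM_SIZE")
echo "$SLICE0_PREVIEW"
echo "$SLICE1_PREVIEW"
```

Once the printed lines look right, dropping the echo and replacing disk_list with the real find command gives the commands shown in this step.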
7). Re-use the corrected find command from above, with the additional selection of only even numbered disks, placing slice 1 of all selected disks into the SVM metadevice d101
# find /dev/rdsk/c[45]*[24680]s1 | xargs -I {} echo 1 {} | xargs metainit d101 `find /dev/rdsk/c[45]*[24680]s1 | wc -l`
8). Re-use the corrected find command from above, with the additional selection of only odd numbered disks, placing slice 1 of all selected disks into the SVM metadevice d102
# find /dev/rdsk/c[45]*[13579]s1 | xargs -I {} echo 1 {} | xargs metainit d102 `find /dev/rdsk/c[45]*[13579]s1 | wc -l`
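The two pipelines above build one long metainit argument list: a count of components, followed by a "1 device" pair for each slice. As a rough sketch of what that list looks like, the hypothetical helper concat_args below assembles the same shape from device names passed as arguments; the device names are made up for illustration and nothing is executed against SVM.

```shell
# concat_args DEVICE...
# Print "<count> 1 dev1 1 dev2 ...", the argument shape the pipelines
# above hand to metainit. Hypothetical helper for previewing only.
concat_args() {
    printf '%s' "$#"
    for dev in "$@"; do
        printf ' 1 %s' "$dev"
    done
    printf '\n'
}

# Illustrative device names, not a real Thumper layout.
ARGS=$(concat_args /dev/rdsk/c4t0d0s1 /dev/rdsk/c4t2d0s1 /dev/rdsk/c4t4d0s1)
echo "metainit d101 $ARGS"
```

Printing the command first makes it easy to confirm that each submirror received the slices you expected before any metadevice is created.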
9). Now mirror metadevices d101 and d102 into mirror d100, ignoring the WARNING that both sides of the mirror will not be the same. When the bitmap volumes are created, they will be initialized, at which time both sides of the mirror will be equal.
# metainit d100 -m d101 d102
10). Now, from the mirrored SVM storage pool, allocate bitmap volumes out of SVM soft partitions for each SNDR replica.
# OFFSET=1
# for n in `find /dev/rdsk/c[45]*s1 | grep -n s1 | cut -d ':' -f1 | xargs`
do
metainit d$n -p /dev/md/rdsk/d100 -o $OFFSET -b $BMP_SIZE
OFFSET=$(((OFFSET + BMP_SIZE + 1)))
done
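The offsets the loop above hands to metainit can be tabulated ahead of time. This sketch repeats the same arithmetic without touching SVM; COUNT is a hypothetical number of replicated disks, and BMP_SIZE is the sync bitmap size from the sample dsbitmap output.

```shell
BMP_SIZE=1089   # from the sample dsbitmap output
COUNT=4         # hypothetical number of replicated disks

OFFSET=1
OFFSETS=""
n=1
while [ "$n" -le "$COUNT" ]; do
    echo "d$n: offset $OFFSET, length $BMP_SIZE blocks"
    OFFSETS="$OFFSETS $OFFSET"
    # Same step as the loop above: skip the bitmap plus one extra block.
    OFFSET=$(( OFFSET + BMP_SIZE + 1 ))
    n=$(( n + 1 ))
done
echo "next free offset in d100: $OFFSET"
```

Comparing the final offset against the size of d100 is a quick way to check that all of the soft partitions will fit.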
11). Repeat steps 1 - 10 on the SNDR remote system (NODE-B)
12). Enable the SNDR sets on NODE-A
# DISK=1
# for ZFS_DISK in `find /dev/rdsk/c[45]*s0`
do
sndradm -nE NODE-A $ZFS_DISK /dev/md/rdsk/d$DISK NODE-B $ZFS_DISK /dev/md/rdsk/d$DISK ip sync g zfs-pool
DISK=$(((DISK + 1)))
done
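Because each data disk must be paired with its own bitmap volume, it can help to print the enable commands before running them. The following is a dry-run sketch: disk_list is a hypothetical stand-in for the find command, and the host names and group are the ones used in this walkthrough.

```shell
PRIMARY=NODE-A     # host names as used in this walkthrough
SECONDARY=NODE-B
GROUP=zfs-pool

# Hypothetical stand-in for the output of the find command.
disk_list() {
    printf '%s\n' /dev/rdsk/c4t2d0s0 /dev/rdsk/c4t3d0s0
}

DISK=1
for ZFS_DISK in $(disk_list); do
    BITMAP=/dev/md/rdsk/d$DISK
    # Build the command as text only; nothing is passed to sndradm here.
    CMD="sndradm -nE $PRIMARY $ZFS_DISK $BITMAP $SECONDARY $ZFS_DISK $BITMAP ip sync g $GROUP"
    echo "$CMD"
    DISK=$(( DISK + 1 ))
done
```

The printed lines show at a glance that disk N is paired with bitmap dN on both nodes, which is the property the numbering in step 10 relies on.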
13). Repeat step 12 on NODE-B
14). Create the ZFS storage pool. Note that zpool requires block devices (/dev/dsk), not the raw devices (/dev/rdsk) used above.
# find /dev/dsk/c[45]*s0 | xargs zpool create zfs-pool
15). Enable SNDR replication, and take a look at what you have done!
# sndradm -g zfs-pool -nu
# sndradm -g zfs-pool -P
# metastat -p
# zpool status zfs-pool
I'm trying this on Solaris Express b77
Step 14 doesn't work. I get the error:-
cannot use '/dev/rdsk/c3d0s0': must be a block device or regular file
"sndradm -g zfs-pool -P" should be a lowercase "-p"
Despite this I cannot get it to work. I can set it up OK on both machines, but nothing is being replicated between nodes at all.
Posted by Nathan on December 26, 2007 at 11:28 PM EST #
Sorry, I should have said "metastat -P" should be "metastat -p"
I kickstarted everything off with a full volume copy "sndradm -m" and then turned autosync on with "sndradm -a on" on both nodes. Now there's a lot of activity.
Posted by Nathan on December 27, 2007 at 04:15 AM EST #
SVM_SIZE=$((((((BMP_SIZE+(16-1) / 16) * 16 ) * 2)))
is missing one ')'
But that's not a problem.
This is just a simple shell scripting advice and not meant badly, just to add my $0.02:
for n in `find /dev/rdsk/c[45]*s1 | grep -n s1 | cut -d ':' -f1 | xargs`
The '| xargs' part is not necessary, for will handle it. Anyways, I see it's quite a strange way to write
for n in `seq 1 $(find /dev/rdsk/c[45]*s1|wc -l)`
I mean mainly semantically, you better understand what it does when you see it.
Anyways, great howto, thank you.
Posted by Juraj Bednar on July 23, 2008 at 04:14 PM EST #
A couple of things that are not mentioned here are that SVM has to have its database replicas initialized, and that SNDR also needs an initialization (dscfgadm).
Another point is that if you can tolerate devoting 64MB per disk to the bitmap volume, you can use ZFS instead of SVM. It seems simpler to have ZFS handle the mirroring of them rather than SVM.
Posted by Maurice Volaski on July 30, 2008 at 07:45 PM EST #
Re: use zfs instead of SVM.
After using ZFS for quite a while, including with volumes, my experience is that ZFS COW would be slower than SVM in this case. SVM is already introduced in the system for root, so the cost of entry is already there for the administrator...
Posted by Wade Stuart on September 04, 2008 at 10:05 AM EST #
What about the new Sun Storage 7000 series in this setup?
Maybe we can put the bitmap volume in Log SSD in zfs?
Posted by kn7 on March 18, 2009 at 08:57 AM EST #
With this example can you expand the zpool with additional mirrored pairs? Do you have to change the SNDR bitmap?
Posted by Abby on March 28, 2009 at 01:59 PM EST #
Yes, with this example you can expand the zpool with additional mirrored pairs. Before the additional pairs are added to the ZFS storage pool, add them to SNDR first using the "sndradm -E ..." (equal enable) command. Since these new vdevs contain uninitialized data, as they are not yet in use by ZFS, there is no need to replicate them to the SNDR secondary node.
In regard to "Do you have to change the SNDR bitmap?", the answer is no. The existing bitmaps are unchanged, but one needs to provision a new SNDR bitmap for each new volume.
Posted by James Dunham on March 28, 2009 at 09:28 PM EST #
I wonder why you didn't use the suggestion from avs administration guide: raw devices must be stored on a disk separate from the disk that contains the data from the replicated volumes.... The bitmap must not be stored on the same disk as replicated volumes
Besides reliability also performance may suffer (they even suggest to place bitmaps on caching array)
Posted by Roman on April 15, 2009 at 10:02 AM EST #