How 'suite' it is... - Jackie Gleason The "Availability Suite"

Tuesday Jun 12, 2007

A question was recently posted in zfs-discuss@opensolaris.org on the subject of AVS replication vs ZFS send receive for odd sized volume pairs, and does the use of AVS make it all seamless? Yes, the use of Availability Suite makes it all seamless, but only after AVS is initially configured.

Unlike ZFS, which was designed and developed to be very easy to configure, Availability Suite requires explicit and somewhat overly detailed configuration information to be set up, and set up correctly, for it to work seamlessly.

Recently I worked with one of Sun's customers on the configuration of two Sun Fire x4500 servers, a remarkably performing system, being a four-way x64 server with the highest storage density available, 24 TB in 4U of rack space. The customer's desired configuration was simple: two servers in an active-active, high availability configuration, deployed 2000 km apart, with each system acting as the disaster recovery system for the other. Replication needed to be CDP (Continuous Data Protection), running 24 hours a day, 365 days a year, in both directions. Once set up correctly, CDP would work seamlessly and be a lights-out operation.

Each x4500, or Thumper, comes with 48 disks, two of which will be used as the SVM mirrored system disk (can't have a single point of failure), leaving 46 data disks. Since each system will also be the disaster recovery system for the other site, this leaves 23 disks available on each system as data disks. The decisions as to what type of ZFS-provided redundancy to use, the number of volumes in each pool, and whether compression or encryption is enabled are not a concern to Availability Suite, since whatever vdevs are configured, the ZFS volume and file metadata will get replicated too.

For testing out this replicated ZFS on AVS scenario on my Thumper, here are the steps followed:

1). Take one of the 46 disks that will eventually be placed in the ZFS storage pool. Use the ZFS zpool utility to correctly format this disk, an action which will create an EFI labeled disk, with all available blocks in slice 0. Then delete the pool.

# zpool create -f temp c4t2d0; zpool destroy temp

2). Next run the AVS 'dsbitmap' utility to determine the size of an SNDR bitmap to replicate this disk's slice 0, saving the results for later use.

# dsbitmap -r /dev/rdsk/c4t2d0s0 | tee /tmp/vol_size
Remote Mirror bitmap sizing

Data volume (/dev/rdsk/c4t2d0s0) size: 285196221 blocks
Required bitmap volume size:
  Sync replication: 1089 blocks
  Async replication with memory queue: 1089 blocks
  Async replication with disk queue: 9793 blocks
  Async replication with disk queue and 32 bit refcount: 35905 blocks
Remote Mirror bitmap sizing

Selection will be for either synchronous replication, or asynchronous replication with memory queues, since both need the same bitmap size. Other replication types also work with ZFS, but synchronous replication is best if network latency is low.

3). To assure redundancy of the SNDR bitmap, each will be mirrored via SVM, hence we will need to double the number of blocks needed, rounded up to a multiple of 8KB or 16 blocks.

# VOL_SIZE="`cat /tmp/vol_size| grep 'size: [0-9]' | awk '{print $5}'`"
# BMP_SIZE="`cat /tmp/vol_size| grep 'Sync ' | awk '{print $3}'`"
# SVM_SIZE=$(( ((BMP_SIZE + 16 - 1) / 16) * 16 * 2 ))
# ZFS_SIZE=$((VOL_SIZE-SVM_SIZE))
# SVM_OFFS=$(((34+ZFS_SIZE)))
# echo "Original volume size: $VOL_SIZE, Bitmap size: $BMP_SIZE"
# echo "SVM soft partition size: $SVM_SIZE, ZFS vdev size: $ZFS_SIZE"
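
To make the arithmetic in step 3 concrete, the following sketch plugs in the figures from the dsbitmap output above (285196221 blocks for the data volume, 1089 blocks for the sync bitmap). The round_up helper is a hypothetical name introduced here for illustration; it is not part of AVS or SVM.

```shell
#!/bin/sh
# Sketch only: round_up is a hypothetical helper, not an AVS or SVM utility.
# Round $1 up to the next multiple of $2 using integer arithmetic.
round_up() {
    echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

VOL_SIZE=285196221      # data volume size in blocks, from dsbitmap
BMP_SIZE=1089           # sync replication bitmap size in blocks, from dsbitmap

BMP_ROUNDED=$(round_up "$BMP_SIZE" 16)   # 16 blocks = 8KB
SVM_SIZE=$(( BMP_ROUNDED * 2 ))          # doubled because the bitmap is mirrored
ZFS_SIZE=$(( VOL_SIZE - SVM_SIZE ))      # what is left for the ZFS vdev
SVM_OFFS=$(( 34 + ZFS_SIZE ))            # slice 1 starts right after slice 0

echo "bitmap rounded: $BMP_ROUNDED, SVM slice: $SVM_SIZE"
echo "ZFS slice: $ZFS_SIZE, SVM slice offset: $SVM_OFFS"
```

With these inputs the bitmap rounds up from 1089 to 1104 blocks, so each disk gives up 2208 blocks to slice 1 and keeps 285194013 blocks for ZFS.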

5). Use the 'find' utility below, adjusting its first parameter to produce the list of volumes that will be placed into the ZFS storage pool. Carefully examine this list, and adjust the first search parameter and/or use 'egrep -v "disk|disk"', naming one or more disks, to exclude from this list any volumes that are not to be part of this ZFS storage pool configuration.

The resulting list produced by "find ..." is key in reformatting all of the LUNs that will be part of a replicated ZFS storage pool.

# find /dev/rdsk/c[45]*s0
    or
# find /dev/rdsk/c[45]*s0 | egrep -v "c4t2d0s0|c4t3d0s0"

6). Re-use the corrected find command from above as the driver to change the format of all of those volumes.

# find /dev/rdsk/c[45]*s0 | xargs -n1 fmthard -d 0:4:0:34:$ZFS_SIZE
# find /dev/rdsk/c[45]*s0 | xargs -n1 fmthard -d 1:4:0:$SVM_OFFS:$SVM_SIZE
# find /dev/rdsk/c[45]*s0 | xargs -n1 prtvtoc |egrep "^       [01]|partition map"
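
Because fmthard rewrites the partition table, it can help to preview the commands before running them. The following is a minimal dry-run sketch of steps 5 and 6: it filters a device list and only prints the fmthard commands. The plan_fmthard function, the sample device names and the sample sizes are assumptions made for illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints the fmthard commands instead of running them.
# Device names and sizes are sample values, not read from a real system.
ZFS_SIZE=285194013
SVM_SIZE=2208
SVM_OFFS=$(( 34 + ZFS_SIZE ))

plan_fmthard() {
    # Device paths arrive on stdin, one per line.
    # $1 is a non-empty exclude pattern (grep -Ev is the same filter as egrep -v).
    grep -Ev "$1" | while read -r dev; do
        echo "fmthard -d 0:4:0:34:$ZFS_SIZE $dev"
        echo "fmthard -d 1:4:0:$SVM_OFFS:$SVM_SIZE $dev"
    done
}

printf '%s\n' /dev/rdsk/c4t0d0s0 /dev/rdsk/c4t2d0s0 /dev/rdsk/c5t1d0s0 |
    plan_fmthard 'c4t2d0s0'
```

On a real system the printf line would be replaced by the corrected find command, and the printed commands reviewed before being run.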
    
7). Re-use the corrected find command from above, with the additional selection of only even numbered disks (by target number), placing slice 1 of all selected disks into the SVM metadevice d101.

# find /dev/rdsk/c[45]*[24680]d0s1 | xargs -I {} echo 1 {} | xargs metainit d101 `find /dev/rdsk/c[45]*[24680]d0s1 | wc -l`

8). Re-use the corrected find command from above, with the additional selection of only odd numbered disks (by target number), placing slice 1 of all selected disks into the SVM metadevice d102.

# find /dev/rdsk/c[45]*[13579]d0s1 | xargs -I {} echo 1 {} | xargs metainit d102 `find /dev/rdsk/c[45]*[13579]d0s1 | wc -l`

9). Now mirror metadevices d101 and d102 into mirror d100, ignoring the WARNING that both sides of the mirror will not be the same. When the bitmap volumes are created, they will be initialized, at which time both sides of the mirror will be equal.

# metainit d100 -m d101 d102

10). Now from the mirrored SVM storage pool, allocate bitmap volumes out of SVM soft partitions for each SNDR replica.

# OFFSET=1
# for n in `find /dev/rdsk/c[45]*s1 | grep -n s1 | cut -d ':' -f1 | xargs`
do
    metainit d$n -p /dev/md/rdsk/d100 -o $OFFSET -b $BMP_SIZE
    OFFSET=$(((OFFSET + BMP_SIZE + 1)))
done
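
To see how the offsets in this loop advance, here is a dry-run sketch that prints the metainit commands for a handful of bitmap volumes instead of running them. The plan_bitmaps function and the disk count are assumptions for illustration; the bitmap size is the 1089 blocks reported by dsbitmap.

```shell
#!/bin/sh
# Dry-run sketch: prints the soft-partition commands, nothing is sent to SVM.
# N_DISKS is a sample value; BMP_SIZE is the sync bitmap size from dsbitmap.
BMP_SIZE=1089
N_DISKS=4

plan_bitmaps() {
    # $1 = number of bitmap volumes, $2 = bitmap size in blocks
    offset=1
    n=1
    while [ "$n" -le "$1" ]; do
        echo "metainit d$n -p /dev/md/rdsk/d100 -o $offset -b $2"
        offset=$(( offset + $2 + 1 ))   # same +1 step as the loop above
        n=$(( n + 1 ))
    done
}

plan_bitmaps "$N_DISKS" "$BMP_SIZE"
```

Each soft partition advances the offset by 1090 blocks. Assuming the even and odd halves of the mirror hold the same number of disks, every data disk contributes 1104 blocks to d100, so the bitmaps fit.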

11). Repeat steps 1 - 10 on the SNDR remote system (NODE-B)

12). Generate the SNDR enable on NODE-A

# DISK=1
# for ZFS_DISK in `find /dev/rdsk/c[45]*s0`
do
    sndradm -nE NODE-A $ZFS_DISK /dev/md/rdsk/d$DISK NODE-B $ZFS_DISK /dev/md/rdsk/d$DISK ip sync g zfs-pool
    DISK=$(((DISK + 1)))
done
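
The enable loop can be previewed in the same way. This sketch only echoes the sndradm command lines, using the argument order shown in step 12; the plan_sndr_enable function and the two sample device paths are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints one sndradm enable command per data disk.
# Host names and device paths are placeholders in the style of the steps above.
plan_sndr_enable() {
    # $1 = primary host, $2 = secondary host; device paths arrive on stdin.
    disk=1
    while read -r zfs_disk; do
        echo "sndradm -nE $1 $zfs_disk /dev/md/rdsk/d$disk" \
             "$2 $zfs_disk /dev/md/rdsk/d$disk ip sync g zfs-pool"
        disk=$(( disk + 1 ))
    done
}

printf '%s\n' /dev/rdsk/c4t0d0s0 /dev/rdsk/c4t1d0s0 |
    plan_sndr_enable NODE-A NODE-B
```

The n-th data disk is paired with bitmap volume d<n>, so the device list here has to come out in the same order as the one used when the soft partitions were created in step 10.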
 
13).  Repeat step 12 on NODE-B

14). Perform the zpool enables

# find /dev/rdsk/c[45]*s0 | xargs zpool create zfs-pool        

15). Enable SNDR replication, and take a look at what you have done!

# sndradm -g zfs-pool -nu
# sndradm -g zfs-pool -P
# metastat -P
# zpool status zfs-pool


The face of Sun StorageTek Availability Suite has changed quite a bit since June '06, when AVS 4.0 was released, supporting Solaris 10 on SPARC and x64/x86 platforms. In February '07, Availability Suite became an OpenSolaris Project, and now in June '07, Availability Suite reaches yet another new milestone, being a product offering in the Try Sun Products Free for 60 Days program.

The Sun StorageTek Availability Suite try for 60 days program provides a pair of Solaris host-based data services, supporting all Solaris file systems, most Solaris databases, plus Sun and 3rd-party applications. Availability Suite works with and across all Solaris volume managers, SVM, ZFS's zvols, and any block storage devices, be they direct-attached, Fibre Channel, or iSCSI, all independent of the underlying level of redundancy, the physical device type, or the storage array.

Sun StorageTek Availability Suite Point-in-Time Copy software creates an instantly accessible (for both reading and writing) independent, dependent or compact dependent copy (no clones needed) of ANY volume on ANY storage. Creation can be 1-to-1, or 1-to-many, and the many can be any of the supported copy types. The resulting shadow volumes can be used on the local host, or if an independent copy was created on dual-host or SAN accessible storage, it can be used on any other similarly connected host, in both a read and write manner, without losing the ability for a fast-resynchronization, avoiding a full-copy later on.

Sun StorageTek Availability Suite Remote Mirror Copy software enables real-time synchronous or asynchronous data replication to local campus, metro, or remote data centers. Remote Mirror copies can be 1-to-1, 1-to-many or multi-hop as in A-to-B, then B-to-C. As part of its disaster recovery features, it supports both role-reversal (primary and secondary nodes swap roles) and on-demand reverse synchronization, where instantly after invoking a secondary-to-primary copy, the primary volume can be accessed for read and write, with un-replicated blocks fetched on demand.

The Sun StorageTek Availability Suite Point-in-Time Copy & Remote Mirror Copy software are fully integrated, providing features such as point-in-time vs. real-time replication, and data migration between Solaris 8, 9, 10 and OpenSolaris platforms. Availability Suite will also become part of Solaris 11 (Nevada) before the end of Q4FY07 (this month).

Sun StorageTek Availability Suite is fully integrated with Solaris Cluster, Solaris Cluster Geographic Edition and Netra High Availability Suite, as a key component in bringing high availability to these product offerings.


Friday Apr 13, 2007

As a member of the SAN Software Organization, I am one of many engineers that are part of a world class and world wide Solaris Software organization. A key part of this organization is a group of individuals in ERI, the Beijing Engineering Research Institute. For the most part, I have only known these individuals in ERI through email, conference calls and digital pictures, merely an electronic representation of who they actually are. That is, until this past week.

Last Friday I arrived in Beijing, People's Republic of China, with high expectations as to what I wanted to accomplish while being here in Beijing and what I wanted to get in return for my effort. Very fortunately for me, I discovered on the very first day of this trip that my time here in Beijing had nothing to do with what I wanted at all. Instead this trip was all about the team in Beijing, talking, listening, and listening again, engaged in the rewarding process of building peer and personal relationships. In a very unique way, this experience is something that one could have only experienced first hand, immersed in Chinese culture, establishing relationships that instantly become valuable keepsakes.

Upon entering the ERI office for the very first time, I was welcomed as though I had been a long-time friend, returning after being gone for quite some time. Unlike Bruce Tuckman's four stages of team development, as I met and had a chance to talk with each ERI team member, we would instantly leap past those early stages of team development, straight to performing. Even when going out to lunch, the conversation was rich and full of flavor, not unlike every single dish of Chinese cuisine I had over the week, but each conversation being far more valuable than the $4.00 (U.S.) one would often pay for lunch.

In retrospect, over the course of my week in Beijing, each additional conversation would paint a new section of a mural, which although far from complete, grows in grandeur every single day. My personal mural for Beijing, not unlike the mural of the Great Wall of China one sees upon entering the Beijing Airport, or actually visiting Simatai, is much larger than one can see, but as with each section of the Great Wall, each section is unique, much like those I have come to know better during my time in Beijing this week.

Although I had the highest expectations about my trip to Beijing and ERI, without a doubt they were exceeded 100 fold.

Wednesday Mar 28, 2007

AVS & ZFS Demo - Parts 1 & 2

This demonstration covers the steps required for an initial
SNDR configuration of a mirrored ZFS storage pool. Once configured,
both secondary node access and reverse update processing are covered.

Sunday Feb 18, 2007

If you are looking for a single reason to consider adding Availability Suite to a Solaris Express server, it would have to be the ability to perform the equivalent operation of the following dd(1m) command, but in just a fraction of a second, instead of its usual time of minutes, hours, or even days.

# dd if=/dev/rdsk/c1t0d0s1 of=/dev/rdsk/c2t0d0s1

The Point-in-Time Copy software (a.k.a. II, or Instant Image) does this with the following command, and in just a fraction of a second, allowing both volumes to be instantly accessed for both read and write operations.

# iiadm -e dep /dev/rdsk/c1t0d0s1 /dev/rdsk/c2t0d0s1 /dev/rdsk/c2t0d1s1

Yes there is the addition of another disk partition (or other Solaris block device), being the II bitmap volume, but it is a volume that brings about the true capabilities of the Point-in-Time Copy software.

Fast-resynchronization - Ability to update either volume based only on the differences (caused by writes) to either or both volumes. Unlike the dd(1m) utility, where repeating the operation means another full volume copy, with II the operation still allows read or write access to both volumes, and only the minimal set of changed blocks is copied.

Independent Shadow Volume - Although initially dependent on the master volume, an independent shadow volume will in time become independent, allowing one to disable the set if they choose to, while retaining the Point-in-Time data. An independent set allows exporting the shadow volume from the current set, importing it on another Solaris host for both read and write access (backups, off-host process, etc.), and later returning the volume to the original host, joining it back with the master, leaving it in a state as though it had never been exported, including the ability to perform fast-resynchronizations.

Dependent Shadow Volumes - Always dependent on the master volume, a dependent shadow volume has no background copy operation, always performing COW (copy-on-write) functionality. Due to the fact that only a portion of the shadow volume may be used (caused by writes), one can configure a compact dependent shadow volume, say at 25% of the size of the original master volume. If one had to configure four 1TB volumes, expecting no more than 25% change over the life of any Point-in-Time, an additional 1TB volume is all that is needed, something not possible with RAID-1 mirroring. If there was a chance that any or all of the volumes could exceed the 25% change, there is the means to associate a shared overflow volume, just in case.
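
The sizing claim for compact dependent shadows can be checked with a few lines of shell arithmetic. This is a sketch only: compact_shadow_gb is a hypothetical helper, sizes are in GB with 1TB taken as 1024GB, and the 25% change rate is the planning assumption from the paragraph above rather than a measured value.

```shell
#!/bin/sh
# Sketch: space needed for compact dependent shadow volumes, in GB.
# compact_shadow_gb is a hypothetical helper, not an iiadm feature.
compact_shadow_gb() {
    # $1 = master volume size in GB, $2 = expected change in percent
    echo $(( ($1 * $2 + 99) / 100 ))    # round up to a whole GB
}

MASTERS="1024 1024 1024 1024"   # four 1TB master volumes
total=0
for size in $MASTERS; do
    total=$(( total + $(compact_shadow_gb "$size" 25) ))
done
echo "compact shadows: ${total} GB, full-size copies: $(( 4 * 1024 )) GB"
```

Four masters at 25% each come to 1024GB in total, the single additional 1TB volume mentioned above, compared with 4096GB for four full-size copies.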

There are no limits, other than Solaris limits, as to the number of shadow volumes that can be configured, including multiple shadow volumes of the same master volume. Each Point-in-Time Copy operates by itself, unless one or more sets are placed in an I/O consistency group, something needed for all the volumes in a 'named' ZFS storage pool, QFS filesystems, or other data services where multiple LUNs act as one.

Besides supporting all Solaris filesystems, databases and applications, the volumes configured can be any Solaris block device (.../rdsk/...), including SVM, LUNs, lofi, zvols, independent of RAID types.

Last week I was asked to talk to others at work about the effort it took to bring the Availability Suite product set to open source. As a long time development engineer, I talked about the planning process, deliverables, the time and resources it took to complete the effort, things that were easy, some that were hard.

Unfortunately, I forgot to mention the most important reason that the effort was so successful, and that was the combined efforts of all the people, that at one time or another contributed to the Availability Suite product set.

Amy O., Andrei D., Bill B., Bill L., Blaise C., Butch M., Chhandomay M., Chris J., Colleen G., Dan M., David V., Deb C., Denise L., Dennis V., Ed P., Eric R., Frank F., George Q., Holly G., Howard N., Jay P., Jeff C., Jeff P., Jeff W., Jesse B., Jillian D., Jim D., Jim G., Jim K., Joanne D., John C., John T., Jon D., Jon F., Karen M., Kate T., Keith B., Ken D., Laurie C., Laurie T., Lee G., Marc R., Marcus Y., Mark B., Mark C., Mark M., Matt C., Melinda N., Melora G., Michelle H., Nancy G., Nancy Q., Patrick A., Paul H., Paul H., Peter G., Peter H., Peter W., Phil N., Phil P., Phoebe C., Pryia V., Rich T., Roberta P., Rowan D., Russ L., Salim A., Scott T., Sean C., Sherri S., Simon C., Sridhar R., Stephen S., Steve C., Sue S., Sue D., Toni R., Umang K., Vahid K., Vilas K., Vince E., Yuantai D.

Thank you all!

Friday Feb 02, 2007

If you have it and you know you have it, then you
have it. If you have it and don't know you have it, you don't have it.
If you don't have it but you think you have it, then you have it.


Welcome to OpenSolaris, and "Day 2" of Availability Suite.

So you have it, that is the TarBall which contains the source code of the OpenSolaris release of Availability Suite. What you will find when expanding this TarBall is a top-level directory with the following files, plus the single directory 'src'.

 Filename                 Description
 README                   This 'README' file
 opensolaris.license.txt  The OpenSolaris License - CDDL
 openAVS_build            The Availability Suite (AVS) build script
 openAVS.release          Environmental file for 'release' builds
 openAVS.debug            Environmental file for 'debug' builds
 openAVS.lint             Environmental file for 'lint' builds
 src                      The top-level source directory for all software

The openAVS_build script is invoked from the top level directory, specifying one of the pre-existing environmental files, (of which three are provided), or as a CLI, passing one or more of the following options via the command line:

Usage:
    openAVS_build -E <environment file>
    openAVS_build -hilpvDN

      -E <environment file>    Path to environment file
      -h                       Prints the Usage
      -i                       Incremental build [no clobber]
      -l                       Enable lint checking
      -p                       Build packages
      -v                       Verbose processing
      -D                       Build Debug binaries
      -N                       Build Non Debug binaries

Upon completion of any build, the top level directory 'log' will contain a dated logfile containing the results of the build attempted. Upon successful completion of any build, the top level directory 'packages' will contain one or more sub-directories consisting of a set of eight (8) packages per build type, where the build type will differ between SPARC & i386, and also between nightly-debug and nightly-nondebug.

If you have it and you know you have it, then you
have it. If you have it and don't know you have it, you don't have it.
If you don't have it but you think you have it, then you have it.


Welcome to OpenSolaris, and "Day 1" of Availability Suite.

So you have it, that is the TarBall which contains binary images of the OpenSolaris release of Availability Suite. What you will find when expanding this TarBall is a set of parallel directories that contain the SPARC and x64/x86 versions of Availability Suite, a collection consisting of the following eight packages per directory.


  • SUNWscmr  - Cache Management (root)
  • SUNWscmu  - Cache Management (usr)
  • SUNWspsvr - Storage Volume Driver (root)
  • SUNWspsvu - Storage Volume Driver (usr)
  • SUNWiir   - Point-In-Time Copy (root)
  • SUNWiiu   - Point-In-Time Copy (usr)
  • SUNWrdcr  - Remote Mirror (root)
  • SUNWrdcu  - Remote Mirror (usr)

From a high-level point of view, the first four packages provide the I/O filter driver framework, whereas the remaining four packages, the Point-In-Time Copy and Remote Mirror software, provide the two data services of Availability Suite.

The package installation order is as shown above, and also as follows:

pkgadd -d . SUNWscmr SUNWscmu SUNWspsvr SUNWspsvu SUNWiir SUNWiiu SUNWrdcr SUNWrdcu


Once the packages have been added successfully, invoking the utility 'dscfgadm' will perform a one-time persistence database creation, and then ask if the data services should be started, to which one will likely answer yes.

There is no need for a Solaris reboot.

For a quick introduction to the two data services try the following, but eventually you may want to read the Availability Suite documentation.

  • # sndradm -h
    # man sndradm
  • # iiadm -h
    # man iiadm

Package removal order is just the opposite of the add order, and is as follows:

pkgrm SUNWrdcu SUNWrdcr SUNWiiu SUNWiir SUNWspsvu SUNWspsvr SUNWscmu SUNWscmr

Again, there is no need for a Solaris reboot.
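
Since the removal order is simply the install order reversed, it can be derived rather than typed by hand. The following sketch does that with a small hypothetical helper, reverse_words, and only prints the two command lines.

```shell
#!/bin/sh
# Sketch: derive the pkgrm order by reversing the pkgadd order.
# reverse_words is a hypothetical helper; nothing is installed or removed.
INSTALL_ORDER="SUNWscmr SUNWscmu SUNWspsvr SUNWspsvu SUNWiir SUNWiiu SUNWrdcr SUNWrdcu"

reverse_words() {
    reversed=""
    for word in "$@"; do
        reversed="$word${reversed:+ $reversed}"
    done
    echo "$reversed"
}

# Word splitting of $INSTALL_ORDER is intentional here.
REMOVE_ORDER=$(reverse_words $INSTALL_ORDER)
echo "pkgadd -d . $INSTALL_ORDER"
echo "pkgrm $REMOVE_ORDER"
```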

Saturday Jan 27, 2007

Since Availability Suite has the means to replicate data between two or more computers, it is often presented with the situation of operating in a mixed version environment, and more recently with Solaris 10 and OpenSolaris, a mixed architecture environment.

Availability Suite supports SNDR replication between all shipping versions of the product, being AVS 3.2 for Solaris 8 and 9, AVS 4.0 for Solaris 10, and now (or very soon to be now), AVS 4.1 for OpenSolaris. This capability offers a vast array of replication options, options available without being forced to upgrade to a new version of the product, or new version of Solaris.

With the support of both SPARC and x64/x86 architectures in Availability Suite 4.x, it presents the option of replicating data between the two different architectures supported by Solaris. At the onset this may seem like a good thing, but as it turns out only one Solaris filesystem is endian neutral at the block level, and that filesystem is ZFS!

As quoted from the paper “ZFS: the last word in file systems”, available on OpenSolaris web site at: http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf

“ZFS is supported on both SPARC and x86 platforms. More important, ZFS is endian-neutral. You can easily move disks from a SPARC server to an x86 server. Neither architecture pays a byte-swapping tax due to Sun's patent-pending "adaptive endian-ness" technology, which is unique to ZFS.”

This ability to use Availability Suite to replicate ZFS filesystems between different Solaris architectures is yet another example of the sum of the parts (AVS & ZFS) being greater than the whole!

As ZFS becomes the filesystem of choice, having the means to replicate it with Availability Suite, without regard to the architecture of the system on the target host, will be a key differentiator of OpenSolaris, as the storage platform of choice.

The Availability Suite product set, or AVS to those who work only in acronyms, has been a product set at Sun Microsystems Inc. since the time of Solaris 2.6. Although AVS has supported host-based replication (using SNDR) and snapshots (using II) for UFS, QFS and other Solaris filesystems, the existence of both ZFS and AVS in OpenSolaris opens new doors of opportunity for both of these technologies.

Although ZFS brings the features and functionality of a filesystem and a volume manager together, and offers its own implementations of snapshot and data replication, these latter two ZFS features do not preclude the use of Availability Suite with ZFS. Both products offer unique snapshot and replication features and functionality that work well together, not against each other.

Unit of Replication

As with any Solaris supported filesystem, AVS provides the means to establish both snapshot and remote volume replication, and with ZFS this is no different. With ZFS being both a filesystem and volume manager, there are a couple of things that one must be aware of when using AVS and ZFS together.

The component of replication or snapshot is a 'named' ZFS storage pool, not a single ZFS filesystem within a pool. Since ZFS filesystems share all the space associated with a single storage pool, all the storage must be configured in the same I/O consistency group, a feature of both SNDR and II.

Although a ZFS storage pool consisting of one volume does not require an I/O consistency group, using one does not impact AVS. This is a good practice to establish, especially if the ZFS storage pool is likely to grow over time. Prior to adding a new volume to an existing ZFS storage pool, first configure it within AVS, adding it to the I/O consistency group.

Initial Synchronization (Resilvering)

When creating a ZFS filesystem that will be replicated with SNDR, it is best to enable the SNDR volumes first, then the ZFS volumes second. Because AVS is filesystem agnostic, when enabling volumes for remote replication, the software does not know if the volume(s) being configured are nearly empty (a new filesystem), partially full or nearly full. Given no further information, SNDR is required to replicate the entire volume, or volumes.

Finding a means to avoid the unneeded replication makes sense, as why would one want to replicate GBs or TBs, if portions of the volumes hold nothing but uninitialized blocks?

An SNDR enable option (-E), indicates that the local and remote volumes are equal, and thus forgoes the initial volume synchronization process. The volumes are considered equal because uninitialized data = uninitialized data. When the subsequent 'zpool create ... ' and 'zfs create ... ' commands are issued, the write I/Os issued by ZFS on the local node, will cause the create operations to get replicated to the remote node.

If there are existing ZFS filesystems within ZFS storage pools that contain valid data, one would assume that SNDR would have no choice but to replicate the entire ZFS storage pool. The good news is that a feature of ZFS, zpool replacement, makes the sum of the parts (AVS & ZFS) greater than the whole!

To utilize this ZFS feature, one would set up a duplicate set of uninitialized volumes on the local node, sized to match those in the pre-existing ZFS storage pool one wishes to replicate, then enable SNDR using the (-E) equal option. By issuing 'zpool replace ... ' specifying these uninitialized volumes, the internal ZFS processing of zpool replacement moves only the valid data from the old volumes to the new volumes (a ZFS resilvering process), allowing SNDR to detect and replicate only the valid data to the remote host. When ZFS is done with its zpool replace processing, the old volumes are automatically removed from the ZFS storage pool and can be reclaimed.
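
As a rough illustration of this sequence, the sketch below prints the two commands for one volume: the SNDR enable with -E on the new, uninitialized volume, followed by the zpool replace. It is a dry run under stated assumptions: plan_replace is a hypothetical helper, every host, pool, device and bitmap name is a placeholder, and the sndradm argument order is copied from the enable command used elsewhere in this blog.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for "enable with -E, then zpool replace".
# All host, pool, device and bitmap names are hypothetical placeholders.
plan_replace() {
    # $1 = pool name (also used as the I/O consistency group)
    # $2 = old volume, $3 = new uninitialized volume, $4 = bitmap volume
    echo "sndradm -nE NODE-A $3 $4 NODE-B $3 $4 ip sync g $1"
    echo "zpool replace $1 $2 $3"
}

plan_replace zfs-pool /dev/rdsk/c4t0d0s0 /dev/rdsk/c5t0d0s0 /dev/md/rdsk/d1
```

The order matters: the SNDR set has to exist before the zpool replace starts, so that the writes ZFS issues while resilvering are the ones SNDR tracks and replicates.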

For ZFS file systems that already consume a fair amount of the ZFS storage pool, the need to perform any synchronization processing can still be both time and resource consuming. Fortunately there is one option left that also uses the SNDR (-E) option.

Make a tape or other media backup of the physical ZFS storage pool, a pool that has been placed in a 'zpool export ...' state. (Note: II can be used to eliminate this tape backup window, but that is a subject for another time.) Prior to performing the 'zpool import ...', enable the local and remote SNDR volumes, again using the SNDR (-E) option. Don't forget to also enable I/O consistency groups if needed. The ZFS filesystem(s) can now be used while the tape backup is sent by overnight carrier to the remote site and the data is restored onto the remote volumes. During this unspecified period of time, SNDR tracks all changes being made by ZFS on the primary site. Once the backup has been restored (and verified) at the remote site, SNDR can be placed in replicating mode, and only those changes that happened while the backup tape was in transit need to be replicated by SNDR. This by far cuts down on the amount of data that needs to be replicated.

These aforementioned issues with initial synchronization using SNDR do not apply equally to II. First, there are no remote replication costs, as all the I/O is done locally, and for all II shadow volume types except independent shadows, there is no initial resilvering process needed. Even when using II's independent shadows, all II snapshot functionality is instant (the II name is for Instant Image), meaning that both the master and shadow volume are instantly accessible while the background synchronization is active. The impact of performing a full ZFS storage pool copy is somewhat hidden when using II independent shadow volumes, and non-existent for dependent shadow volumes.

This is just the tip of the AVS & ZFS iceberg, as there are many features of Availability Suite, ZFS and other Solaris data path technologies, that will collectively make OpenSolaris the storage platform of choice.

Sunday Sep 03, 2006

Host-based vs. controller-based data services