cn=Directory Manager
All about Directory Server

Tuesday December 20, 2005

Directory Server disk layout: the "new" way

In my last post, I provided details about the conventional way of laying out Directory Server components on disk. This works well for traditional filesystems like UFS and VxFS, but it is also somewhat wasteful because you're not necessarily getting the most out of the underlying storage. During a checkpoint, the DB disks go to 100% utilization, but they are usually idle the rest of the time. The transaction logs don't generate all that much write traffic, but they can do a lot of expensive fsyncs. It might help to pool all the underlying disks together and spread the load across all of them, but then you would still pay for all the seeks as the disk heads move back and forth from the DB files to the access logs to the transaction logs.

Fortunately, there is a way to get the best of both worlds. You can get the performance benefit of pooling all the disks together and harnessing their collective throughput while at the same time performing only sequential writes so that the disk heads aren't spending all their time going from one place to another. And in fact, you can even further improve I/O performance by reducing the amount of data that actually needs to be written to disk. Then couple that with unprecedented error detection and correction capabilities, and for good measure add in the ability to do instantaneous zero-downtime backups and the ability to restore just as quickly.

If you're thinking that something like this sounds too good to be true, then you're right (at least for now). If you look in Solaris 10, you won't find anything like what I've described. If you look in Linux or AIX or HP-UX or Windows, then you won't find it there either. There is no officially-supported operating system that can offer this capability. However, it's on its way into a future version of Solaris in a fully supported manner, and you can get it today through either OpenSolaris or Solaris Express. I am, of course, talking about ZFS.

All of the things that I said above are absolutely true:
  • ZFS provides many ways in which you can pool all your disks together. You can stripe multiple disks together to make one big volume with no redundancy, or you can mirror disks to make them fully redundant, or you can stripe mirrors or mirror stripes if you want to. And of course, there is RAID-Z, which is kind of like RAID-5 in that it provides striping with parity so that you only lose the capacity of a single disk (no matter how many disks you have in the pool) without giving up redundancy and fault-tolerance.

  • ZFS uses a copy-on-write (COW) approach to I/O, so writes go out sequentially and the disks don't need to seek all over the place when performing them.

  • ZFS offers the ability to use compression in order to even further improve performance in many cases. This may sound counter-intuitive, but most of the time the CPU overhead required to perform the compression is pretty handily outweighed by the performance gains that you get from having to put less data on the disk. And as an added bonus you get increased storage capacity at the same time (and directory data generally compresses very well).

  • ZFS provides end-to-end data integrity using 256-bit message digests with a few options for the digest algorithm. You can even use the 256-bit SHA-2 variant if you're really paranoid. This checksumming mechanism goes beyond what you can get from storage hardware because it can catch errors that happen outside the hardware. See this post for more details.

  • ZFS provides an instantaneous point-in-time snapshot mechanism that allows you to create atomic views of the data that you can instantly roll back to if necessary, or even clone to have multiple divergent branches. And because of its copy-on-write nature, the only extra disk space consumed by a snapshot is the amount taken up by the blocks that have changed after the snapshot was taken (so a snapshot doesn't take any disk space initially, but the amount of space consumed grows over time as changes occur). Because you can create them in an instant and there's very little overhead in having them, you can take lots of snapshots over the course of a day and roll back to any one of them if the need arises. Of course, that assumes that your redundant storage is intact, but if you want to copy a snapshot to another system then you can easily do that as well.
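To put rough numbers on the capacity trade-off between the pooling options in the first bullet, here is a small sketch. It assumes equal-sized disks and ignores filesystem metadata overhead, and the 72 GB disk size is just an illustrative figure, not something from a real deployment.

```shell
#!/bin/sh
# Rough usable capacity for N equal-sized disks, ignoring metadata overhead.
# "stripe" is only a label for this sketch; "mirror" and "raidz" happen to
# match the keywords accepted by zpool create.
usable_gb() {
    layout=$1 disks=$2 size_gb=$3
    case $layout in
        stripe) echo $(( disks * size_gb )) ;;        # no redundancy at all
        mirror) echo "$size_gb" ;;                    # every disk holds a full copy
        raidz)  echo $(( (disks - 1) * size_gb )) ;;  # one disk's worth goes to parity
        *)      echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

# Compare the three layouts for three hypothetical 72 GB disks.
for layout in stripe mirror raidz; do
    printf '%-6s %s GB\n' "$layout" "$(usable_gb "$layout" 3 72)"
done
```

With three disks, RAID-Z gives up a third of the raw space to parity; with more disks in the group the fraction shrinks, which is the point of the "capacity of a single disk" remark.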

I could go on, but there are a lot of people far more qualified than I am who can tell you all about what ZFS has to offer and how it's implemented. So let's get back to the business of how it can help you with your Directory Server deployment.

Last time, I mentioned how you can split up the various Directory Server components onto separate disk subsystems for better performance. Today, I'm going to talk about how you can keep all those components together and combine all the disks that you were using into a single pool for better results and easier administration. As a simple example, let's assume that you took my advice and you have three different disk subsystems that you had previously been using for the DB, transaction logs, and everything else, respectively, and you want to use them in a single ZFS pool. Let's assume that the device IDs for those disks are c0t0d0, c1t0d0, and c2t0d0. You can create a RAID-Z pool covering all of them with the command:
zpool create directory-pool raidz c0t0d0 c1t0d0 c2t0d0

Once that's done, you will have a ZFS pool named "directory-pool" that is conveniently mounted at "/directory-pool". If you want it mounted somewhere else (e.g., "/export/ds"), then you can change that by setting a value for the "mountpoint" option. While we're at it, let's also enable compression and turn off access time tracking. That can be done with the commands:
zfs set mountpoint=/export/ds directory-pool
zfs set compression=on directory-pool
zfs set atime=off directory-pool

And there you go. You now have a fully-functional, high-performance, redundant, checksummed-out-the-wazoo storage pool in which you can install and run your Directory Server. And the only files that you need to relocate for better performance are the DB cache backing files (which should always go on a tmpfs filesystem).
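Before installing anything, it may be worth confirming that the pool is healthy and that the three properties took effect. The sketch below is a dry run by default: it only prints the commands, and the RUN variable and verify_pool function are names made up for this example. The property names are the ones documented in zfs(1M), but check the man page on your build.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Invoke with RUN set to an empty string (RUN= sh script.sh) to really run them
# on a host that has ZFS installed.
RUN=${RUN-echo}

verify_pool() {
    pool=$1
    $RUN zpool status "$pool"                          # health of the pool and each device
    $RUN zfs get mountpoint,compression,atime "$pool"  # the three properties set earlier
    $RUN zfs get compressratio "$pool"                 # compression ratio achieved so far
}

verify_pool directory-pool
```

The compression ratio is only meaningful once some directory data has actually been written to the pool.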

So what about creating and restoring backups? We'll start with snapshots, since they are the fastest and cheapest way to do it, and all the other mechanisms are based on them. To create a snapshot of our ZFS filesystem, we can use the command:
zfs snapshot directory-pool@snap1

This will create a snapshot of the "directory-pool" filesystem named "snap1" consisting of whatever happened to be on the disk at that time. At any time after that, we can roll back to that snapshot using the command:
zfs rollback directory-pool@snap1

I should point out that this is a completely safe backup and recovery mechanism that will work as long as the underlying storage is OK. You can take a snapshot in the middle of your heaviest period of write activity, and if you need to restore it later, you'll end up with a database that has exactly the same contents as it did when you took the snapshot. The DB recovery time (the time that the database spends replaying transaction logs when it is started) will also be much shorter than if you had used db2bak. Because the copy is instantaneous, there is no need to temporarily prevent transaction log removal while it is in progress, so there won't be that many outstanding transactions to replay.
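Since snapshots are cheap enough to take many times a day, a fixed name like "snap1" gets awkward quickly. One possible approach is to put a timestamp in the snapshot name, as in the sketch below. The naming scheme, the RUN dry-run switch, and the function names are all assumptions for illustration. Also be aware that, as far as I know, rolling back to anything other than the most recent snapshot requires "zfs rollback -r", which discards the later snapshots, so check zfs(1M) before relying on it.

```shell
#!/bin/sh
# Dry run by default: prints the zfs commands instead of executing them.
# Invoke with RUN set to an empty string to run them for real.
RUN=${RUN-echo}

# Build a name like directory-pool@20051220-235510 (hypothetical scheme).
snapshot_name() {
    printf '%s@%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

take_snapshot() {
    snap=$(snapshot_name "$1")
    $RUN zfs snapshot "$snap"
}

take_snapshot directory-pool
$RUN zfs list -t snapshot      # show the snapshots that exist so far
```

Names in this form sort chronologically, which makes it easy to pick the one to roll back to.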

If you want to play it safe and back up your data to a remote system (which is always a good idea), then you can easily do this by first taking a snapshot and then using the following command to create a file containing a backup of that filesystem:
zfs backup directory-pool@snap1 > /backup/snap1.backup

Note that the "zfs backup" command will actually write the data to standard output, so you can send it to a file or pipe it to another process or whatever you want to do to get it where it needs to go. In this case, we can assume that "/backup" is an NFS-mounted volume or some other safe, remote repository.

The command given above will create a full backup based on the specified snapshot. You can create incremental backups as well by specifying two snapshots. For example:
zfs backup -i directory-pool@snap1 directory-pool@snap2 > /backup/snap2.incremental
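Because the stream goes to standard output, it can also be piped straight to another machine instead of landing in a file first. The sketch below only builds and prints such a pipeline so you can review it before running it. The host name "backuphost" and the target "backup-pool/ds@snap1" are hypothetical, and it assumes the receiving side uses "zfs restore", the counterpart of "zfs backup" in this release. Later ZFS releases renamed the pair to "zfs send" and "zfs receive", so check zfs(1M) for the names on your build.

```shell
#!/bin/sh
# Print (but do not run) a pipeline that streams a snapshot to another host.
# Arguments: source snapshot, remote host, target snapshot on the remote pool.
remote_backup_cmd() {
    snap=$1 host=$2 target=$3
    printf 'zfs backup %s | ssh %s zfs restore %s\n' "$snap" "$host" "$target"
}

# Hypothetical host and target names, for illustration only.
remote_backup_cmd directory-pool@snap1 backuphost backup-pool/ds@snap1
```

Printing the command rather than executing it keeps the example harmless on a machine without ZFS or without the remote host.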

Note that you can technically use this backup to initialize a ZFS filesystem on another system, so theoretically this could be used as a faster means of performing binary copy initialization for replicas. However, the mechanism that I've described here won't work well as-is because the backup will contain the entire Directory Server installation, including things like the configuration and logs that you don't want to put on the other system. It would be nice to be able to just copy the database files over to the other system, and in fact if we plan ahead we can allow for that when we first set up our filesystem and use a separate ZFS filesystem for the database. This is pretty easy to do, but it will take a few steps to describe and this post is already getting long enough, so I'll save that one for a future post. Or if you can't wait, then you should be able to figure it out for yourself using the zpool(1M) and zfs(1M) man pages and using cpio(1) to transfer the files.
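As a rough starting point for the separate-filesystem idea, the sketch below creates a child filesystem for the database and snapshots only that. It is not the full procedure: the dataset name "directory-pool/db" and the instance path "/export/ds/slapd-example/db" are hypothetical, and it assumes the DB directory is still empty when the new filesystem is mounted over it (so do this before populating the directory, or move the files aside first). It is a dry run by default and only prints the commands.

```shell
#!/bin/sh
# Dry run by default: prints the zfs commands instead of executing them.
# Invoke with RUN set to an empty string to run them for real.
RUN=${RUN-echo}

POOL=directory-pool
DB_FS=$POOL/db                        # hypothetical dataset name
DB_DIR=/export/ds/slapd-example/db    # hypothetical instance DB directory

create_db_filesystem() {
    $RUN zfs create "$DB_FS"                    # child filesystem inside the pool
    $RUN zfs set mountpoint="$DB_DIR" "$DB_FS"  # mount it where the DB files live
}

snapshot_db_only() {
    $RUN zfs snapshot "$DB_FS@$1"               # covers the database files only
}

create_db_filesystem
snapshot_db_only snap1
```

A backup stream taken from "directory-pool/db@snap1" would then leave out the configuration and logs that you don't want on the other system.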

Posted by cn_equals_directory_manager ( Dec 20 2005, 11:55:10 PM CST )

