cn=Directory Manager
All about Directory Server

Sunday December 18, 2005

Directory Server disk layout: the "old" way

When you deploy Directory Server in a production environment, you have some decisions to make about the way that you lay out the components on disk. By default, all of the components are placed under the instance root, but when using a traditional filesystem like UFS, QFS, or VxFS, there may be a significant advantage to splitting these components out onto different disk subsystems.

Note that I refer to this as the "old" way because ZFS is going to change a lot of these recommendations, and my next post will cover the "new" way. ZFS has only recently been released through OpenSolaris and Solaris Express, so it may not be considered quite ready for production use (although ZFS has been used internally for Sun home directory servers and other critical components for over a year and a half with no corruption or data loss), which means the "old" way is going to be the most common way for now.

First, let's discuss which Directory Server components are candidates for being split out and the type of disk I/O generally associated with each:
  • The Main Database Files -- These are all the *.db3 files that comprise the Directory Server data and index files. With ideal caching, there will not be any read activity against these files, but if that's not the case, then it will be random access, and even a little activity in this area can saturate the underlying storage subsystem. Writes to these files will occur only during checkpoints, and the DB will try to optimize for sequential writes, but if there have been a lot of changes since the last checkpoint, then this can still be very disk intensive and saturate the underlying storage.

  • The Transaction Logs -- These are all the log.* files, which hold a record of all changes that have been made for use when updating the main database files during a checkpoint. Reads from these files should only occur during checkpoints, and they should only be sequential, but generally they will be held in the filesystem cache so there shouldn't be much actual disk I/O there. Writes to these files are also sequential and the write rate by itself does not generally saturate the disks, but the frequent fsync(3C) operations to ensure that the information is on-disk can be very costly. As I wrote earlier, you can use transaction batching to share these fsync calls among multiple writes, but under heavy write load it can still be pretty expensive.

  • The Changelog DB Files -- The Directory Server changelog (and I'll throw the retro changelog in here too because its access patterns are about the same) keeps a record of all changes that occur for the purpose of replicating them to other systems. Both reads and writes here are sequential, and generally are not enough to saturate the underlying storage.

  • The Server Log Files -- The server log files include the access, error, and audit logs, as well as the referential integrity log file (if this plugin is enabled and configured for asynchronous mode, which it should be). The Directory Server itself doesn't read from these files (except possibly for the referential integrity log file, but it will usually be held in the filesystem cache), but if they are read by an external process (e.g., a log analysis tool), then it will be sequential. Writes to these files are always sequential, and are generally not enough to saturate the underlying storage. The main exception to this is the case in which audit logging is enabled (and it is not by default) because writes to it are not buffered and therefore each write would require an fsync, putting it into the same category as the transaction logs. The same is also true for the error log, but unless you've got some kind of debugging enabled or a significant problem of some kind, writes to the error log will be negligible.

  • The DB Cache Backing Files -- By default, the database cache uses mmap(2) for its memory, which means that changes to the content of the DB cache will ultimately be reflected on disk. The use of mmap makes it possible for multiple processes to access the database concurrently (e.g., allowing you to use db2bak or db2ldif while the server is running), but if there are a lot of changes to the DB cache content, then that can cause a lot of disk thrashing. On Solaris, the way to mitigate this problem is to store the DB cache backing files on a tmpfs volume (e.g., in a directory under /tmp). On other operating systems, you should use a ramdisk or whatever equivalent will allow the files to be held in memory. The DB cache is only used while the Directory Server is running, and it doesn't need to be re-used if the server is stopped and then restarted, so there isn't a problem if these files are lost if the system is rebooted. The only potential penalty is that if the files are lost then they need to be recreated when the server is restarted, and for a large DB cache that can take a noticeable amount of time.

  • Backup Files -- Directory Server backups involve sequential disk I/O, both when reading the current DB and when writing the backup files. Binary backups (those created by db2bak or db2bak.pl) can saturate the underlying storage both for reads and for writes. For LDIF backups, the act of reading the DB will generally be sequential and should only cause problems if there's a lot of disk I/O in the DB to support other operations that may be going on, but in that case it can be disruptive. Writing to the LDIF file will always be sequential and usually not enough to saturate the underlying storage.
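
To make the transaction log point concrete, here is a minimal Python sketch of why per-write fsync calls dominate the cost of sequential appends and how batching spreads that cost out. It is only an illustration of the general idea: the function name append_records and the batch_size parameter are made up for this example, and this is not how the Directory Server implements transaction batching internally.

```python
import os
import tempfile


def append_records(path, records, batch_size=1):
    """Append records to path, calling fsync once per batch_size records.

    Returns the number of fsync calls that were made.
    (Hypothetical helper, for illustration only.)
    """
    fsync_calls = 0
    pending = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for record in records:
            os.write(fd, record)      # sequential append, cheap by itself
            pending += 1
            if pending >= batch_size:
                os.fsync(fd)          # force the data to disk: the costly step
                fsync_calls += 1
                pending = 0
        if pending:
            os.fsync(fd)              # flush the final partial batch
            fsync_calls += 1
    finally:
        os.close(fd)
    return fsync_calls


if __name__ == "__main__":
    records = [b"change %d\n" % i for i in range(100)]
    with tempfile.TemporaryDirectory() as tmp:
        unbatched = append_records(os.path.join(tmp, "log.a"), records, 1)
        batched = append_records(os.path.join(tmp, "log.b"), records, 10)
    print("fsync calls without batching:", unbatched)
    print("fsync calls with batches of 10:", batched)
```

The same 100 records need 100 fsync calls without batching and 10 with batches of ten. The trade-off is that records in an unflushed batch are not yet guaranteed to be on disk.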

Based on all of this, it's not too difficult to put together a pretty simple set of recommendations:
  • You should always use the nsslapd-db-home-directory configuration attribute to relocate the database cache backing files to tmpfs or a memory-backed filesystem. This is a completely free optimization (because the mmap process won't cause the files to consume the memory twice) and there's just no reason to not do it.

  • If you have the ability to use at least two different disk subsystems for Directory Server components, then you should dedicate one of them for use by the main database files with the nsslapd-directory attribute. Note that this needs to be set not only in the "cn=config,cn=ldbm database,cn=plugins,cn=config" entry, but also in the configuration entry for each backend (e.g., "cn=userRoot,cn=ldbm database,cn=plugins,cn=config").

  • If you have the ability to use at least three different disk subsystems for Directory Server components, then you should dedicate the second for use by the transaction log files with the nsslapd-db-logdirectory attribute.

  • If you have audit logging enabled, then it's probably a good idea to try to isolate the server log files onto their own storage. This can be done using the nsslapd-accesslog, nsslapd-errorlog, and nsslapd-auditlog attributes in cn=config. If you've enabled the referential integrity plugin, then its log may be specified using the nsslapd-pluginarg1 attribute of the cn=Referential Integrity Postoperation,cn=plugins,cn=config entry.
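
As a rough sketch of what these recommendations look like in practice, the following Python generates LDIF modify records that could be fed to ldapmodify. The attribute names and entry DNs are the ones mentioned above, except that placing nsslapd-db-home-directory and nsslapd-db-logdirectory in the "cn=config,cn=ldbm database,cn=plugins,cn=config" entry is my assumption here. The paths (/ds-db, /ds-txn, /ds-logs, /tmp/slapd-example), the log file names, and the helper names are hypothetical, so adjust them for your own deployment.

```python
LDBM_CONFIG_DN = "cn=config,cn=ldbm database,cn=plugins,cn=config"


def modify_record(dn, changes):
    """Return one LDIF modify record replacing each (attribute, value) pair."""
    lines = ["dn: " + dn, "changetype: modify"]
    for attr, value in changes:
        lines.append("replace: " + attr)
        lines.append(attr + ": " + value)
        lines.append("-")             # each modification ends with a dash line
    return "\n".join(lines) + "\n"


def layout_ldif(db_dir, txn_dir, log_dir, cache_dir, backends=("userRoot",)):
    """Build LDIF for a split layout. All directory arguments are
    hypothetical example paths, not defaults of any product."""
    records = [
        modify_record(LDBM_CONFIG_DN, [
            ("nsslapd-db-home-directory", cache_dir),  # DB cache backing files
            ("nsslapd-directory", db_dir),             # main database files
            ("nsslapd-db-logdirectory", txn_dir),      # transaction logs
        ]),
    ]
    # nsslapd-directory must also be set in each backend's own entry.
    for name in backends:
        dn = "cn=%s,cn=ldbm database,cn=plugins,cn=config" % name
        records.append(
            modify_record(dn, [("nsslapd-directory", db_dir + "/" + name)]))
    # Server log files live in cn=config; the file names are illustrative.
    records.append(modify_record("cn=config", [
        ("nsslapd-accesslog", log_dir + "/access"),
        ("nsslapd-errorlog", log_dir + "/errors"),
        ("nsslapd-auditlog", log_dir + "/audit"),
    ]))
    return "\n".join(records)         # blank line between records


if __name__ == "__main__":
    print(layout_ldif("/ds-db", "/ds-txn", "/ds-logs", "/tmp/slapd-example"))
```

Generating the LDIF rather than typing it by hand makes it harder to forget the per-backend nsslapd-directory entries. Expect to stop the server and move the existing files before changes like these can take effect, and check the product documentation for the exact procedure.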

Normally, when we are performing a Directory Server benchmark, we use three storage arrays (in the past, it was the StorEdge T3B array, but now we're using the StorEdge 3510 array) for the server components. Each of them is configured with RAID 1+0 and uses UFS with the logging and noatime options. One of these arrays is used for the Directory Server database, one for the transaction logs, and the third for the server logs, changelog, and backups. We also put the DB cache backing files in a subdirectory under /tmp.

Of course, all of this talk about disk layout brings up a lot of questions. Some of the most frequently asked questions in this area are:
  • We've got an EMC storage array, so this doesn't apply to us, right? -- This question comes up pretty frequently. It gets asked less often about storage from other vendors, but EMC customers in general seem to think that their storage is some kind of magical device with the ability to circumvent the laws of physics (maybe it's because of how expensive they are). The fact is that because of the way the server interacts with each of its components, the various types of access in a busy directory can saturate almost any kind of underlying storage. Unless you can guarantee that all writes will be sequential (like ZFS can), you'll probably find that it's better to isolate these components so that I/O targeted at one component doesn't interfere with I/O for another.

  • Why should I use UFS instead of VxFS or QFS or {other filesystem here}? -- If you're on Solaris (at least, Solaris 9 after update 2 or any version of Solaris 10), then UFS with logging is generally faster than VxFS for Directory Server operations (and in fact, for many common access patterns). Sun has spent a lot of time optimizing UFS and in most cases it is now faster than the alternatives. It's also quite a bit faster than QFS, but that's more of a specialty filesystem for clusters and not really ideal for use with Directory Server.

    On Linux, the question of which filesystem to use may also have some relevance. The default filesystem on Red Hat (and often the only one available by default) is EXT3, but our testing has shown that both JFS and XFS are usually quite a bit faster. Although we haven't tested Reiser4, earlier versions of ReiserFS were found to be quite a bit slower than EXT3 in most cases.

  • Can I use NFS? -- No, you can't. In many cases, NFS doesn't support the appropriate type of locking required by our underlying database. There is some question as to whether or not Solaris NFS does offer the appropriate locking, but the bottom line is that it is not supported and if you care about the integrity of your data, then you should stay away from it. The fact that we don't support NFS or other network filesystems for use with the Directory Server is documented in the Directory Server deployment guide. Technically, there shouldn't be any problems backing up to or restoring from NFS, but generally your best bet is to avoid it.

  • What about using the forcedirectio mount option? -- The forcedirectio mount option basically turns off the filesystem cache for any volume on which it is enabled. If you've got caching configured such that everything fits into both the entry cache and the DB cache, then there shouldn't be any actual disk reads for the main database. However, this doesn't apply to the transaction logs or the referential integrity log file. Further, you'll find that performance is degraded after the server is restarted and before the caches have been primed. As a result, it's almost always a good idea to not use the forcedirectio option. If you are benchmarking and want to start with empty filesystem caches, then you can simply unmount and remount all the filesystems, which will invalidate anything from those volumes that may have been in the filesystem cache.
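
If you want to sanity-check a layout against the NFS and tmpfs advice above, one option is to look up each directory in the system mount table. The sketch below assumes the table has whitespace-separated lines of the form "device mount_point fstype ...", which is the general shape of /etc/mnttab on Solaris and /proc/mounts on Linux. The sample table, the paths, and the function names are all hypothetical.

```python
import os

# Filesystem type names treated as NFS (an assumption; extend as needed).
NETWORK_TYPES = {"nfs", "nfs4"}


def filesystem_type(path, mount_table):
    """Return the filesystem type of the longest mount point containing path."""
    path = os.path.normpath(path)
    best_point, best_type = "", None
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue                  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same point shadow an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    return best_type


def layout_problems(component_dirs, cache_dir, mount_table):
    """List directories that sit on NFS, plus a cache dir that is not on tmpfs."""
    problems = []
    for directory in component_dirs:
        if filesystem_type(directory, mount_table) in NETWORK_TYPES:
            problems.append(directory + " is on NFS")
    if filesystem_type(cache_dir, mount_table) != "tmpfs":
        problems.append(cache_dir + " is not on tmpfs")
    return problems


# Hypothetical mount table used only for this example.
SAMPLE = """\
/dev/dsk/c0t0d0s0 / ufs rw,logging 0 0
swap /tmp tmpfs rw 0 0
/dev/dsk/c1t0d0s0 /ds-db ufs rw,logging,noatime 0 0
filer:/export/txn /ds-txn nfs rw 0 0
"""

if __name__ == "__main__":
    for problem in layout_problems(["/ds-db", "/ds-txn"],
                                   "/tmp/slapd-example", SAMPLE):
        print(problem)
```

To check a real system, read the mount table file and pass its contents in place of SAMPLE. The type names your platform reports may differ, so treat NETWORK_TYPES as a starting point.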


Posted by cn_equals_directory_manager ( Dec 18 2005, 03:03:44 PM CST )

