cn=Directory Manager
All about Directory Server

Monday May 01, 2006

Breaking Up (Directory Data) is Hard to Do

In the 4.x version of the Directory Server, all data was organized into a single database. With the 5.0 release, it became possible to create multiple backend databases. Each suffix must be in a separate database, but it is also possible to define sub-suffixes in their own backends. There can be benefits to having sub-suffixes, including the ability to define different kinds of indexes or different replication topologies, but is there any performance benefit to be had? The answer (which is a common one when talking about performance) is "it depends".

Many directory vendors recommend creating lots of branches so that the data can be split, but that's often because their servers don't scale beyond two or three processors and the only way that they can handle large numbers of entries is to break them up across lots of different systems or at least multiple instances on the same system. On the other hand, we're constantly testing our server on 4-way, 8-way, 12-way, 24-way, and 32-way systems with memory sizes up to a couple hundred gigabytes, and we're always looking to scale even higher, so there's rarely an absolute need to break up the data just to get the scalability that you need. Of course, we're also working on scaling down to help make it possible to get better performance out of larger data sets on existing hardware, so this will help even further.

I should point out that even if it is possible to run the server with a big monolithic database, that may not always be the best choice. It is certainly the easiest case in terms of keeping all the data together, ensuring the best compatibility for client applications, and giving you the flexibility to use whatever DIT you want. However, really big databases can cause headaches when it comes to things like backup and restore, and even more so for LDIF import and export. We are doing things in the Directory Server itself to help combat this in future releases, and external technologies like ZFS snapshots will also dramatically reduce the pain associated with these kinds of operations, but nevertheless there may be legitimate cases in which splitting the directory contents may be beneficial or even necessary.

Historically, the way to achieve a split like this has been to introduce new hierarchy into the DIT or leverage existing hierarchy. These branches would then be split into separate databases in the same instance or even placed on separate instances with chaining to link them together. With the upcoming Directory Proxy Server 6 release (which is now in beta), a new option will be available in the form of data distribution. Distribution will make it possible to split the contents of a flat DIT across multiple instances on the same or separate machines without the need to introduce any hierarchy. This will be much more palatable to existing applications since the introduction of hierarchy is almost always a bad idea.
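
To make the idea of distribution over a flat DIT a bit more concrete, here is a minimal sketch of one way entries could be assigned to instances without adding hierarchy: hash the leftmost RDN value and use the result to pick an instance. This is only an illustration. The instance names, the helper functions, and the hash-based scheme itself are assumptions made for the example, not a description of how Directory Proxy Server 6 actually distributes data.

```python
import hashlib

# Hypothetical instance names, invented for this example.
INSTANCES = ["ds-a.example.com:389", "ds-b.example.com:389", "ds-c.example.com:389"]


def rdn_value(dn: str) -> str:
    """Return the value of the leftmost RDN, e.g. 'jdoe' for 'uid=jdoe,ou=People,...'.

    Naive parsing: assumes no escaped commas and no multi-valued RDNs.
    """
    first_rdn = dn.split(",", 1)[0]
    return first_rdn.split("=", 1)[1].strip().lower()


def pick_instance(dn: str, instances=INSTANCES) -> str:
    """Map an entry to one instance using a stable hash of its RDN value."""
    digest = hashlib.sha1(rdn_value(dn).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(instances)
    return instances[index]


# All entries share the same flat parent; only the hash decides the instance.
for dn in ("uid=jdoe,ou=People,dc=example,dc=com",
           "uid=asmith,ou=People,dc=example,dc=com"):
    print(dn, "->", pick_instance(dn))
```

Because the mapping depends only on the entry's name, clients keep seeing a flat ou=People branch. One design caveat of a plain modulo scheme like this is that changing the number of instances remaps most entries, so a real deployment would need something more careful.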

There are both benefits and drawbacks to splitting the data. First, let's address the case where the data is split into multiple databases in the same instance:
  • As mentioned above, introducing new hierarchy is almost always a bad idea. It can break client applications that don't deal well with the additional branching, particularly those that use one-level searches or try to add new entries or otherwise construct DNs. It also creates a maintenance problem for cases in which the split is made based on criteria that can change (e.g., if the data is branched based on geographic location and one of the users moves to a different region).

  • Certain types of operations don't always work as expected if it is necessary to cross database boundaries. For example, the server-side sort operation (and anything that depends on it, like the virtual list view) would cause the entries to be sorted within each database but not between the databases.

  • Separating the data into multiple databases in the same instance will help reduce the length of time required to perform an LDIF import or export, as well as related operations like rebuilding indexes. However, it won't do anything to impact the length of time required to back up and restore the data because all databases (including the transaction logs) must be archived and restored together.

  • In most cases, using multiple databases will not do anything to help improve the performance of read operations. The primary case in which read performance might benefit would be one in which the database is larger than will fit in memory and therefore most reads will need to go to disk. In this case it's likely that the server will be I/O bound, and putting each database on a separate disk subsystem will help increase the total number of operations that can be performed before the I/O saturation point.

  • Using multiple databases does not really help write performance when viewed from the perspective of disk I/O. In most cases, the only time that the actual database files will be written is during a checkpoint (the rest of the time, the updates go to the transaction logs), and all the writes are ordered by database file and written serially so none of the database files are updated in parallel. Even if the databases were separated onto different disk subsystems, they would still be written one at a time so when one was busy the others would be largely idle.

  • Using multiple databases actually can help write performance to an extent because there is a significant amount of lock contention in write operations that happens at the backend level. If two writes are targeted at two different databases then there will be a lot less contention than if they had been in the same database and therefore more of the updates will be able to be performed in parallel. The amount of lock contention was reduced significantly in the 5.2 patch 4 release as compared with earlier versions, and it should be reduced even further in the upcoming 6.0 release, so the benefits to write performance from splitting into multiple databases will be diminishing in the future.
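
The server-side sort caveat in the list above is easy to see with a toy example. The sketch below uses two made-up lists to stand in for two databases under the same suffix; it is not server code, just an illustration of the difference between sorting inside each database and sorting across all of them.

```python
import heapq

# Hypothetical contents of two backends; the names are invented for the example.
backend_a = ["cn=Zoe", "cn=Bob", "cn=Mia"]
backend_b = ["cn=Al", "cn=Yan", "cn=Cat"]


def sort_per_backend(*backends):
    """Sort inside each backend, then return the results back to back."""
    result = []
    for entries in backends:
        result.extend(sorted(entries))
    return result


def sort_globally(*backends):
    """Merge the per-backend sorted runs into a single ordered list."""
    return list(heapq.merge(*(sorted(entries) for entries in backends)))


print(sort_per_backend(backend_a, backend_b))
# ['cn=Bob', 'cn=Mia', 'cn=Zoe', 'cn=Al', 'cn=Cat', 'cn=Yan']
print(sort_globally(backend_a, backend_b))
# ['cn=Al', 'cn=Bob', 'cn=Cat', 'cn=Mia', 'cn=Yan', 'cn=Zoe']
```

The first result is what a client sees when the sort stops at database boundaries: each run is ordered, but the list as a whole is not, which is why anything that pages through sorted results (like the virtual list view) behaves unexpectedly.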


Most of this remains the same if the data is split across multiple instances, whether on the same or different systems. The backup and restore time does get reduced since each individual server has less data, and if you use the data distribution features coming in Directory Proxy Server 6 then you can avoid adding unnecessary hierarchy. However, there are new problems/benefits that can arise as a result of this.

The first is that in some or all cases, the overall latency (i.e., the length of time that elapses between the client sending the request and receiving the response) may be increased. If all requests are forced to go through a proxy (which will be the case with distribution) or at least some of them need to be chained to another server, then there will be some time required for the additional processing and network communication. Even though the overall throughput (in terms of operations that can be processed in a given amount of time) may be higher, the latency will be as well and it may adversely impact clients that are sensitive to the response time. The increased latency may be even more evident if there are requests that need to be sent to multiple instances. If the associated request doesn't contain anything in it that is specific enough to limit it to just one instance, then that request may need to be broadcast to multiple instances which can increase the total load against the directory environment.
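
A rough back-of-the-envelope model can show how latency and total load change when a request has to be broadcast. The sketch below assumes the proxy queries the backend instances in parallel and adds a fixed overhead of its own; both of those are simplifying assumptions for illustration, and all of the numbers are made up.

```python
def proxied_latency_ms(backend_latencies_ms, proxy_overhead_ms):
    """Latency seen by the client when the backends are queried in parallel.

    The client waits for the slowest backend plus the proxy's own overhead.
    """
    return proxy_overhead_ms + max(backend_latencies_ms)


def backend_operations(num_client_requests, instances_per_request):
    """Total operations that the directory instances have to process."""
    return num_client_requests * instances_per_request


# Hypothetical numbers: a 1 ms proxy hop and backends answering in 2 to 5 ms.
targeted = proxied_latency_ms([2], proxy_overhead_ms=1)
broadcast = proxied_latency_ms([2, 2, 5, 3], proxy_overhead_ms=1)
print(targeted, broadcast)          # 3 6

# 1000 client requests, each sent to one instance versus all four.
print(backend_operations(1000, 1))  # 1000
print(backend_operations(1000, 4))  # 4000
```

In this model a broadcast request is only as fast as the slowest instance, and every broadcast multiplies the work done by the environment as a whole, even though each individual instance holds less data.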

Another issue is that splitting the data among multiple systems means that you need to have more systems running the Directory Server, and potentially others running Directory Proxy Server. This can create additional work for administrators in order to ensure that all systems are kept up to date and running properly. However, this can have some benefits as well because in cases like this it is generally possible to use smaller, cheaper machines to run the Directory Server for each portion of the data when compared with what would be required to run a large monolithic instance. It can also make it feasible to cache the data set across many smaller systems where it isn't an option as a single large data set.
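
The caching point comes down to simple arithmetic. As a sketch, assuming the data can be split evenly and that each instance has a known amount of memory available for caching (both assumptions, with made-up sizes), the number of instances needed to keep everything in memory is:

```python
import math


def instances_needed(dataset_gb, cache_gb_per_instance):
    """Smallest number of equal shards such that each fits in one instance's cache."""
    return math.ceil(dataset_gb / cache_gb_per_instance)


# Hypothetical sizes: a 120 GB data set and 32 GB of cache per machine.
print(instances_needed(120, 32))  # 4
```

This ignores replication, index overhead, and uneven shard sizes, so treat it as a lower bound rather than a sizing rule.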

Ultimately, the decision to split the data into multiple chunks isn't one that should be taken lightly. In some cases, it may be the best option (or the only one that is feasible) but most of the time there will be other strategies that will work out better. In general, I wouldn't recommend seriously considering it unless you have a database size at least into the tens of millions of entries, and then it's probably something that we should look at on a case-by-case basis. We work with customers all the time to help determine the best course of action, and if you are considering splitting your data either in the same instance or across multiple instances then it's probably a good idea to have someone take a look at it to see if that is the best choice.

Posted by cn_equals_directory_manager (May 01 2006, 08:34:58 AM CDT)

Comments:

Excellent! Thanks. I have a relatively small directory (500k entries) split over 5 backends. I have a hardware replacement coming up this year. I'm going to take the opportunity to combine them back into a single backend. It will simplify management greatly.

Posted by Freeman Fridie on May 02, 2006 at 11:17 AM CDT #
