cn=Directory Manager
All about Directory Server

20060525 Thursday May 25, 2006

Replication versus Cluster for DS High Availability

In many environments, the directory service is mission critical. If it's down or not working properly, then it will cause all kinds of problems. It can prevent your customers from logging into their online accounts. It can prevent employees from getting their e-mail or logging into servers and workstations. In some environments, it could even mean that parts won't show up at the factory on time or that the door access control systems won't work. In almost all environments, a directory service outage can mean lost revenue and/or productivity, so it is vital that the service stays up all the time.

There are a lot of aspects to ensuring a highly-available directory service and I won't attempt to cover all of them here. But one topic that keeps coming up is the question of whether to use the Directory Server replication features or to use a high availability cluster or some other external HA solution like storage-level replication. In this post, I'll detail why Directory Server replication is clearly the better choice. It will be presented primarily in the context of Directory Server versus an HA cluster, but many of the arguments may apply to alternate solutions as well. I will also not address the possibility of using replication in which each instance is a part of its own cluster, as that can be a very expensive solution and doesn't really offer much benefit over just using replication by itself.

First, let's take a quick look at Directory Server replication. It is the means by which changes in the data are kept in sync across multiple instances. Whenever a change is made on a master Directory Server instance, it will be propagated to all other servers in the environment. In the current 5.2 patch 4 release, you can have up to four masters capable of accepting changes and any number of read-only replicas that will automatically refer any write attempts to the masters. In the upcoming Directory Server 6.0 release, it will be possible to have any number of masters. Each replica (whether read-only or read-write) has its own copy of the database, and can have its own settings for things like indexing and cache sizing. We have a sophisticated conflict detection and resolution mechanism that helps ensure that concurrent changes to the same entry on different servers are handled properly. The total directory service load can be split across all of the instances. If you're using Directory Proxy Server (which is included as part of DSEE at no additional cost) and one of the Directory Server instances becomes unavailable for some reason, client requests can be transparently redirected to other servers. Replication is a core part of the Directory Server product (i.e., not an add-on that requires additional fees or licensing) that is fully supported and is in heavy use by virtually all of our customers.

On the other hand, HA cluster products attempt to achieve high availability through monitoring and failover. The service is running on one system at any given time, but if a problem occurs that would make it unavailable (e.g., a hardware failure or OS crash) then it will be migrated to another machine in the cluster. When it fails over to another system, it will retain the same network settings like IP address and hostname so there isn't a need for an external load balancer or Directory Proxy Server, and it will use the same storage so there isn't a need to have multiple copies of the data. However, the Directory Server instance can only be running on one node of the cluster at any given time, and there will be an interruption in service for the time required to detect the failure and migrate the Directory Server instance.

So already, we've seen a few differences that make Directory Server replication much more attractive than clustering. In the event of a failure, a "high availability" cluster actually guarantees downtime because it does take time for it to detect the failure and migrate to another node. Since there's a single instance running in the cluster, you can't take advantage of all the hardware concurrently to get improved performance. And since it's using the same storage then you can't have different indexing or other database-level configuration differences.

To examine the issue further, let's look into the different failure modes that can impact a directory service:
  • Server Hardware Failure. In the event that a hardware failure leaves a system offline or in a degraded state, clients in a replicated environment could have instantaneous and transparent failover to another instance. A clustered environment can handle this failure as well, since the service would be migrated to another node, although there would be some downtime.

  • Operating System Crash or Failure. With either replication or clustering, an OS crash is handled in essentially the same way as a hardware failure. If the OS hangs rather than crashes, then clients in a replicated environment would need to have some kind of timeout, after which the request can be redirected to another replica (the Directory Proxy Server can handle this transparently). In a clustered environment, an OS hang would be essentially treated in the same way as an OS crash.

  • Storage Hardware Failure. If the storage hardware failure is catastrophic and results in the loss of all access to the data, then a replicated environment would be able to survive because each Directory Server instance has its own copy of the database on its own storage. In a clustered environment, it would not be possible to handle this because the nodes share the same storage. It is true that a failure like this is pretty unlikely if the storage itself has been designed with a highly-available configuration using RAID and redundant devices, but even in that case a failure of one component could lead to degraded performance for some period of time (e.g., RAID 5 may need to reconstruct data from parity for reads, and there would be a further I/O performance hit while rebuilding the failed component after it has been replaced).

  • Whole Site Failure / Unavailability. If an entire site becomes unavailable for some reason (e.g., catastrophic network failure, extended power outage, natural disaster, etc.), then clients in a replicated environment can access other servers in other sites because replication works well in WAN environments. Clustering is a lot more complicated in WAN environments, and if it is even feasible in this kind of outage (e.g., when combined with replicated storage) it may result in a scenario in which multiple nodes are providing the service concurrently and there can be irreconcilable conflicts if multiple nodes allow data to be updated.

  • Directory Server Crash. If the Directory Server process crashes, then clients in a replicated environment can be transparently redirected to other instances that are still online. In a clustered environment, there will be an outage until the Directory Server is restarted. In most cases it will be faster to restart the server on the same node than to migrate it to a different node, so the cluster doesn't provide much benefit there other than detecting the failure.

  • Directory Server Hang / Freeze. If the Directory Server process hangs, then clients in a replicated environment can be transparently redirected to other instances that are still functional after some timeout is detected. Since the service itself is still available, the hung instance can be left in that state to see if it recovers itself (in most cases, this is simply a case in which all of the worker threads have been tied up processing expensive requests and the server will become responsive once processing on those requests has completed) or to capture diagnostic information like stack traces and/or core files that could help understand the cause of the hang. In a clustered environment, since there is only one instance it must be killed and restarted immediately to minimize the service outage, which prevents the opportunity for the server to recover itself, and which can inhibit the ability to identify the cause of the hang. In many cases, killing and restarting on the same system will be faster than killing and migrating to a different node. If the service does need to be forcefully killed, then it will also take additional time to restart as it may need to verify the integrity of the database.

  • Directory Server Database Corruption. If the Directory Server database becomes corrupt for some reason in a replicated environment, then it will only impact one instance since all instances have their own database and replication is based on sending LDAP changes, not replicating at the database level. A clustered environment is not able to handle this kind of failure because it uses the same storage and database instance when failing over between nodes. It will be necessary to restore from a backup, which could mean that any changes since the last backup could be lost, and in some rare cases corruption can remain hidden for quite a while before it is discovered, and therefore even the backups could have this corruption.

  • Directory Server Unable to Meet Load Performance Demands. If the Directory Server is unable to meet the performance demands of the clients, then this can be addressed in a replicated environment by adding replicas to handle the load. At present, this is really only an option for read operations, since write performance does not increase as replicas are added, but Directory Proxy Server 6 will also include a solution for providing write scalability. In a clustered environment, since only one node can be online at any time, the only way to improve performance would be to buy bigger and/or faster hardware.

  • Administrative Downtime. If a Directory Server instance needs to be made unavailable to perform some administrative action, then it will not have a significant impact in a replicated environment because all other instances can remain online to handle the requests. In a clustered environment, it depends on the nature of the administrative action to be taken. If the action is simply to restart the Directory Server then that can be done on the same machine without the need to fail over. If the system itself needs to be rebooted, then the service can be manually migrated to avoid the need to detect the failure which will add to the downtime. For potentially long-running operations that impact the Directory Server instance itself (e.g., restoring a backup, importing from LDIF, or rebuilding indexes), there is no choice but to keep the instance offline until that operation is complete.
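
To make the client side of this a little more concrete, here is a minimal Python sketch of the "try another replica after a timeout" behavior described in the failure modes above. It is only an illustration under assumptions: the host names are hypothetical, the function name connect_with_failover is made up for this example, and it is not how Directory Proxy Server is implemented. The connect argument is injectable so the logic can be exercised without a network.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Address = Tuple[str, int]


def connect_with_failover(
    replicas: Iterable[Address],
    timeout: float = 2.0,
    connect: Callable = socket.create_connection,
):
    """Return (address, connection) for the first replica that can be reached.

    Each attempt is bounded by `timeout` seconds, so a dead or hung host
    delays the client only briefly before the next replica is tried.
    """
    errors: List[str] = []
    for address in replicas:
        try:
            return address, connect(address, timeout)
        except OSError as exc:  # covers refused connections and timeouts
            errors.append(f"{address[0]}:{address[1]}: {exc}")
    raise ConnectionError("no replica reachable: " + "; ".join(errors))


# Demonstration with a stub connector (assumption: ds1 is down, ds2 is up).
def _stub_connect(address, timeout):
    if address[0] == "ds1.example.com":
        raise OSError("connection refused")
    return f"connection to {address[0]}"


used, conn = connect_with_failover(
    [("ds1.example.com", 389), ("ds2.example.com", 389)],
    connect=_stub_connect,
)
print(used, conn)
```

Note that a successful TCP connection does not prove the server is healthy. A hung Directory Server may still accept connections, so a real client would also want a timeout on each operation.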


I could go on, but hopefully I've made my point. High availability clusters are very good in many environments, but they are not as well-suited for keeping your directory service up and running as Directory Server replication. In addition, we are continuing to make improvements in the area of replication to make it easier to use and monitor, to offer better performance and lower latency, and to include more features, so the advantages of using replication will only continue to grow.

Posted by cn_equals_directory_manager ( May 25 2006, 10:26:20 AM CDT )

20060518 Thursday May 18, 2006

A Quick Introduction to ASN.1 BER

Many network protocols are text-based, which has the advantages of being relatively easy to understand if you examine the network traffic, and in many cases you can even interact with the target server by simply telnetting to it and typing in the appropriate commands. However, there are disadvantages as well, including that they are generally more verbose and less efficient to parse than they need to be. On the other hand, other protocols use a binary encoding that is more compact and more efficient. LDAP falls into this category, and uses the ASN.1 (abstract syntax notation one) mechanism, and more specifically the BER (basic encoding rules) flavor of ASN.1. There are a number of other encoding rules (e.g., DER, PER, CER, etc.) that fall under the ASN.1 umbrella, but since LDAP uses BER that's what I'll focus on in this post. In general, when I talk about ASN.1, I mean BER.

I should first point out that this is a very cursory overview of ASN.1 and doesn't attempt to cover everything. I'm largely focusing on the subset of BER that is actually used by LDAP, and there are some obscure special cases that I'll not get into as well. For a much more in-depth reference, check out the excellent ASN.1 Complete reference book by John Larmouth which is freely available in PDF form (although you do have to fill out a form to be able to download it) or you can buy the book in "dead tree" form. I should also say that this discussion assumes that you have at least a basic understanding of binary and hexadecimal numbering systems. If you aren't familiar with that or need to brush up on it, then I'm sure you'll be able to find plenty of sites to help with that.

BER elements use a TLV structure, where TLV stands for "type, length, and value". That is, each BER element has one or more bytes (in all cases I'm aware of in LDAP, it's only a single byte) that indicate the data type for the element, one or more bytes that indicate the length of the value, and the encoded value itself (where the form of the encoded value depends on the data type) which can be zero or more bytes. I'll expand on each of these in the next sections.

The BER Type
The BER type indicates the data type for the value of the element. There are lots of different data types available, but the most commonly-used (at least in LDAP) include OCTET STRING (which can be either a text string or just some binary data), INTEGER, BOOLEAN, NULL, ENUMERATED (like an integer, but where each value has a special meaning), SEQUENCE (an ordered collection of other elements, kind of like an array), and SET (the same as a sequence, except that the order doesn't matter). There is also a CHOICE element, but most of the time it just means that you can have one of a few different kinds of elements.

As I mentioned above, the BER type is usually only a single byte, and this byte has data encoded in it. The two most significant bits (i.e., the two leftmost bits, since BER always uses big endian/network ordering) are used to indicate the class for the element. The possible class values are:
  • 00 -- This is the universal class. All of the "standard" BER elements have a universal type, so any time you see an element with a universal type you know what kind of data it holds. Examples of universal types include 0x01 (BOOLEAN), 0x02 (INTEGER), 0x04 (OCTET STRING), 0x05 (NULL), 0x0A (ENUMERATED), 0x30 (SEQUENCE), and 0x31 (SET). You'll notice that the binary encodings for all of those type values have the leftmost two bits set to zero.

  • 01 -- This is the application-specific class. This class is used to allow an "application" to define its own types that will be consistent throughout that application. In this context, LDAP is considered an application. For example, any time you see 0x42 in LDAP, you know that it indicates an unbind request protocol op because RFC 2251 section 4.3 states that the unbind request protocol op has a type of "[APPLICATION 2]".

  • 10 -- This is the context-specific class. This class is used to indicate that the type is specific to a particular usage within a given application. The same type may be re-used in different contexts in the same application as long as there is enough other information for you to determine which context is applicable in a given situation. For example, in the context of the credentials in a bind request protocol op, the context-specific type 0x80 is used to hold the bind password, but in the context of an extended operation it would be used to hold the request OID.

  • 11 -- This is the private class. I'm not aware of any cases in which it is used in LDAP.

The next bit (the third from the left) is the primitive/constructed bit. If it is set to zero (i.e., "off"), then the element is considered primitive and therefore the value would be encoded in accordance with the rules of that data type. If it is set to one (i.e., "on"), then it means that the value is constructed from zero or more other ASN.1 elements that are concatenated together in their encoded forms. For example, if you look at the universal SEQUENCE type of 0x30, the binary encoding is "00110000" and the primitive/constructed bit is set to one indicating that the value of the sequence is constructed from zero or more encoded elements.

The final five bits of the BER type byte are used to specify the value of that type, and it's treated as a simple integer value (where "00000" is zero, "00001" is one, "00010" is two, "00011" is three, etc.). The only special value is "11111", which means that the type value is larger than can fit in the five bits allowed so multiple bytes will be required. Since this doesn't happen in LDAP we'll ignore it in this discussion.
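
As a quick illustration, the three parts of a single-byte BER type can be pulled apart with a few bit operations. This is a minimal Python sketch; the names decode_type_byte and BerType are made up for this example and are not taken from SLAMD or any library.

```python
from typing import NamedTuple

# Index is the value of the two most significant bits.
CLASS_NAMES = ("universal", "application", "context-specific", "private")


class BerType(NamedTuple):
    ber_class: str
    constructed: bool
    number: int


def decode_type_byte(b: int) -> BerType:
    """Split a single-byte BER type into class, constructed flag, and number."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a single byte")
    number = b & 0x1F  # low five bits
    if number == 0x1F:
        raise ValueError("multi-byte type values are not handled in this sketch")
    return BerType(CLASS_NAMES[b >> 6], bool(b & 0x20), number)


for byte in (0x04, 0x30, 0x42, 0x80):
    print(f"0x{byte:02X} -> {decode_type_byte(byte)}")
```

For 0x30 this reports the universal class, the constructed bit set, and a type value of 16, which matches the SEQUENCE example above.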

The BER Length
The second component in the TLV structure of a BER element is the length. This specifies the size in bytes of the encoded value. For the most part, this uses a straightforward binary encoding of the integer value (e.g., if the encoded value is five bytes long, then the length would be encoded as 00000101 binary, or 0x05 hex), but if the value is longer than 127 bytes then it will be necessary to use multiple bytes to encode the length. In that case, the first byte has the leftmost bit set to one and the remaining seven bits are used to specify the number of bytes required to encode the full length. For example, if there are 500 bytes in the value (hex 0x01F4), then the encoded length will actually consist of three bytes: 82 01 F4.
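
The single-byte and multi-byte length forms can be sketched in Python as follows. The function names are made up for this example, and the decoder deliberately rejects the indefinite form since LDAP does not use it.

```python
from typing import Tuple


def encode_length(n: int) -> bytes:
    """Encode a BER length using the definite form."""
    if n < 0:
        raise ValueError("length cannot be negative")
    if n <= 127:
        return bytes([n])  # single byte, leftmost bit clear
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body  # leftmost bit set, then byte count


def decode_length(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (length, offset of the first value byte)."""
    first = data[offset]
    if first < 0x80:
        return first, offset + 1
    count = first & 0x7F
    if count == 0:
        raise ValueError("indefinite form is not supported in this sketch")
    end = offset + 1 + count
    if end > len(data):
        raise ValueError("truncated length")
    return int.from_bytes(data[offset + 1:end], "big"), end


print(encode_length(5).hex())    # single-byte form
print(encode_length(500).hex())  # multi-byte form
```

Encoding 500 yields the three bytes 82 01 F4 from the example above, and decoding those bytes gives back 500.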

Note that there is an alternate form for encoding the length called the indefinite form. In this mechanism, the length is not given up front: the length byte is 0x80 and the end of the value is marked by a pair of zero bytes (00 00), which is loosely comparable to the way the chunked encoding in HTTP 1.1 avoids declaring the total size in advance. However, this form is not used in LDAP (as per RFC 2251 section 5.1), so it won't be discussed here any further.

The BER Value
The value is the heart of the BER element because it contains the actual data of the element. Because BER is a binary encoding, the encodings can take advantage of that to represent the data in a compact form. As such, each data type has its own encoded form. These encodings include:
  • NULL -- The NULL element never has a value, and therefore the length is always zero.

  • OCTET STRING -- The value of this element is simply encoded as a concatenation of the raw bytes of the data being represented. For example, to represent the string "Hello", the encoded value would be "48 65 6C 6C 6F". The value can have a length of zero bytes.

  • BOOLEAN -- The value of this element is always a single byte. If all the bits in that byte are set to zero (i.e., 0x00), then the value is "FALSE". If one or more of the bits is set to one, then the value is "TRUE". This means that there are 255 different ways to encode a BOOLEAN value of "TRUE", although in practice it's generally encoded as 0xFF (that is, all the bits are set to one).

  • INTEGER -- The value of this element is encoded as a binary integer in two's complement form (see this article if you're not familiar with this representation). Although BER itself does not place a limit on the magnitude of the values that can be encoded, many software implementations have a cap of four or eight bytes (i.e., 32-bit or 64-bit integer values), and LDAP generally uses a maximum of 4 bytes (allowing you to encode values within the plus or minus 2 billion range). There will always be at least one byte in the value.

  • ENUMERATED -- The value of this element is encoded in exactly the same way as the value of an INTEGER element.

  • SEQUENCE -- The value of this element is simply a concatenation of the encoded BER elements contained in the sequence. For example, if I wanted to encode a sequence with two octet string elements encoding the text "Hello" and "there", then the encoded sequence value would be "04 05 48 65 6C 6C 6F 04 05 74 68 65 72 65". A sequence value can be zero bytes if there are no elements in the sequence.

  • SET -- The value of this element is encoded in exactly the same way as the value of a SEQUENCE element.
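
The value encodings listed above can be sketched as a handful of small Python helpers. This is an illustration only: the function names are made up for this example, and only the universal types discussed in this post are covered.

```python
def encode_length(n: int) -> bytes:
    """Definite-form BER length."""
    if n <= 127:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def element(type_byte: int, value: bytes) -> bytes:
    """Assemble a complete type-length-value element."""
    return bytes([type_byte]) + encode_length(len(value)) + value


def null() -> bytes:
    return element(0x05, b"")  # never has a value


def octet_string(data: bytes) -> bytes:
    return element(0x04, data)  # raw bytes, possibly empty


def boolean(flag: bool) -> bytes:
    return element(0x01, b"\xff" if flag else b"\x00")


def integer(n: int, type_byte: int = 0x02) -> bytes:
    """Shortest two's complement encoding, always at least one byte."""
    magnitude = n if n >= 0 else ~n
    size = magnitude.bit_length() // 8 + 1  # leaves room for the sign bit
    return element(type_byte, n.to_bytes(size, "big", signed=True))


def enumerated(n: int) -> bytes:
    return integer(n, type_byte=0x0A)  # same value encoding as INTEGER


def sequence(*elements: bytes) -> bytes:
    return element(0x30, b"".join(elements))  # concatenated encoded elements


def to_hex(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


print(to_hex(integer(3)))
print(to_hex(sequence(octet_string(b"Hello"), octet_string(b"there"))))
```

The second line printed is the full SEQUENCE element, so the value bytes from the example above appear after its own type and length bytes.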

BER Encoding Examples
Now that we've covered the basics of encoding the type, length, and value components of a BER element, we can put together some examples. The example above for encoding a SEQUENCE value actually had two complete BER elements concatenated together: the OCTET STRING representations of the strings "Hello" and "there". They are:
04 05 48 65 6C 6C 6F
04 05 74 68 65 72 65

In both of these cases, the first byte is the type (0x04, which is the universal primitive OCTET STRING type), and the second is the length (0x05, indicating that there are five bytes in the value). The remaining five bytes are the encoded representations of the strings "Hello" and "there".

Another simple example would be to encode the integer value 3. This time, though, let's use a context-specific type value of 5 rather than the universal INTEGER type. In this case, the encoding would be:
85 01 03
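
To see where the 0x85 comes from, the type byte can be composed from its three parts. This is a minimal Python sketch, and make_type_byte is a name made up for this example.

```python
UNIVERSAL, APPLICATION, CONTEXT_SPECIFIC, PRIVATE = range(4)


def make_type_byte(ber_class: int, constructed: bool, number: int) -> int:
    """Combine class (2 bits), constructed flag (1 bit), and number (5 bits)."""
    if not 0 <= ber_class <= 3:
        raise ValueError("class must be between 0 and 3")
    if not 0 <= number <= 30:
        raise ValueError("only single-byte type values are handled here")
    return (ber_class << 6) | (0x20 if constructed else 0) | number


# Context-specific, primitive, type value 5, followed by length 1 and value 3.
encoded = bytes([make_type_byte(CONTEXT_SPECIFIC, False, 5), 0x01, 0x03])
print(" ".join(f"{b:02X}" for b in encoded))
```

Running this prints the same three bytes shown above.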

Now let's go for a little more involved (and more practical) example. Let's encode an LDAP bind request protocol op as defined in RFC 2251 section 4.2. A simplified ASN.1 definition of this element is as follows:
BindRequest ::= [APPLICATION 0] SEQUENCE {
     version                    INTEGER (1 .. 127),
     name                       OCTET STRING,
     authentication             CHOICE {
          simple                [0] OCTET STRING,
          sasl                  [3] SEQUENCE {
               mechanism        OCTET STRING,
               credentials      OCTET STRING OPTIONAL } } }

In this case, we'll encode a bind request using simple authentication for the user "cn=test" with a password of "password". The complete encoding for this bind request protocol op is:
60 16 02 01 03 04 07 63 6E 3D 74 65 73 74 80 08 70 61 73 73 77 6F 72 64

That's a fairly long string of bytes, but let's break it down to make it simpler:
  • The first byte is 0x60 and it is the BER type for the bind request protocol op. It comes from the "[APPLICATION 0] SEQUENCE" portion of the definition. Because it's application-specific, the class bits will be 01, and because it's a SEQUENCE, it will be constructed. If you put that together with a type value of zero, then the binary representation is "01100000", which is 0x60 hex.

  • The second byte is 0x16, which indicates the length of the bind request sequence. 0x16 hex is 22 decimal, and if you count the number of bytes after the 0x16 then you'll see that there are 22 of them.

  • Next comes "02 01 03", which is a universal INTEGER value of 3. It corresponds to the "version" component of the bind request sequence, and it indicates that this is an LDAPv3 bind request.

  • Next comes "04 07 63 6E 3D 74 65 73 74", which is a universal OCTET STRING containing the text "cn=test". It corresponds to the "name" component of the bind request sequence.

  • The last component is "80 08 70 61 73 73 77 6F 72 64", which is an element with a type of "context-specific primitive 0" and a length of eight bytes. From the definition of the bind request protocol op above, we can see that the context-specific type "[0]" maps to the simple authentication type and that it should be treated as an OCTET STRING, and upon closer examination we can see that those eight bytes in the value do represent the encoded string "password".
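
The breakdown above can be reproduced with a short Python sketch that builds the same bind request protocol op and then walks back through its components. The helper names are made up for this example, and only simple authentication is handled. Note that this covers just the protocol op discussed here, without the LDAP message envelope that surrounds it on the wire.

```python
def encode_length(n: int) -> bytes:
    """Definite-form BER length."""
    if n <= 127:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def element(type_byte: int, value: bytes) -> bytes:
    return bytes([type_byte]) + encode_length(len(value)) + value


def simple_bind_request(version: int, name: str, password: str) -> bytes:
    """Encode [APPLICATION 0] SEQUENCE { version, name, simple [0] }."""
    if not 1 <= version <= 127:
        raise ValueError("version must be between 1 and 127")
    body = (
        element(0x02, bytes([version]))              # universal INTEGER
        + element(0x04, name.encode("utf-8"))        # universal OCTET STRING
        + element(0x80, password.encode("utf-8"))    # context-specific [0]
    )
    return element(0x60, body)


def read_element(data: bytes, offset: int = 0):
    """Return (type byte, value bytes, offset of the next element)."""
    type_byte, first = data[offset], data[offset + 1]
    if first < 0x80:
        length, start = first, offset + 2
    else:
        start = offset + 2 + (first & 0x7F)
        length = int.from_bytes(data[offset + 2:start], "big")
    end = start + length
    if end > len(data):
        raise ValueError("truncated element")
    return type_byte, data[start:end], end


def split_elements(value: bytes):
    """Split a constructed value into its (type, value) components."""
    items, offset = [], 0
    while offset < len(value):
        type_byte, item, offset = read_element(value, offset)
        items.append((type_byte, item))
    return items


def to_hex(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


encoded = simple_bind_request(3, "cn=test", "password")
print(to_hex(encoded))
op_type, op_value, _ = read_element(encoded)
for type_byte, value in split_elements(op_value):
    print(f"type 0x{type_byte:02X}, length {len(value)}: {to_hex(value)}")
```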

I realize that was a pretty significant jump in complexity between my examples. However, hopefully if you can follow the explanation of the encoding of the bind request element, then you're well on your way to being able to debug LDAP protocol communication. For additional help, check out the LDAPDecoder tool provided as part of SLAMD (if you use the "-b" option, it will show you the raw bytes for the communication along with the decoded human-readable representation). You can also check out the code in the com.sun.slamd.asn1 package in the SLAMD source code for a Java implementation of a simple BER encoder/decoder (it's what the LDAPDecoder uses behind the scenes to translate between raw bytes and BER elements).

Posted by cn_equals_directory_manager ( May 18 2006, 08:03:47 AM CDT )

20060508 Monday May 08, 2006

Tips for Developing LDAP Applications

So far in my posts I've tended to focus on Directory Server performance by tuning the server itself. However, the design of the client also plays an important role in the overall performance of the environment. A poorly-designed client can cause problems by consuming a lot of server resources and interfering with its ability to process other requests. This can be avoided by configuring resource limits in the server, but in the process it would prevent the server from fully processing the inefficient requests. Therefore, it is important to ensure that clients are designed properly so that they can perform the appropriate operations as efficiently as possible.

The following are a collection of tips that I've lifted from a presentation I gave last year at a Product Masters Event. They're not all performance-oriented, but hopefully they will be helpful. Some of these probably deserve their own posts, so I'll try to expand on them in the future.
  • Make sure to use LDAPv3 rather than LDAPv2. Some APIs still default to LDAPv2, but LDAPv2 doesn't support features like controls, extended operations, referrals, SASL authentication, and multiple binds on the same connection.

  • Use at least minimal caching to avoid repeating the same queries. If you include a list of attributes to return, then make sure that you include all attributes you may need rather than performing different queries to retrieve the same entry with different attribute lists.

  • Design your application to allow for loose consistency in replication and the possibility that reads and writes may happen on different systems without the application's knowledge. Avoid read-after-write behavior because it can have inconsistent results.

  • Don't treat the Directory Server like a relational database. Avoid splitting data into separate pieces so that you need to retrieve multiple entries to get all the information about a given entity.

  • If you generate search filters, then do so intelligently. If you have compound filters, then use a form like "(&(a=b)(c=d)(e=f))" rather than "(&(&(&(a=b))(c=d))(e=f))" to avoid unnecessary nesting.

  • Unbind connections when they're no longer needed. It's generally best to re-use connections as much as possible, but whenever you're done with a connection make sure it gets closed.

  • Know the standards. There are a lot of them, but RFCs 2251 through 2256 will give a good overview. The specifications define what's legal and what isn't -- just because something happens to work doesn't mean that you should depend on it if it isn't a standard behavior.

  • Learn to capture and interpret LDAP communication over the network. If your application is misbehaving then it will help dramatically if you can see the actual requests that it's sending and the responses that it is receiving. The LDAPDecoder tool provided with SLAMD has been designed specifically for this task, although other utilities like Ethereal or even Solaris snoop may be sufficient.

  • Be directory-independent. Even though we'd prefer that you use Sun's directory, it's always best to design applications that are based completely on standard behavior. Even Sun's directory may change the way that certain features are implemented between releases, and as such you should be wary of proprietary features. If you do use features that are implementation-specific, then try to compartmentalize them into pluggable code that can be easily replaced if necessary.

  • Use controls and extended operations wisely. Realize that not all servers support the same sets of extensions, and take that into account when designing the application. You can use the root DSE to determine whether the server supports a given control or extended operation. In some cases, it may be possible to provide a client-side implementation (e.g., client-side sort instead of server-side sort) as an alternative.

  • Don't litter your code with hard-coded attribute/objectclass names, base DNs, server addresses/ports, usernames/passwords, etc. If you need to change something later, it can be hard to make sure that everything gets updated properly. You should centralize all such values in a constants class or a properties file so that they are simple to change if necessary.

  • Where possible, maintain a set of persistent connections to the server (i.e., connection pools) rather than connecting and disconnecting for each operation. This will be much more efficient, especially when using SSL. In order to avoid leaking connections and duplicating large amounts of code, it may be a good idea to code the various types of operations into the connection pool itself so that those operations will check out a connection, perform the operation and any necessary error handling, and make sure the connection is put back into the pool.

  • If you do use pooled connections, then use the proxied authorization control (if the server supports it) to avoid the need to constantly rebind as a given user in order to perform operations.

  • Design your application to be able to handle the different kinds of failures that may arise: server down, network outage, DS backlogged or unresponsive, DS returning unexpected responses (e.g., unavailable or busy). Don't assume that a lost connection means the server is down -- it could be that the connection was closed due to the idle timeout or some other constraint.

  • In some environments, it may be necessary to allow for the possibility of failing over to a read-only server. In this case, your application should be able to follow referrals, and also to handle cases where none of the referred servers are available. Also, consider adding a read-only mode that could allow your application to continue with at least partial functionality in the event that no writable servers are available.

  • Consider the authentication mechanisms you might need to support. Simple authentication (bind DN and password) is virtually guaranteed to be supported, but may require SSL/StartTLS. In some cases (depending on access control configuration), SASL authentication may be required for some operations. You should never try to retrieve the hashed password and compare it against what the user provided because this can introduce significant security holes in your application and can bypass password policy and account lockout constraints.

  • When binding, make sure that the user actually specified a password, since simple binds that don't contain a password will be treated as anonymous. Consider using the authorization identity controls (RFC 3829) or the LDAP "Who Am I?" extended operation (draft-zeilenga-ldap-authzid) to ensure that the authentication was actually performed as the appropriate user and isn't anonymous.

  • When binding, make sure to check for password policy controls to see if the password has expired or will expire in the near future. Don't design your application to expect hard-coded result codes or error messages to figure out the reason for the bind failure.

  • Work within the access control constraints of the underlying Directory Server. Don't perform all operations as an administrator, as that may open security holes and can make auditing difficult or impossible. Avoid programmatic interaction with the ACIs of the underlying server because they are non-standard and the syntaxes may vary between servers or even between releases of the same server. Keep access controls simple and avoid using too many of them to reduce performance impact and preserve clarity. You may be able to use the GetEffectiveRights control to determine what rights the client has.

  • When designing your application, make sure to document the kinds of operations that may be performed, both individually and in sequences of operations. Consider developing a SLAMD job that can simulate the access patterns of your application to help administrators better understand the impact that changes to the server configuration might have on the applications using it.

  • Talk with the server administrators to discuss any schema or indexing changes that your application may require. Let them know about any controls or extended operations you might want to use to ensure they are supported and that they won't significantly hurt server performance. Also indicate anticipated usage levels to ensure that the administrators can prepare for the increased load that the application may cause.

  • Use groups effectively. Prefer dynamic groups over static groups since they are much easier to maintain and faster to work with. Avoid roles altogether, since they are non-standard functionality and don't really provide much benefit. If you must use static groups, then let the server determine the membership with a filter like "(|(member=userDN)(uniqueMember=userDN))" (where userDN is the DN of the user in question) instead of retrieving the entire member list and trying to make the determination on the client side.
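
As a rough illustration of two of the points above (rejecting empty passwords before a simple bind, and letting the server evaluate static group membership), here is a minimal Python sketch. It doesn't use any LDAP library; it only builds the strings and performs the checks you would do before handing the request to whatever client API you use. The function names are my own invention, and the escaping follows the LDAP search filter string rules (RFC 4515).

```python
def escape_filter_value(value: str) -> str:
    """Escape the characters that must not appear raw in a filter value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    # Replace character by character so an escaped backslash is never re-escaped.
    return "".join(replacements.get(ch, ch) for ch in value)


def membership_filter(user_dn: str) -> str:
    """Build a filter matching static groups that list user_dn as a member."""
    dn = escape_filter_value(user_dn)
    return f"(|(member={dn})(uniqueMember={dn}))"


def can_attempt_simple_bind(bind_dn: str, password: str) -> bool:
    """Return False for binds that the server would treat as anonymous."""
    return bool(bind_dn) and bool(password)
```

An application would search the group container with the result of membership_filter() and only call its bind routine when can_attempt_simple_bind() returns True. Escaping the DN matters because DNs can legitimately contain parentheses or asterisks, and an unescaped value could change the meaning of the filter.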


Posted by cn_equals_directory_manager ( May 08 2006, 08:43:48 AM CDT ) Permalink Comments [5]

20060501 Monday May 01, 2006

Breaking Up (Directory Data) is Hard to Do

In the 4.x version of the Directory Server, all data was organized into a single database. With the 5.0 release, it became possible to create multiple backend databases. Each suffix must be in a separate database, but it is also possible to define sub-suffixes in their own backends. There can be benefits to having sub-suffixes, including the ability to define different kinds of indexes or different replication topologies, but is there any performance benefit to be had? The answer (which is a common one when talking about performance) is "it depends".

Many directory vendors recommend creating lots of branches so that the data can be split, but that's often because their servers don't scale beyond two or three processors and the only way that they can handle large numbers of entries is to break them up across lots of different systems or at least multiple instances on the same system. On the other hand, we're constantly testing our server on 4-way, 8-way, 12-way, 24-way, and 32-way systems with memory sizes up to a couple hundred gigabytes, and we're always looking to scale even higher so there's rarely an absolute need to break up the data just to get the scalability that you need. Of course, we're also working on scaling down to help make it possible to get better performance out of larger data sets on existing hardware, so this will help even further.

I should point out that even if it is possible to run the server with a big monolithic database, that may not always be the best choice. It is certainly the easiest case in terms of keeping all the data together, ensuring the best compatibility for client applications, and giving you the flexibility to use whatever DIT you want. However, really big databases can cause headaches when it comes to things like backup and restore, and even more so for LDIF import and export. We are doing things in the Directory Server itself to help combat this in future releases, and external technologies like ZFS snapshots will also dramatically reduce the pain associated with these kinds of operations, but nevertheless there may be legitimate cases in which splitting the directory contents may be beneficial or even necessary.

Historically, the way to achieve a split like this has been to introduce new hierarchy into the DIT or leverage existing hierarchy. These branches would then be split into separate databases in the same instance or even placed on separate instances with chaining to link them together. With the upcoming Directory Proxy Server 6 release (which is now in beta), a new option will be available in the form of data distribution. Distribution will make it possible to split the contents of a flat DIT across multiple instances on the same or separate machines without the need to introduce any hierarchy. This will be much more palatable to existing applications since the introduction of hierarchy is almost always a bad idea.

There are both benefits and drawbacks to splitting the data. First, let's address the case where the data is split into multiple databases in the same instance:
  • As mentioned above, introducing new hierarchy is almost always a bad idea. It can break client applications that don't deal well with the additional branching, particularly those that use onelevel searches or try to add new entries or otherwise construct DNs. It also creates a maintenance problem for cases in which the split is made based on criteria that can change (e.g., if the data is branched based on geographic location and one of the users moves to a different region).

  • Certain types of operations don't always work as expected if it is necessary to cross database boundaries. For example, the server-side sort operation (and anything that depends on it, like the virtual list view) would cause the entries to be sorted within each database but not between the databases.

  • Separating the data into multiple databases in the same instance will help reduce the length of time required to perform an LDIF import or export, as well as related operations like rebuilding indexes. However, it won't do anything to impact the length of time required to back up and restore the data because all databases (including the transaction logs) must be archived and restored together.

  • In most cases, using multiple databases will not do anything to help improve the performance of read operations. The primary case in which read performance might benefit would be one in which the database is larger than will fit in memory and therefore most reads will need to go to disk. In this case it's likely that the server will be I/O bound, and putting each database on a separate disk subsystem will help increase the total number of operations that can be performed before the I/O saturation point.

  • Using multiple databases does not really help write performance when viewed from the perspective of disk I/O. In most cases, the only time that the actual database files will be written is during a checkpoint (the rest of the time, the updates go to the transaction logs), and all the writes are ordered by database file and written serially so none of the database files are updated in parallel. Even if the databases were separated onto different disk subsystems, they would still be written one at a time so when one was busy the others would be largely idle.

  • Using multiple databases actually can help write performance to an extent because there is a significant amount of lock contention in write operations that happens at the backend level. If two writes are targeted at two different databases then there will be a lot less contention than if they had been in the same database and therefore more of the updates will be able to be performed in parallel. The amount of lock contention was reduced significantly in the 5.2 patch 4 release as compared with earlier versions, and it should be reduced even further in the upcoming 6.0 release, so the benefits to write performance from splitting into multiple databases will be diminishing in the future.
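
To make the server-side sort caveat above a little more concrete, here is a toy Python model of what a client sees when entries are sorted within each database but not between them. This is not the server's code, just an illustration; the function names and the sample surnames are made up. The second function shows one way a client could repair the ordering itself by merging the per-database runs.

```python
import heapq


def sort_within_each_database(databases):
    """Sort each database's entries on its own, then concatenate the runs."""
    result = []
    for entries in databases:
        result.extend(sorted(entries))
    return result


def merge_sorted_runs(databases):
    """Merge the per-database sorted runs into a single global ordering."""
    return list(heapq.merge(*(sorted(entries) for entries in databases)))


# Hypothetical surnames stored in two sub-suffix databases.
east = ["Young", "Adams"]
west = ["Miller", "Baker"]

per_database = sort_within_each_database([east, west])
# per_database == ["Adams", "Young", "Baker", "Miller"]

merged = merge_sorted_runs([east, west])
# merged == ["Adams", "Baker", "Miller", "Young"]
```

The client-side merge only works if the application can tell which database each entry came from, and it does nothing for the virtual list view, which needs the server to know the global position of each entry.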


Most of this remains the same if the data is split across multiple instances, whether on the same or different systems. The backup and restore time does get reduced since each individual server has less data, and if you use the data distribution features coming in Directory Proxy Server 6 then you can avoid adding unnecessary hierarchy. However, there are new problems/benefits that can arise as a result of this.

The first is that in some or all cases, the overall latency (i.e., the length of time that elapses between the client sending the request and receiving the response) may be increased. If all requests are forced to go through a proxy (which will be the case with distribution) or at least some of them need to be chained to another server, then there will be some time required for the additional processing and network communication. Even though the overall throughput (in terms of operations that can be processed in a given amount of time) may be higher, the latency will be as well and it may adversely impact clients that are sensitive to the response time. The increased latency may be even more evident if there are requests that need to be sent to multiple instances. If the associated request doesn't contain anything in it that is specific enough to limit it to just one instance, then that request may need to be broadcast to multiple instances which can increase the total load against the directory environment.
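
The broadcast problem can be sketched with a few lines of Python. This assumes a simple hash-based distribution keyed on the uid attribute, which is purely hypothetical and chosen for illustration; I'm not describing the actual distribution algorithms in Directory Proxy Server 6 here, and the instance names are invented as well.

```python
import zlib

# Hypothetical Directory Server instances sitting behind the proxy.
INSTANCES = ["ds-a", "ds-b", "ds-c"]


def instances_for_request(filter_attrs, instances=INSTANCES, key_attr="uid"):
    """Return the instances that must receive a search request.

    filter_attrs maps the attribute names used in equality filter
    components to the values being searched for.
    """
    key = filter_attrs.get(key_attr)
    if key is None:
        # Nothing in the request identifies a single instance, so the
        # proxy has to send it to every instance and combine the results.
        return list(instances)
    # The same key always hashes to the same instance.
    index = zlib.crc32(key.encode("utf-8")) % len(instances)
    return [instances[index]]
```

A search on uid goes to exactly one instance, while a search on an attribute that isn't the distribution key (mail, for example) is sent to all three, which triples the load generated by that one client request.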

Another issue is that splitting the data among multiple systems means that you need to have more systems running the Directory Server, and potentially others running Directory Proxy Server. This can create additional work for administrators in order to ensure that all systems are kept up to date and running properly. However, this can have some benefits as well because in cases like this it is generally possible to use smaller, cheaper machines to run the Directory Server for each portion of the data when compared with what would be required to run a large monolithic instance. It can also make it feasible to cache the data set across many smaller systems where it isn't an option as a single large data set.

Ultimately, the decision to split the data into multiple chunks isn't one that should be taken lightly. In some cases, it may be the best option (or the only one that is feasible) but most of the time there will be other strategies that will work out better. In general, I wouldn't recommend seriously considering it unless you have a database size at least into the tens of millions of entries, and then it's probably something that we should look at on a case-by-case basis. We work with customers all the time to help determine the best course of action, and if you are considering splitting your data either in the same instance or across multiple instances then it's probably a good idea to have someone take a look at it to see if that is the best choice.

Posted by cn_equals_directory_manager ( May 01 2006, 08:34:58 AM CDT ) Permalink Comments [1]

