Wednesday March 29, 2006 | cn=Directory Manager All about Directory Server |
Understanding Indexing and the ALLIDs Threshold

Proper index configuration is a very important factor in the performance of the server. If you're missing a key index, then searches that should take microseconds can take hours on large data sets. Conversely, having too many indexes can adversely impact the performance of write operations, particularly adds and deletes, because more information has to be updated in the database. And for both reads and writes, a poor index configuration (e.g., a bad ALLIDs threshold) can also contribute to sub-optimal performance.

In order to understand how to optimally configure indexes in the Directory Server, it is necessary to have at least a basic understanding of how they really work. For the purposes of this discussion, we'll focus on the three most popular index types: presence, equality, and substring. Approximate and extensible matching indexes work in a manner similar to equality indexes. VLV indexing is very different from attribute indexing, but it's also not as relevant to this discussion, so I'll leave it alone for now.

The Directory Server database is essentially a set of B-Trees, which provide ordered mappings from keys to values (much like the TreeMap structures in Java) and which scale extremely well without adversely impacting performance. Given a key, the corresponding value can be retrieved very quickly. In each Directory Server backend, one B-Tree database is used to hold all of the entry data. Each entry is assigned an integer value (the entry ID), and given that entry ID it is a very fast operation to retrieve the corresponding entry data. All of the other databases are indexes, which define an association between index keys and the entry IDs of the entries that correspond to those keys. Most Directory Server indexes are used to map attribute contents to entries, and we'll get into that in more detail later.
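As a rough illustration of the two-step lookup described above (index key to entry IDs, then entry ID to entry data), here is a minimal Python sketch. Plain dictionaries stand in for the on-disk B-Trees, and the names entry_db, sn_index, and lookup, the sample entries, and the bare lowercase keys are all assumptions made for illustration, not the server's actual storage format.

```python
# Hypothetical in-memory stand-ins for the backend databases. A real backend
# keeps these in on-disk B-Trees rather than Python dictionaries.

# Entry database: entry ID -> entry data.
entry_db = {
    1: {"dn": "uid=jsmith,ou=People,dc=example,dc=com", "sn": ["Smith"]},
    2: {"dn": "uid=asmithers,ou=People,dc=example,dc=com", "sn": ["Smithers"]},
    3: {"dn": "uid=bjones,ou=People,dc=example,dc=com", "sn": ["Jones"]},
}

# One index database: index key -> list of entry IDs.
# The key format is simplified here (just the lowercased value).
sn_index = {
    "jones": [3],
    "smith": [1],
    "smithers": [2],
}


def lookup(index, key):
    """Resolve an index key to entries: key -> entry IDs -> entry data."""
    entry_ids = index.get(key, [])
    return [entry_db[entry_id] for entry_id in entry_ids]


matches = lookup(sn_index, "smith")
print([entry["dn"] for entry in matches])
```

The point of the sketch is only that the index never stores entry data itself. It stores entry IDs, and every match costs one more lookup in the entry database.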
But there are some special system indexes that define other kinds of mappings that are very important to the operation of the server. The most notable ones include:
Presence indexes are used to keep track of all entries with a given attribute. For each attribute with a presence index, there is a single key (the plus sign, "+"), whose value is a list of the entry IDs of all entries that contain at least one value for that attribute. When the server is processing a search like "(sn=*)", it first retrieves the "+" key from the sn index database to get a list of all the entry IDs and then goes to id2entry in order to get the entries with those IDs.

Equality indexes are used to keep track of all entries with a given attribute value. Each equality index key is the normalized form of the value prefaced by the equal sign, and the value is a list of all entries that contain that particular attribute value. So when the server is processing a search like "(sn=Smith)", it will retrieve the ID list for the "=smith" key from the sn index and then go to id2entry to retrieve the entries with those IDs.

Substring indexes are used to keep track of all entries with attribute values containing substrings. Each value is normalized and then broken into unique three-character substrings, with the beginning of the value signified by the caret symbol ("^") and the end of the value denoted by the dollar sign ("$"), and each of these substrings is prefaced by the asterisk ("*") to form the index keys. So the value "Smith" will result in five substring keys: "*^sm", "*smi", "*mit", "*ith", and "*th$". Whenever a substring search is received by the server, it may need to retrieve multiple keys and merge the ID lists together, depending on the length of the provided substring.

As you can see, a single attribute value can result in several index keys. If an attribute has a substring index, then it can be particularly expensive to maintain because each value can have multiple index keys.
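The key formats described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and normalize() assumes simple case folding and whitespace collapsing, whereas the real normalization depends on the attribute's matching rule.

```python
def normalize(value):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def presence_keys(values):
    """A presence index has the single key "+" if any value exists."""
    return ["+"] if values else []


def equality_key(value):
    """Equality key: the normalized value prefaced by "="."""
    return "=" + normalize(value)


def substring_keys(value):
    """Substring keys: unique three-character windows over ^value$, each prefaced by "*"."""
    marked = "^" + normalize(value) + "$"
    keys = []
    for start in range(len(marked) - 2):
        key = "*" + marked[start:start + 3]
        if key not in keys:  # keep each substring key only once
            keys.append(key)
    return keys


print(presence_keys(["Smith"]))  # ['+']
print(equality_key("Smith"))     # =smith
print(substring_keys("Smith"))   # ['*^sm', '*smi', '*mit', '*ith', '*th$']
```

A five-character value already produces five substring keys against one equality key, which is why the substring index is the expensive one to maintain on writes.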
The more index keys that are involved in a write operation, the more expensive that operation will be, so ideally substring indexes should be kept to a minimum.

Another big factor in the performance impact of maintaining indexes is the number of entries matching each key. As the number of entries matching a given key increases, the cost of maintaining that particular index key increases. Further, some index keys may match a very large percentage of the entries in the server (e.g., the "=top" key for the objectClass index) and therefore could play a part in degraded performance for writes to a large percentage of the entries.

However, this does not actually happen because of the ALLIDs threshold (as configured by the nsslapd-allidsthreshold attribute in the cn=config,cn=ldbm database,cn=plugins,cn=config entry). This configuration attribute controls the maximum number of entries that will be allowed to match a given index key. After an index key matches more than this number of entries, the value is replaced with a special token that indicates that this particular key will no longer be maintained. Other keys within the same index will still be maintained, so you could get into a case where "(sn=Smith)" is unindexed because the "=smith" key has hit ALLIDs while "(sn=Smithers)" would be indexed because the number of entries matching "=smithers" is below the limit.

When configuring the Directory Server's ALLIDs threshold, there are generally four things to consider:
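To make the per-key ALLIDs behavior described above concrete, here is a small Python sketch. It is a toy model under stated assumptions, not the server's implementation: the Index class, the ALLIDS sentinel, and the deliberately tiny threshold of 3 are all hypothetical.

```python
ALLIDS = object()  # hypothetical stand-in for the special "all IDs" token


class Index:
    """Toy index: key -> list of entry IDs, or ALLIDS once the key is too popular."""

    def __init__(self, allids_threshold):
        self.allids_threshold = allids_threshold
        self.keys = {}

    def add(self, key, entry_id):
        ids = self.keys.setdefault(key, [])
        if ids is ALLIDS:
            return  # this key is no longer maintained, so the write is skipped
        if entry_id not in ids:
            ids.append(entry_id)
        if len(ids) > self.allids_threshold:
            self.keys[key] = ALLIDS  # replace the ID list with the token

    def candidates(self, key):
        """Return the ID list for a key, or None if the key has hit ALLIDs."""
        ids = self.keys.get(key, [])
        return None if ids is ALLIDS else list(ids)


sn = Index(allids_threshold=3)
for entry_id in range(1, 6):
    sn.add("=smith", entry_id)  # five matching entries, over the limit
sn.add("=smithers", 6)          # one matching entry, under the limit

print(sn.candidates("=smith"))     # None
print(sn.candidates("=smithers"))  # [6]
```

A None result models the unindexed case: the index can no longer narrow the search for that key, while other keys in the same index keep working normally.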
As with most things, the rules for tuning ALLIDs are a little fuzzy, since there are always special circumstances that can have an impact on the decision. In most cases during our testing, we don't tune ALLIDs at all. Even if we're looking at hundreds of millions of entries, the default ALLIDs value is fine because none of the operations target multiple entries, so the indexes we'll be using will typically only have one ID per key. This is a very common real-world scenario, especially for the larger directories, since the clients accessing them are frequently Web applications that only need to perform each operation in the context of one user. We're actually more likely to increase ALLIDs for the smaller "enterprise" directories, since they are the servers that need to serve as e-mail address books and telephone directories, which do need to process searches that return multiple entries.

Hopefully, having a basic understanding of the way that Directory Server indexing works will help make the problem clearer so you can make more informed decisions about how best to tune it.

Posted by cn_equals_directory_manager (Mar 29 2006, 10:30:02 AM CST)

Comments:
Posted by Superpat on March 29, 2006 at 03:16 PM CST #
Posted by Neil Wilson on March 29, 2006 at 03:26 PM CST #
Posted by Freeman Fridie on March 30, 2006 at 01:40 PM CST #
Posted by Ashok Nair on April 03, 2006 at 03:57 PM CDT #
If you're sure that the attributes you're trying to index actually exist, then I'd recommend checking the index definitions to make sure that everything looks correct. If you can't find a problem with the configuration, then I'd recommend contacting technical support to have the problem looked at more closely.
I'm not aware of any bugs in this area, so hopefully it's just a configuration problem.
Posted by Neil Wilson on April 03, 2006 at 06:05 PM CDT #