cn=Directory Manager
All about Directory Server
All | Personal | Sun

20060329 Wednesday March 29, 2006

Understanding Indexing and the ALLIDs Threshold

Proper index configuration is a very important factor in the performance of the server. If you're missing a key index, then searches that should take microseconds can take hours on large data sets. Conversely, having too many indexes can adversely impact the performance of write operations, particularly for adds and deletes, because it requires more information to be updated in the database. And for both reads and writes, having a poor index configuration (e.g., a bad ALLIDs threshold) can also contribute to sub-optimal performance.

In order to understand how to optimally configure indexes in the Directory Server, it is necessary to have at least a basic understanding of how they really work. For the purposes of this discussion, we'll focus on the three most popular index types: presence, equality, and substring. Approximate and extensible matching indexes work in a manner similar to equality indexes. VLV indexing is very different from attribute indexing, but it's also not as relevant to this discussion, so I'll leave it alone for now.

The Directory Server database is essentially a set of B-Trees, which provide ordered mappings between key-value pairs (much like the TreeMap structures in Java) that scale extremely well without adversely impacting performance. Given a key, the corresponding value can be retrieved very quickly. In each Directory Server backend, one B-Tree database is used to hold all of the entry data. Each entry is assigned an integer value (the entry ID), and given that entry ID it is a very fast operation to retrieve the corresponding entry data. All of the other databases are indexes, which are used to define an association between the index keys and the entry IDs for the entries that correspond to those keys.

Most Directory Server indexes are used to map attribute contents to entries, and we'll get into that in more detail later. But there are some special system indexes that define other kinds of mappings that are very important to the operation of the server. The most notable ones include:
  • The entrydn index defines a mapping between the DNs of entries and their entry IDs. The key for each record is the normalized DN of an entry, and the value is the entry ID for that entry. To retrieve an entry with a given DN, the server will normalize that DN, go to the entrydn index to get the corresponding entry ID, and will then go to the id2entry database to retrieve the entry with that entry ID.
  • The parentid index defines a mapping between the entry ID for a non-leaf entry and the entry IDs of all of the immediate children for that entry (that is, the key is the entry ID of the parent entry and the value is the set of entry IDs for all the children). This index may be used when processing onelevel searches.
  • The ancestorid index defines a mapping between the entry ID for a non-leaf entry and the entry IDs of all descendants (not just the immediate children) for that entry. This index may be used when processing subtree searches.

Presence attribute indexes are used to keep track of all entries with a given attribute. For each attribute with a presence index, there is a single key (the plus sign, "+"), whose value is a list of all the entry IDs for all entries that contain at least one value for that attribute. When the server is processing a search like "(sn=*)", it first retrieves the "+" key from the sn index database to get a list of all the entry IDs and then goes to id2entry in order to get the entries with those IDs.

Equality indexes are used to keep track of all entries with a given attribute value. Each equality index key is the normalized form of the value prefaced by the equal sign, and the value is a list of all entries that contain that particular attribute value. So when the server is processing a search like "(sn=Smith)" it will retrieve the ID list for the "=smith" key from the sn index and then will go to id2entry to retrieve the entries for those keys.

Substring indexes are used to keep track of all entries with attribute values containing substrings. Each value is normalized and then broken into unique three-character substrings, with the beginning of the value signified by the carat symbol ("^") and the end of the value denoted by the dollar sign ("$"), and each of these substrings is prefaced by the asterisk ("*") to form the index keys. So the value "Smith" will result in five substring keys: "*^sm", "*smi", "*mit", "*ith", and "*th$". Whenever a substring search is received by the server, it may need to retrieve multiple keys and merge the ID lists together, depending on the length of the provided substring.

As you can see, a single attribute value can result in several index keys. If an attribute has a substring index, then it can be particularly expensive to maintain because each value can have multiple index keys. The more index keys that are involved in a write operation, the more expensive that operation will be, so ideally substring indexes should be kept to a minimum.

Another big factor in the performance impact of maintaining indexes is the number of entries matching each key. As the number of entries matching a given key increases, the cost of maintaining that particular index key increases. Further, some index keys may match a very large percentage of the entries in the server (e.g., the "=top" key for the objectClass index) and therefore could play a part in degraded performance for writes to a large percentage of the entries. However, this does not actually happen because of the ALLIDs threshold (as configured by the nsslapd-allidsthreshold attribute in the cn=config,cn=ldbm database,cn=plugins,cn=config entry). This configuration attribute controls the maximum number of entries that will be allowed to match a given index key. After an index key matches more than this number of entries, the value is replaced with a special token that indicates that this particular key will no longer be maintained. Other keys within the same index will still be maintained, so you could get into a case where "(sn=Smith)" is unindexed because the "=smith" key has hit ALLIDs while "(sn=Smithers)" would be indexed because the number of entries matching "=smithers" is below the limit.

When configuring the Directory Server's ALLIDs threshold, there are generally four things to consider:
  • You should never set the ALLIDs threshold to be larger than the number of entries in the Directory Server. It just doesn't make any sense to do so, and it will degrade performance all the way around. Even if you have a legitimate reason to perform a search like "(objectClass=person)" every so often, it will actually be faster for the server to process that as an unindexed search by cursoring through the id2entry database than if it had to iterate through an ID list and retrieve each entry manually. The last time I tested it, I found that if you needed to perform a search that matched more than around 80% of the entries in the server, it was faster for that search to be unindexed than indexed.
  • How many entries might the server need to return for any given search, particularly for those based on equality filters? If all of your searches are equality searches that match only a single entry, then you don't need to worry about ALLIDs because none of the indexes that you actually care about will hit this limit. On the other hand, if you do frequently perform searches like "(sn=Smith)" that might need to return lots of entries, then it may be necessary to increase the ALLIDs threshold accordingly.
  • Do you do any substring searches? Because all of our substring keys are currently fixed at three characters, any given substring index key is more likely to have hit ALLIDs than an equality key because they will match larger numbers of entries (e.g., "*smi" will match all "Smith" values as well as "Smithers", "Smithson", "Naismith", etc.). In this case, then it may be necessary to increase ALLIDs. However, you should note that this will be less necessary if you enforce a requirement for clients to provide longer substrings in their requests. For example, if you require clients to provide a minimum of five characters in a substring, then there will be at least three substring keys involved (characters 1-3, 2-4, and 3-5), and if at least one of them has not yet hit ALLIDs then the search will be at least partially indexed.
  • Do you have a very hierarchical directory? For example, some of our customers have thousands of branches in their server, with each branch representing one of their own customers or some other kind of organization, and some number of entries below that. In this case, then it may be a good idea to ensure that the ALLIDs threshold is higher than the number of entries below one of those branches. The reason for this is that if a subtree search would otherwise be unindexed (or potentially if it matched a large enough set of entries) then the ancestorid system index could be used to try to narrow down the set of possible matches by constraining it to the set of all entries below the search base. If the key for the search base DN in the ancestorid index has hit ALLIDs, then this won't be possible.

As with most things, the rules for tuning ALLIDs are a little fuzzy since there are always special circumstances that can have an impact in the decision. In most cases during our testing, we don't tune ALLIDs at all. Even if we're looking at hundreds of millions of entries, the default ALLIDs value is fine because all of the operations are such that none of the operations target multiple entries, so the indexes we'll be using will typically only have one ID per key. This is a very common real-world scenario, especially for the larger directories, since the clients accessing it are frequently Web applications that only need to perform each operation in the context of one user. We're actually more likely to increase ALLIDs for the smaller "enterprise" directories since they are the servers that need to serve as e-mail address books and telephone directories which do need to process searches that return multiple entries. Hopefully, having a basic understanding of the way that Directory Server indexing works will help make the problem clearer so you can make more informed decisions about how best to tune it.

Posted by cn_equals_directory_manager ( Mar 29 2006, 10:30:02 AM CST ) Permalink Comments [5]

Comments:

Another excellent entry, Neil - I always enjoy reading this stuff even though I'll never need to configure DS to this level of precision. One buglet (my bold): "When the server is processing a search like "(sn=*)", it first retrieves the "+" key from the cn index".

Posted by Superpat on March 29, 2006 at 03:16 PM CST #

Thanks for pointing out my typo. It should be fixed now. Hopefully it was obvious what I really meant and wasn't too confusing for anyone.

Posted by Neil Wilson on March 29, 2006 at 03:26 PM CST #

Great info!!...cleared up a lot of things for me. How about a short talk on ACI's, recommendations, impact on performance etc; Thanks

Posted by Freeman Fridie on March 30, 2006 at 01:40 PM CST #

Thanks for writing on DS. One question on indexing ... will backend_attr.db3 be created for each attribute we intend to index (using db2index.pl)? The reason I ask is even after adding attribute like dn: cn=givenname,cn=index,cn=userRoot,cn=ldbm database, cn=plugins,cn=config and running db2index.pl there is no .db3 file created for some of the attributes. DOes the DS somehow decide whether to create or not? Thanks Ashok

Posted by Ashok Nair on April 03, 2006 at 03:57 PM CDT #

In general, you will see a database file for each indexed attribute. However, this won't be the case for attributes that don't actually appear in the data. If you have an index defined on attrX but attrX doesn't appear anywhere in your data, then you will likely not see a corresponding database file for it.

If you're sure that the attributes you're trying to index actually exist, then I'd recommend checking the index definitions to make sure that everything looks correct. If you can't find a problem with the configuration, then I'd recommend contacting technical support to have the problem looked at more closely.

I'm not aware of any bugs in this area, so hopefully it's just a configuration problem.

Posted by Neil Wilson on April 03, 2006 at 06:05 PM CDT #

Post a Comment:

Comments are closed for this entry.

Archives
Language
Links
Referrers