cn=Directory Manager
All about Directory Server

Friday April 28, 2006

SLAMD 2.0.0-alpha1 is now available

I'm pleased to announce that the SLAMD Distributed Load Generation Engine version 2.0.0-alpha1 is now available for download at http://www.SLAMD.com/ and at https://SLAMD2.dev.java.net/. It's currently characterized as an alpha release because it is not quite feature complete, although our testing has shown it to be quite usable and stable. In fact, the primary reason that I wanted to make this build available now is that it will probably be several months before the official 2.0.0 build is done and there is lots of good stuff already done that people might want to use before I get around to adding the last few features and significantly re-vamping the documentation.

A fairly comprehensive list of all the changes between this release and the previous SLAMD 1.8.2 version is available in the Release Notes, but the major changes include:
  • SLAMD now uses an embedded database (the Berkeley DB Java Edition) for storing all configuration and job information, so it no longer needs an external configuration directory for this purpose. This makes it much simpler to set up and manage a SLAMD server instance.

  • SLAMD now provides a new job groups feature, which makes it possible to define a set of jobs that form a common workload and schedule them as a single entity. I have previously discussed job groups, but a lot of additional work has gone into it since then.

  • The SLAMD server and clients have been updated so that they are now simpler to use. Where possible, the commands will attempt to automatically determine the location of the Java installation, and the command line client and resource monitor client have been updated so that the process of configuring them is now the same on Windows as it is on UNIX-based systems.

  • Lots of bugs have been fixed, and lots of minor improvements and tweaks have been added to generally improve the overall user experience.

  • The SLAMD source code base has been moved from a CVS repository into a Subversion repository, and therefore the project on java.net has been moved from https://SLAMD.dev.java.net/ to https://SLAMD2.dev.java.net. Instructions for checking out the source are available at http://www.SLAMD.com/subversion.shtml.

  • The source code build process is now based on Ant rather than shell scripts. Although this is not a tremendous benefit for UNIX-based systems, it does make the code much easier to build on Windows. Further, there is now a single build that works on both Windows and UNIX systems, as opposed to a .zip build for Windows and a .tar.gz build for UNIX.

As mentioned above, this is classified as an alpha release because it is not feature complete. In particular, I would like to re-write the client code to use a new protocol for communicating with the SLAMD server. This will add a lot more flexibility, like the ability to have clients automatically detect and download new job class versions on the server, or to be able to allow the clients to send additional information back to the server. There is also a significant amount of documentation work that still needs to be done, although the Quick Start Guide is up to date if you're looking for a simple set of instructions for getting started.

Posted by cn_equals_directory_manager (Apr 28 2006, 09:40:49 AM CDT)

Thursday April 20, 2006

Understanding Directory Server DB Growth

The size of the Directory Server database is very important for a number of reasons. The on-disk size of the database plays a very large role in cache sizing determinations. The size of the database is directly related to the length of time required to back up and restore the server data. The size of the database also dictates how much disk space you need to have, and this gets multiplied by the number of replicas and backups that you maintain.

It's important to note that the initial size of the database after performing an LDIF import may be somewhat misleading because the database will likely not be that size for very long. Rather, it will probably grow over time, and that growth will probably be notably faster just after an import than at other times. There are a number of reasons for this, including:
  • The first time that an entry is updated after it has been created, it will be updated to include certain operational attributes like modifiersName and modifyTimestamp.

  • In a replicated environment, whenever an entry is updated a certain amount of replication metadata may need to be added to the entry. For example, if an attribute is modified then the server will need to store the previous value(s) for conflict resolution purposes. This metadata may need to be maintained for a period of time at least equal to the tombstone purge delay.

  • Whenever an entry is deleted in a replicated environment, that entry won't actually be removed from the database right away but will be temporarily converted into a special type of "hidden" entry called a tombstone that may be needed for the purpose of conflict resolution. It won't actually be removed for a period of time at least equal to the tombstone purge delay.

  • When the database is initially imported, the entries are in a relatively compact form, potentially with multiple entries on a single page. If an entry is updated and the new entry won't fit in the space where it previously resided, then it will need to be moved to a new page, leaving a hole where the entry used to be. This hole might be filled in at least partially later if something else needs to be stored that will fit into it, but it's likely that there will still be some space lost because the new item probably won't be exactly the same size as the entry that was there originally.

  • As changes occur, they will need to be written into the replication changelog and/or retro changelog, and these are part of the database. This growth may be bounded by placing size/time limits on the changelog contents.

Ultimately, if you bound the size of the changelog and don't add any new entries then the size of the database will eventually stabilize and become relatively constant, although the stable size may be significantly higher than the initial size. This growth may be lessened by reducing the tombstone and changelog purge delays, since both of those play a role in how much metadata buildup there can be. If you have very large entries, then increasing the database page size may also help significantly because it can avoid the use of overflow pages.

Another very important point to remember is that the size of the database will never shrink. Whenever information is removed from the database, the associated space is marked as "free" within the DB and will be re-used if possible before new space is allocated, but it won't be given back to the OS. So this means that if you add millions of entries and then delete them, the database size will increase during the adds, but won't decrease during the deletes (and in fact might increase just a little as the entries are converted to tombstones).
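A toy model can make this behavior concrete. The sketch below is only an illustration under simplifying assumptions, not how the Directory Server database is actually implemented: it counts abstract pages, ignores tombstones, fragmentation, and changelog growth, and the class name DbSizeSketch and its methods are made up for this example.

```java
/** Toy model: freed pages are reused for later adds but never returned to the OS. */
final class DbSizeSketch {
    private long allocatedPages; // pages the database file occupies on disk
    private long freePages;      // pages inside the file that are marked "free"

    /** Store new data, reusing free pages before growing the file. */
    void add(long pages) {
        long reused = Math.min(pages, freePages);
        freePages -= reused;
        allocatedPages += pages - reused;
    }

    /** Remove data: the pages become free, but the file does not shrink. */
    void delete(long pages) {
        long inUse = allocatedPages - freePages;
        freePages += Math.min(pages, inUse);
    }

    long allocatedPages() {
        return allocatedPages;
    }

    long freePages() {
        return freePages;
    }

    public static void main(String[] args) {
        DbSizeSketch db = new DbSizeSketch();
        db.add(1_000_000);    // the file grows during the adds
        db.delete(1_000_000); // the file stays the same size during the deletes
        System.out.println("allocated=" + db.allocatedPages()
                + " free=" + db.freePages());
    }
}
```

In this model the on-disk size only grows again once later adds have used up all of the free pages.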

Note, however, that we are taking steps to try to reduce the impact of this database growth as much as possible. I can't go into a lot of detail yet, but within the next couple of releases, you should see some significant changes that will both reduce the initial size of the database and will help stem its growth over time. We're also looking at ways that we can alter other processes that may be impacted by the database size (e.g., backup and restore) to make them more efficient. All of this work should help make the Directory Server easier to manage, and should also result in ways that we can make it perform and scale even better than it already does.

Posted by cn_equals_directory_manager (Apr 20 2006, 09:54:36 AM CDT)

Wednesday April 12, 2006

How to Hide an Easter Egg

When I was a kid, I participated in a decent number of Easter egg hunts around this time of the year. In many ways, I suppose I still haven't grown up because I still enjoy finding them. But nowadays, when I go looking for them it's usually in an electronic form. If you know where to look, you can find them in a number of places. You can find them in DVD menus, music CDs, video games, and even embedded in hardware.

Of course, developers often embed Easter eggs in software. Some of them aren't all that hard to find, and you may even come across them by accident. I've found several such eggs in DVD menus that way. In fact, a few years ago when I was just a lowly support engineer, I stumbled across the Directory Server 5.0 Easter egg when I was examining a core file from a crash that a customer experienced. I won't tell you how to find it, but it shouldn't be that hard if someone is looking for it and has a little ingenuity.

Some Easter eggs are a little harder to find. Take SLAMD for example. It's completely open source, and every single line of the code is available for you to review, but it takes a lot of the fun out of it if you can just look at the source code to find the Easter eggs. It is therefore necessary to come up with a much better way of hiding them so that they are not obvious even under the closest scrutiny.

First, let me say that I take no credit whatsoever for the mechanism that I've used to hide the Easter eggs in SLAMD. It was entirely the brainchild of David Ely, one of my colleagues (and recently promoted to be my new boss -- congrats, David!) who at the time used it to hide a little something in the administration console for our Identity Synchronization for Windows product. I only tweaked it slightly so that it was better-suited for use in a Web application.

So how does it work? First, I come up with a special query string that I want to be used to be able to access the egg, for example "easter=egg". I generate an MD5 digest based on that query string and use that in the source code. If the set of parameters provided to the application doesn't define a valid operation, then I calculate the MD5 digest of the user-provided query string and match it against the digest for the Easter egg. If it does match, then I take the clear-text query string and use it as the key to decrypt a file that has been encrypted using DES. The decrypted contents are then used as the body of an HTML page with the content that I want to use as the egg.

It's true that DES is considered pretty weak and MD5 is starting to show some signs of weakness, and it's also true that this same approach would work just as well with other message digest and/or encryption algorithms. However, I don't consider this to be a significant problem because what I'm hiding isn't all that sensitive. Also, the "weakness" of these algorithms is a relative term, and breaking them is still far outside the realm of the casual cryptographer.

In SLAMD, most of the work for creating the Easter egg is handled by the src/CreateEgg.java source file. Simply point it at a file containing the clear-text content to use as the body of the page and give it the query string to use to display it. It will then create a new file with the DES-encrypted content and print out the MD5 digest of that query string. That MD5 digest goes in the QUERY_STRING_MD5 variable in the com.sun.slamd.common.Constants class and the encrypted file content gets put in the com/sun/slamd/md5 directory of the lib/slamd_resource.jar file. The final piece of the puzzle is in the com.sun.slamd.admin.AdminServlet class, where doPost checks the MD5 digest of the query string as a last resort and if a match is found calls generatePageFromMD5 to decrypt and display the content.
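For illustration, here is a minimal, self-contained sketch of the scheme in Java. It is not the code from CreateEgg.java or AdminServlet: the class and method names (EggSketch, md5, encryptEgg, lookupEgg) are made up, and it assumes the JCE's password-based PBEWithMD5AndDES cipher with a fixed salt and iteration count as a stand-in for however the real code turns the query string into a DES key.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

/** Illustrative only: not the actual SLAMD CreateEgg or AdminServlet code. */
final class EggSketch {
    // Assumption: password-based DES with a fixed salt and iteration count.
    // This cipher expects an ASCII password, which "easter=egg" satisfies.
    private static final String CIPHER = "PBEWithMD5AndDES";
    private static final byte[] SALT = {1, 2, 3, 4, 5, 6, 7, 8};
    private static final int ITERATIONS = 1000;

    /** MD5 digest of the query string; only this value appears in the source. */
    static byte[] md5(String queryString) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(queryString.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts or decrypts with a DES key derived from the query string. */
    private static byte[] crypt(int mode, String queryString, byte[] input)
            throws GeneralSecurityException {
        SecretKey key = SecretKeyFactory.getInstance(CIPHER)
                .generateSecret(new PBEKeySpec(queryString.toCharArray()));
        Cipher cipher = Cipher.getInstance(CIPHER);
        cipher.init(mode, key, new PBEParameterSpec(SALT, ITERATIONS));
        return cipher.doFinal(input);
    }

    /** Build step: encrypt the page body using the clear-text query string. */
    static byte[] encryptEgg(String queryString, String html) {
        try {
            return crypt(Cipher.ENCRYPT_MODE, queryString,
                    html.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Last-resort check at request time; returns null when nothing matches. */
    static String lookupEgg(String queryString, byte[] storedMd5, byte[] encrypted) {
        if (!MessageDigest.isEqual(md5(queryString), storedMd5)) {
            return null; // not the egg, fall through to normal error handling
        }
        try {
            return new String(crypt(Cipher.DECRYPT_MODE, queryString, encrypted),
                    StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] digest = md5("easter=egg");
        byte[] blob = encryptEgg("easter=egg", "<p>You found it.</p>");
        System.out.println(lookupEgg("easter=egg", digest, blob));
        System.out.println(lookupEgg("easter=bunny", digest, blob));
    }
}
```

The key for the cipher is deliberately derived from the clear-text query string rather than from the stored MD5 digest. If it were derived from the digest, anyone reading the source could decrypt the file.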

As you can see, neither the query string nor the content that will be displayed when that query string is provided is stored in the clear. Only the digest and the encrypted file are kept, so they can sit out in the open without giving away the mystery.

Posted by cn_equals_directory_manager (Apr 12 2006, 08:21:45 AM CDT)

Wednesday April 05, 2006

Don't Benchmark with Real Data

I spend a fair amount of time working with customers helping them benchmark the Directory Server for one reason or another. In some cases, it's to help them decide whether to go with our Directory Server or a product from another vendor. In others, it's to help examine a performance problem that they've run into. In still others, it's to help demonstrate how the product will behave when they switch to new hardware, add new entries, or add new attributes to existing entries.

In many of these cases, the customer wants us to use their actual data in order to get the most accurate results. However, that's almost always a bad idea and would hurt a lot more than it would help. There are many reasons that you should avoid using real data when benchmarking the Directory Server. Some of those include:
  • Real data carries with it real privacy concerns. Even basic information like names, e-mail addresses, and phone numbers can be trouble if it gets into the wrong hands. In some industries, there are laws and regulations that control what can and can't be done with the data, but most of the time it's just easier to use fake data.

  • It's much easier to store information about how to generate fake data than to handle the real data. An 8K template is a lot easier to handle than a 100GB LDIF file.

  • Once you have generated a template that can be used to create fake data that looks like your real data, then you can use that template to help troubleshoot issues that you might have. If you do run into a problem and need to contact technical support, having a template that can generate fake data that looks like real information can help us to visualize what your server is like and can help us set up our own tests to try to reproduce the problems in our labs.

  • You can design the data to make it easier for use in performance testing. This includes things like using sequentially-incrementing counters in attribute values and/or entry DNs, using consistent passwords for all users, or generating files with potential search filters for use in conjunction with that data.

  • You can more easily test with information that you don't have. If you generate the data, then you can add additional attributes per entry (e.g., if you're considering adding new directory-enabled applications into your environment), and you can also test with larger numbers of entries (e.g., if you anticipate growth in the future).
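As a rough illustration of data designed for testing, the Java sketch below generates entries with a sequential counter in the DN and attribute values, the same password for every user, and a matching list of search filters. It is a hand-rolled example and not MakeLDIF or its template format; the class name FakeLdifSketch, the ou=People,dc=example,dc=com suffix, and the choice of attributes are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hand-rolled generator, not MakeLDIF. */
final class FakeLdifSketch {
    // Hypothetical suffix; substitute whatever your test directory uses.
    private static final String SUFFIX = "ou=People,dc=example,dc=com";

    /** One LDIF entry whose DN and values come from a sequential counter. */
    static String entry(int n) {
        String uid = "user." + n;
        return String.join("\n",
                "dn: uid=" + uid + "," + SUFFIX,
                "objectClass: top",
                "objectClass: person",
                "objectClass: organizationalPerson",
                "objectClass: inetOrgPerson",
                "uid: " + uid,
                "givenName: First" + n,
                "sn: Last" + n,
                "cn: First" + n + " Last" + n,
                "mail: " + uid + "@example.com",
                "userPassword: password", // same password for every user
                "");
    }

    /** Equality filters that each match exactly one generated entry. */
    static List<String> filters(int count) {
        List<String> result = new ArrayList<>();
        for (int n = 0; n < count; n++) {
            result.add("(uid=user." + n + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 0; n < 3; n++) {
            System.out.println(entry(n)); // println adds the blank separator line
        }
        filters(3).forEach(System.out::println);
    }
}
```

Because every value is derived from the counter, a load-generation client can build a valid bind DN or search filter for any entry without reading the LDIF file.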

Of course, using generated data is completely worthless if that information does not accurately reflect the actual data (or the expected actual data) in the environment. SLAMD contains two tools that can help with this. The first is MakeLDIF, which is a utility that can generate very realistic LDIF data using simple templates. It has been around for quite a while and is included with all released versions of SLAMD. The second is the LDIF Analyzer, which can examine a real-world LDIF file and summarize the data that it contains so that you can more easily create a MakeLDIF template to simulate that information. This is a newer tool that is currently only available by checking out the SLAMD source from CVS, although I do expect to make an alpha release of SLAMD 2.0 available in the very near future. Using these tools, you can create a MakeLDIF template that closely matches your actual data but is much more portable and easier for use in benchmarking.

Posted by cn_equals_directory_manager (Apr 05 2006, 09:59:28 AM CDT)

