Thursday March 01, 2007 | cn=Directory Manager All about Directory Server |
Data Distribution in DSEE 6

The latest version of the Sun Java Enterprise System was officially released today, and included in it is the 6.0 release of our Directory Server Enterprise Edition suite. There are some great changes in the Directory Server itself (no more limit on the number of masters, new graphical and command-line administrative interfaces, security improvements, added 64-bit platform support for Solaris x86/x64, etc.), and they'll make great fodder for future posts. However, I want to shift the focus of this entry to Directory Proxy Server. I haven't talked about it much in the past, but it has always offered very useful features like transparent load balancing and failover, improved compatibility for clients, data translation, and added security features. But Directory Proxy Server 6 takes a huge leap forward from its predecessor. Not only are there a lot of improvements in the core proxy functionality (e.g., operation-based load balancing, improved connection pooling, support for SASL EXTERNAL, etc.), but it also adds two major new categories of features: virtual directory operations and data distribution. In this post, I want to focus on data distribution.

The new data distribution capabilities in Directory Proxy Server 6 make it possible to dramatically scale the size and performance of your directory environment. On its own, the Directory Server is able to take advantage of large amounts of memory and large numbers of CPUs. However, eventually you're going to hit a limit on the amount of data you can put in a single box and still get acceptable performance. Some of our largest customers (both in terms of the number of entries in their directory environment and in the size of those entries) also have very strict response time requirements (often single-digit milliseconds).
To meet those requirements, you don't have the luxury of going to disk, so you've got to serve all of the data from memory (in some cases, going with a solid-state disk solution may be a possibility, but that's probably a topic for another time). Sun has some big machines (and for Directory Server, it's going to be hard to find anything available right now that can beat the Sun Fire X4600 with 16 Opteron cores and up to 128GB of memory, and if you've got to go monolithic then the E25K can hold over a terabyte of memory), but eventually there's a limit to what one box can hold.

Data distribution changes the game by allowing you to split up your data across multiple sets of servers. If one server can hold 25 million entries but you need to support 100 million, then you can break up the data into four sets. This is done in a manner that is virtually transparent to clients, so there's no need to artificially create hierarchy in your data or perform other kinds of transformations. When a client request comes into the Distribution Server, the server figures out which set(s) of backend servers might need to be involved in processing that request, and then forwards it on to one of the servers in each of those sets (most of the time, only one set is involved, but some kinds of searches may need to involve multiple sets). You can customize how the data gets split up by specifying which distribution algorithm you want to use (or if you don't like any of the ones provided with the server, you can write your own), and you can customize the way that the Distribution Server picks the actual backend server within a set through a pluggable load balancing algorithm.

Another benefit that data distribution can provide is improved write performance. In the past, it's been easy to get improved read performance by simply adding more replicas, but that doesn't work for write operations because in a standard replicated environment, all of the changes have to go everywhere.
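
As a rough illustration of the request routing described above, here is a minimal Python sketch. It is not Directory Proxy Server code or configuration: the names `BackendSet`, `DistributionRouter`, and `hash_distribution` are hypothetical, the CRC32-based hash is only a stand-in for whichever distribution algorithm you choose, and round-robin is only a stand-in for the pluggable load balancing algorithm.

```python
import itertools
import zlib


class BackendSet:
    """Hypothetical group of replicated servers holding one slice of the data."""

    def __init__(self, name, servers):
        self.name = name
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)

    def pick_server(self):
        # Round-robin stand-in for the pluggable load balancing algorithm.
        return next(self._cycle)


def hash_distribution(key, num_sets):
    # Stand-in distribution algorithm: a stable hash of the entry's key.
    # Lowercasing the key is an assumption made for this illustration.
    return zlib.crc32(key.lower().encode("utf-8")) % num_sets


class DistributionRouter:
    """Hypothetical router: choose the backend set(s), then one server per set."""

    def __init__(self, backend_sets, algorithm=hash_distribution):
        self.backend_sets = list(backend_sets)
        self.algorithm = algorithm

    def route_entry(self, key):
        # A request for a specific entry involves exactly one backend set.
        index = self.algorithm(key, len(self.backend_sets))
        backend_set = self.backend_sets[index]
        return backend_set.name, backend_set.pick_server()

    def route_search(self, key=None):
        # A search that cannot be tied to one entry goes to every set.
        if key is not None:
            return [self.route_entry(key)]
        return [(s.name, s.pick_server()) for s in self.backend_sets]


if __name__ == "__main__":
    # Four sets of two servers each, e.g. 100 million entries at 25 million per set.
    sets = [BackendSet(f"set{i}", [f"ds{i}a", f"ds{i}b"]) for i in range(4)]
    router = DistributionRouter(sets)
    print(router.route_entry("uid=jdoe"))
    print(router.route_search())
```

In this sketch, a request for a specific entry touches a single set, while a search that cannot be tied to one entry fans out to one server in every set.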
With data distribution, replication only needs to occur between the servers in a backend set, so if you've got five sets of servers, then you've got the potential for five times the aggregate write performance. We've demonstrated this technology to a number of customers over the last couple of years, and we've seen some very impressive results.

I'll be the first to admit that data distribution isn't for everyone. It really is targeted at those environments with large amounts of data that can't fit on a single system, or for those cases in which the single-server write performance isn't adequate. If you're doing fine in your current environment and don't expect to grow by leaps and bounds in the near future, then it's probably not for you. There is a bit of a learning curve, and it's wise to put some thought into how best to split up the data. We've already got improvements lined up for when this functionality gets integrated into OpenDS that we hope will make it easier to use and lower the barrier to entry, but we're also making improvements that we hope will allow for more effective use of single-server (or single replicated set) deployments. If you're doing fine in your current environment and don't expect to grow a lot in terms of amount of data or performance requirements, then the traditional approach is probably still the best. But if you expect to see a lot of new data being added to the server, or much more stringent performance requirements, then data distribution might be right up your alley.

Posted by cn_equals_directory_manager ( Mar 01 2007, 03:03:10 PM CST )

Comments:
Posted by laurent marot on March 06, 2007 at 11:02 AM CST #
We realize that this is a bit hard to find right now and we're working on making it easier.
Posted by Neil A. Wilson on March 06, 2007 at 11:11 AM CST #