For the last couple of weeks we have been getting the 1.0 release of Darkstar together, which is why I've been rather quiet. This is an important release for us, because it will be the first release where you can run your Darkstar-based game on multiple machines by simply starting extra copies of the game and Darkstar stack. The different copies will have access to the same data, be able to send messages on the same channels, and essentially be able to scale by adding new machines. This is the real thing, the one we have been talking about for some time, released to the community.

We have been running the multi-node version of the stack internally for a couple of weeks now, finding (and fixing) bugs and attempting to get a handle on how the new version performs. We were expecting that the performance of the multi-node version on a single server would not be as good as the single-node version, but we wanted to get some idea of just how the two differed. The multi-node version needs to do a lot more work than the single-node version, to make sure that it can deal with servers crashing and can keep track of what is going on where, so we were expecting something of a performance hit. The question was how much.

This is not nearly as simple a thing to measure as you might think. The multi-node version of the stack differs from the single-node version that we put out as 0.9.5 in a number of ways that go beyond running on more than one machine. The channel service, for example, has undergone some major changes, and we have added code that will back off on the communication systems when one machine is flooding the stack with too many messages. With all these changes, it is not a simple A/B comparison. That we expected.

What we didn't expect were the problems that are more intrinsic to the design of Darkstar itself. The whole idea behind Darkstar is to present the programmer with the illusion that he or she is writing code on a single machine running in a single thread, while using lots of machines and lots of threads (and cores) to actually run that code. In a sense, we are trying to run a system that hides the complexity of the actual execution environment from the programmer. And it turns out that we did a pretty good job. So good, in fact, that we are having trouble penetrating the smoke screens we have set up for the user of Darkstar so that we can find out what is really going on under the covers.

This is similar to the problem I talked about in my last post having to do with logging. Our current logging mechanism logs what is going on in a transaction (which is not a notion that is visible to the programmer). Some of the things that go on happen in transactions that are aborted (since they touch some data that is already being used by another transaction in another thread), so when the aborted task gets re-scheduled it looks like the operation happens a second time. We have talked about building a transactional log that can be used to record only those operations that actually happen (and the feedback from the blog makes it more likely that we will build such a thing). But we (the Darkstar development team) need the non-transactional log so that we can see all of what is going on.
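To make the difference between the two kinds of log concrete, here is a minimal sketch of the idea. It is not Darkstar code: the class and method names (`TxnLogSketch`, `Attempt`, `committedView`, `rawView`) are invented for illustration, and it assumes each attempt to run a task can be told explicitly whether it committed or aborted. Records are buffered per attempt and only reach the transactional view on commit, while the raw view keeps everything.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the Darkstar API.
final class TxnLogSketch {
    // Records from attempts that committed (the "transactional" log).
    private final List<String> committed = new ArrayList<>();
    // Every record ever written, including those from aborted attempts.
    private final List<String> raw = new ArrayList<>();

    // One buffer per attempt to run a task; not meant to be shared across threads.
    final class Attempt {
        private final List<String> pending = new ArrayList<>();
        private boolean finished = false;

        void log(String message) {
            if (finished) {
                throw new IllegalStateException("attempt already finished");
            }
            synchronized (TxnLogSketch.this) {
                raw.add(message);      // the development team sees this immediately
            }
            pending.add(message);      // the game developer sees it only on commit
        }

        void commit() {
            finished = true;
            synchronized (TxnLogSketch.this) {
                committed.addAll(pending);
            }
            pending.clear();
        }

        void abort() {
            finished = true;
            pending.clear();           // buffered records are dropped
        }
    }

    Attempt begin() {
        return new Attempt();
    }

    synchronized List<String> committedView() {
        return new ArrayList<>(committed);
    }

    synchronized List<String> rawView() {
        return new ArrayList<>(raw);
    }
}
```

If a task logs one message, is aborted, and then logs the same message again on a retry that commits, the raw view holds two records and the committed view holds one. That matches what each audience wants to see.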

There are lots of other things that affect the performance (as opposed to the correctness) of the system that are similarly hidden. It would be nice to find out how many threads are actually running at a given time, along with what tasks they are performing and how much contention they are encountering. It would be nice to know how long aborted tasks wait before they are retried. We would love to see what kind of data access patterns are occurring, and how the I/O system is performing under load.

But since we are hiding all of this mechanism under the covers, there is no way to write simple applications that will gather this information. Instead, we need to instrument the Darkstar stack itself in the right places to see what is happening that is intentionally hidden from the user of Darkstar. So that is where we are spending a lot of our time these days.
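As a rough illustration of what such instrumentation might record, here is a small sketch of a counter object that a task scheduler could call at the points where tasks start, commit, and abort. Everything in it is an assumption made for the example: `TaskStatsSketch` and its methods are hypothetical names, not part of the Darkstar stack, and it assumes each task has a stable identifier that survives a retry and that the caller supplies timestamps in nanoseconds.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: these names are illustrative, not part of Darkstar.
final class TaskStatsSketch {
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicLong aborts = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();
    private final AtomicLong totalRetryDelayNanos = new AtomicLong();
    // Abort time of each task that is currently waiting to be retried.
    private final ConcurrentHashMap<String, Long> abortedAt = new ConcurrentHashMap<>();

    // Called when a thread begins running a task (first attempt or retry).
    void taskStarted(String taskId, long nowNanos) {
        running.incrementAndGet();
        Long aborted = abortedAt.remove(taskId);
        if (aborted != null) {
            // This start is a retry: record how long the task waited.
            retries.incrementAndGet();
            totalRetryDelayNanos.addAndGet(nowNanos - aborted);
        }
    }

    // Called when a task's transaction commits.
    void taskCommitted(String taskId) {
        running.decrementAndGet();
    }

    // Called when a task's transaction aborts and the task will be rescheduled.
    void taskAborted(String taskId, long nowNanos) {
        running.decrementAndGet();
        aborts.incrementAndGet();
        abortedAt.put(taskId, nowNanos);
    }

    int runningTasks() {
        return running.get();
    }

    long abortCount() {
        return aborts.get();
    }

    double meanRetryDelayNanos() {
        long n = retries.get();
        return n == 0 ? 0.0 : (double) totalRetryDelayNanos.get() / n;
    }
}
```

The point of the sketch is only that these hooks have to live inside the scheduler. An application written against the single-threaded illusion never sees an abort or a retry, so it has nowhere to put them.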

We also realize that we are not the only audience for the tools that will result from this work. Anyone using Darkstar will benefit from these tools, since they will want to see what is actually going on with their own application. Just as we need to see what is happening under the covers, so users of Darkstar will want to take a look every now and then to see why their game or virtual world is performing the way it is.

This is an interesting comment on abstraction, in that you can hide things from the developer of a system and still have correctness, but if you are tuning for performance you may need to reveal more of what is actually going on. Of course, this has been true for all of the levels of abstraction that we have introduced in the past (high-level languages, abstract data types, objects). So I guess that we should not be surprised that we find this true in the case of Darkstar. It does lead to the interesting question of just what can be hidden, and for how long, in our systems. But that's why we are doing this in a research lab...

This blog copyright 2009 by Jim Waldo