We pushed the Darkstar roadmap out on the site last week. This led to no little concern among the engineering team, in that we are now predicting what we are going to do in the future. But this concern was offset by our desire to let the wider Darkstar community know what we are planning to do, and by the hope that we can get some feedback from that community on whether our plans are the right thing for them.

We currently have the multi-node version up and running in our lab, and are doing a set of performance tests on it. We are also trying to add a lot of instrumentation and some network resource throttling. The instrumentation is there so that we can get some idea of what is happening inside the multi-stack Darkstar server. It is always a challenge to get the right information about concurrent or distributed systems. We spend a lot of time looking at log files, trying to reconstruct what is happening in the system. We don't want to subject any of the users of the infrastructure to the same pain, so one of the goals of the instrumentation is to build some tools so that everyone can see what is happening in a better way.

We have already discovered one interesting thing about the log files from the current users, which is that log files and transactional systems don't necessarily mix well. Our current logs write out messages when they happen; this means that any messages written while a task is running stay in the log even if the task is aborted. The same messages are then written to the log a second time when the task is re-run (hopefully to completion this time).

This is good information for those of us doing the core development, since we can see some of the times that the tasks are aborting and can start tracking down concurrency problems. But for developers using Darkstar, these logs can make it look like some operations are happening more often than they should. Since the users of Darkstar aren't supposed to know about the transactions, it isn't surprising to find that they don't check in the log files to see what is committing and what is aborting.

We've actually talked about having two sets of log files. The first would be the current log files, which are useful for those of us trying to see what is happening in the core. The second set would be transactional. As with the current Session and Client services, the messages to these logs would not be written until the transaction in which they occur commits. This second log would be more useful for users of the infrastructure, while the first would be more useful for those developing the infrastructure itself. However, writing two sets of logs seems somewhat excessive, and other priorities have ranked higher, so we haven't done this yet. But summer is coming, which means a new set of interns (and a transactional log would be a good intern project), so this still might go forward. Opinions from the community on how useful this would be would, of course, help us make up our minds.
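To make the transactional-log idea a little more concrete, here is a minimal sketch in Java. It is not Darkstar code and does not use any Darkstar API; the class name TransactionalLogBuffer and its methods are made up for illustration. It assumes one buffer per task, with something outside the sketch calling commit() or abort() when the task's transaction ends.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: holds a task's log messages until its transaction commits. */
class TransactionalLogBuffer {
    // Messages written during the current attempt of the task.
    private final List<String> pending = new ArrayList<>();
    // Stand-in for the real log file: only committed messages end up here.
    private final List<String> committed = new ArrayList<>();

    /** Record a message; it does not reach the log yet. */
    void log(String message) {
        pending.add(message);
    }

    /** The transaction committed: move the pending messages into the log. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** The transaction aborted: drop everything this attempt wrote. */
    void abort() {
        pending.clear();
    }

    /** Returns a copy of what the transactional log contains so far. */
    List<String> committedMessages() {
        return new ArrayList<>(committed);
    }
}
```

With a buffer like this, a task that aborts once and then succeeds on the retry leaves a single copy of each message in the transactional log, whereas a log that writes immediately would show each message twice.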

The throttling problem is clearer as a priority, even if it isn't quite as easy to solve. Simply put, we need to add some code so that clients can't overwhelm the entire system by sending too many messages, either on the Session connections or on Channels. Right now, all of the messages are buffered, but eventually the buffering space will run out. When that happens, bad things can happen to Darkstar. We've seen some cases where the system freezes up and is unable to make any additional progress. Clearly, this is a bad thing; we are looking into ways to disconnect the offending client to ensure that this won't happen.
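One common way to decide when a client is sending too much is a per-client token bucket. The sketch below is only an illustration under that assumption, not what we have implemented; ClientThrottle, its parameters, and the use of a string client id are all hypothetical, and the caller is assumed to supply the current time and to do the actual disconnecting.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a per-client token bucket for spotting flooding clients. */
class ClientThrottle {
    private final int capacity;          // largest burst of messages a client may send
    private final double refillPerMilli; // tokens restored per millisecond
    // clientId -> {tokens remaining, time of last message in milliseconds}
    private final Map<String, double[]> buckets = new HashMap<>();

    ClientThrottle(int capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
    }

    /**
     * Returns true if the message may be processed. A false return means the
     * client has exceeded its budget and the caller may choose to disconnect it.
     */
    synchronized boolean allow(String clientId, long nowMillis) {
        double[] bucket = buckets.computeIfAbsent(
                clientId, id -> new double[] {capacity, nowMillis});
        // Restore tokens for the time elapsed since this client's last message.
        double elapsed = Math.max(0, nowMillis - bucket[1]);
        bucket[0] = Math.min(capacity, bucket[0] + elapsed * refillPerMilli);
        bucket[1] = nowMillis;
        if (bucket[0] < 1.0) {
            return false; // budget exhausted
        }
        bucket[0] -= 1.0; // spend one token on this message
        return true;
    }
}
```

The bucket allows short bursts up to the capacity while bounding the sustained rate, so a well-behaved client that occasionally sends a flurry of messages is not penalized. Each client has its own bucket, so one flooding client does not use up another client's budget.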

This is really a specific case of a more general problem that we have started thinking about. The goal of Darkstar is to let you spread the server side of a game over lots of machines and lots of threads without having to worry about how many machines and how many threads. But any system is going to have some resource limits, and when those limits are hit the system needs to be able to protect itself in a reasonable way. It isn't clear to us what constitutes a "reasonable way" in all cases. In the case of a client that is flooding the network with too much traffic, it seems reasonable to disconnect that client (if nothing else, it would seem that such a client might be misbehaving, either intentionally or because of a bug). But what to do in the case of other resources (memory, threads, or storage all come to mind) is not so clear. If any of you out there have any ideas, we would love to hear them.

Anyway, take a look at the roadmap and see what you think. If something is missing, or if some feature is more important than we think it is, tell us that we got the priorities wrong (but be willing to tell us what we should not be doing instead; we are a small group so the effort is zero-sum). If we are working on something that isn't really important at all, tell us. We are always happy to not do something. And tell us what you think we should be doing after the work on the current map is done. After all, the current map only extends for the next year or so (which is about as far into the future as I think I can see), so it's time to think about what's next...

Comments:

Just thought I'd add a comment to let you know that the Roadmap is appreciated. When looking at and working with Darkstar, I had no idea when a multi-node version of Darkstar was going to be released, and since there was no information being shared publicly, it was possible that it wasn't even anywhere near completion. Now that I know that the multi-node version is close, it gives me more confidence in continuing to use Darkstar. I don't even mind if it gets delayed - it's just comforting to know that it's a few months away instead of a year or more.

Transactional logs sound like a good feature, as does disconnecting clients that are flooding the server. As far as other resources, such as threads or memory, I'd hope that a load balancing version of Darkstar would be able to shift ManagedObjects around between servers, which would help with that. How that occurs in the back end is your problem! :-D

Thanks again

Posted by Clay Larabie on March 19, 2008 at 02:13 PM PDT

I think the transactional log and even some extra debugging info for re-tried transactions would be of great help. Before I finally realized that there is a timeout of 100ms for transactions and they get re-scheduled, I spent a couple of days instrumenting the Darkstar source code thinking it was a bug. Hours later I finally saw the stacktrace indicating that the task took too long (I was accessing an SQL database in the loggedIn() method). If such things could be printed out even to the standard log for future reference (indicating the task that took long), I think it could help a lot.

As for the roadmap, I think it's awesome you guys decided to post it. And the blogs (yours and Seth's) with the insight, probably even better ;) Keep the info coming!!!

Posted by Maciej Miechowicz on March 21, 2008 at 08:29 AM PDT

This blog copyright 2009 by Jim Waldo