(An apology to those of you who may have seen this previously as a single block of text. The HTML accepted by Safari seems to be different from that accepted by Firefox, and I had previewed in the former but not the latter. I do so love standards...)

I've just spent a week getting back into work after a two-week vacation, and must say that I'm still feeling pretty relaxed. I didn't do anything special on the vacation-- hung around the house, did some reading and some writing, and lots of projects that have been building up over the past couple of years. In some ways, the biggest change is that I didn't have to commute to work every day. But it was relaxing. Even after the requisite penalty week, when you have to make up for the fact that you have been away, I'm still feeling pretty mellow.

While I was away, I did track some of the comments on the previous entries in this blog about the directions we are taking to try to get to multi-node. Most of the comments were pretty supportive, but there were some that were essentially variants of "there are simpler ways to do this; why don't you just rely on the game programmer to tell you how to do things, and you could have a product out much more quickly."

I have a lot of sympathy for those making these comments. Certainly, if the goal of Project Darkstar were to put out a product that could be used by game server developers to make their lives easier, and to do so in the quickest way possible, we would not be trying to do the sorts of things I was talking about. There are much easier things that could be very useful, and perhaps commercially viable, that could be put out much more quickly and without nearly the risk that the current plan entails.

But we aren't a product group, and those aren't our goals. Understanding that, and understanding what our goals really are, can help others to understand why we make the decisions that we make, and why we are taking the directions we are taking. Such understanding will also help people to understand what our risk portfolio is, and why.

Research, by its nature, means that we don't know what we are doing. Put another way, if we know how to do something, then doing it isn't research. Being part of a research group means that we have to be trying something that hasn't been done before, but that would be useful if it were done. In doing that research, we often need to build an artifact that tests whether or not we have done the thing that we are trying to do. But we have to be careful not to confuse the artifact with the research; the artifact is a means to the end of the research. The real end of the research is finding out something new.

In our case, the research is centered around ways of making it easier to program scalable applications that span multiple cores and multiple machines. Our investigations are around simplifying the programming of such scalable applications by enabling the programmer to act as though he or she were programming to a single core on a single machine, and then automagically spreading the work over all available cores and all available machines.

This is currently a hot set of topics in the research world. Much of the Google infrastructure, such as MapReduce and the Chubby lock service, is about solving this problem. Amazon worries about this sort of thing with the infrastructure behind their web services. Just doing what those organizations have done would be useful in certain circumstances, but it wouldn't be research. We can already read the papers they have written or examine the products they have shipped to find out how to do these things.

What makes Project Darkstar different is the context in which we are trying to solve these problems. Unlike Google or Amazon, who are worried about large jobs that require a lot of throughput but are pretty much latency insensitive, we decided to look at the problem in the context of games and virtual worlds, which don't require huge amounts of throughput but are very sensitive to latency. Because of the different context and requirements, the solution to the problem needs to be different. And it is a fun community to work with. After years of working with partners and customers from the traditional parts of enterprise computing, working with game developers makes me feel young again.

Latency, by the way, is generally much more difficult to deal with than limitations on bandwidth. Bandwidth is, in its simplest form, a problem that can be solved by throwing money at it. If you need twice as much bandwidth, just put in another pipe, and (modulo coordinating the two sources) you have doubled your bandwidth. Latency, on the other hand, has a lower bound that is dictated by physics. We can sometimes find places in our software where we have added latency on top of this bound, for example by having extra layers of software or making multiple copies of buffers. But once we get rid of these extra slow-downs, there is nothing more that we can do to really reduce latency. It is just the way things are.
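To put a rough number on that physical bound, here is a back-of-the-envelope sketch in Java. The figures are assumptions for illustration only: a signal in optical fiber is taken to travel at roughly two thirds of the speed of light in a vacuum, and the 4,000 km path length is made up. Real routes are longer than the straight-line distance and add switching delay, so actual round trips will be worse than this floor.

```java
final class LatencyFloor {
    // Speed of light in a vacuum, in kilometers per second.
    static final double C_VACUUM_KM_PER_S = 299_792.458;

    // Assumed fraction of c at which a signal propagates in optical fiber (roughly 2/3).
    static final double FIBER_FRACTION = 2.0 / 3.0;

    /** Minimum round-trip time, in milliseconds, over a fiber path of the given one-way length. */
    static double minRoundTripMillis(double oneWayKm) {
        double oneWaySeconds = oneWayKm / (C_VACUUM_KM_PER_S * FIBER_FRACTION);
        return 2.0 * oneWaySeconds * 1000.0;
    }

    public static void main(String[] args) {
        // Hypothetical distance between a player and a game server.
        System.out.println("Round-trip floor in ms: " + minRoundTripMillis(4000.0));
    }
}
```

Under those assumptions the floor for a 4,000 km path comes out at about 40 ms per round trip, and no amount of extra hardware or money lowers it.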

But back to what we are trying to do with Project Darkstar. We aren't trying to find the easiest, or least risky, or most predictable way for us to help game programmers scale their on-line games or virtual worlds. While we are not randomly trying to increase the risk in what we are doing, we are definitely picking certain areas of the technology where we are being as aggressive as possible in pushing the technology. This increases the risk, makes the whole thing harder to do, and might seem like an odd way to proceed. But the idea is that we will learn something by pushing in this way, even if we fail to produce something that works.

For example, it would be a lot simpler if we added an API to allow the game server programmer to tell us which players should be co-located on the same server. This could be something that was static (allowing all the members of a single guild or raiding party to be put together) or dynamic (allowing all of the players who are in a particular zone to be put together). This actually only makes our (the Project Darkstar team's) job easier, since it moves the responsibility of figuring out the co-location scheme to the game programmer. It also means that we can blame someone else if something goes wrong. Still, if we were trying to minimize our risk, that would be the right thing to do.
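To make the simpler alternative concrete, here is a minimal sketch of what such a hint API might look like. Everything in it is hypothetical: `ColocationHints`, `SimpleColocationHints`, and the group-naming scheme are invented for illustration and are not part of Project Darkstar.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical hint API, NOT part of Project Darkstar. */
interface ColocationHints {
    /** Asks that all of the given players be placed on the same server. */
    void colocate(String groupName, Collection<String> playerIds);

    /** Returns the group a player was most recently assigned to, if any. */
    Optional<String> groupOf(String playerId);
}

/** Simplest possible implementation: the latest hint for a player wins. */
final class SimpleColocationHints implements ColocationHints {
    private final Map<String, String> groupByPlayer = new HashMap<>();

    @Override
    public void colocate(String groupName, Collection<String> playerIds) {
        for (String id : playerIds) {
            groupByPlayer.put(id, groupName);
        }
    }

    @Override
    public Optional<String> groupOf(String playerId) {
        return Optional.ofNullable(groupByPlayer.get(playerId));
    }
}
```

In this sketch a static hint would be a single `colocate` call for a guild made at startup, while a dynamic hint would be a fresh `colocate` call each time a player changes zone. Either way, the game programmer carries the burden of deciding which grouping is right.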

Instead, we are trying to figure out the co-location groupings based on the network of player communication and object interactions ourselves. This is much harder (for us) than letting someone else do it. If we succeed, the job of the game programmer will be considerably easier, but if things don't work we will have no one but ourselves to blame. We have not only increased the risk, but we have concentrated it on the project team rather than spreading it around to our users.
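As a toy illustration of the general idea (and emphatically not the algorithm we are building), the sketch below groups identities by counting observed pairwise interactions and merging any pair that has interacted at least some threshold number of times. The class name, the string identities, and the threshold are all assumptions made up for the example; the grouping itself is a plain union-find over the interaction graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy grouping by interaction counts; an illustration, not Darkstar's algorithm. */
final class InteractionGrouping {
    // Union-find forest: each identity points toward the representative of its group.
    private final Map<String, String> parent = new HashMap<>();

    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        parent.put(id, root); // shorten the path for the next lookup
        return root;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    /** Groups identities whose pairwise interaction count is at least minCount. */
    static Collection<Set<String>> group(List<String[]> interactions, int minCount) {
        InteractionGrouping g = new InteractionGrouping();
        Map<List<String>, Integer> counts = new HashMap<>();
        for (String[] pair : interactions) {
            g.find(pair[0]); // register both identities, even if never merged
            g.find(pair[1]);
            // Order the pair so that (a, b) and (b, a) are counted together.
            boolean inOrder = pair[0].compareTo(pair[1]) <= 0;
            List<String> key = inOrder
                    ? Arrays.asList(pair[0], pair[1])
                    : Arrays.asList(pair[1], pair[0]);
            counts.merge(key, 1, Integer::sum);
        }
        for (Map.Entry<List<String>, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                g.union(e.getKey().get(0), e.getKey().get(1));
            }
        }
        Map<String, Set<String>> groups = new HashMap<>();
        for (String id : new ArrayList<>(g.parent.keySet())) {
            groups.computeIfAbsent(g.find(id), k -> new TreeSet<>()).add(id);
        }
        return new ArrayList<>(groups.values());
    }
}
```

Even this toy version hints at where the difficulty lies: the threshold is arbitrary, the counts never decay as players move on, and a single busy identity can chain otherwise unrelated groups together.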

But if we succeed we will have learned something. This approach to distributed load balancing isn't something anyone has (to our knowledge) done before. If we can do it, we will have learned a number of things. We will have learned, for example, that the social networks that can be derived from the interaction information accurately reflect data access patterns. We will have a principled mechanism for doing distributed load balancing. And we will have drawn a connection between one area of research (determining social networks) and a seemingly unrelated area of research (how to best distribute certain kinds of tasks around a network of processors).

If it doesn't work, we will need to learn why. Maybe the social networks that we discover are too transitory to be useful in load balancing. Maybe we can't determine the networks with enough precision to make this useful. Maybe it will take too long to determine the networks. Maybe it will be all of these things, taken in combination, that keep this from working. Finding that out would be disappointing (after all, we think that it will work, that's why we're trying to do it), but it wouldn't mean that the project itself was a failure, only that the artifact that we built to test our hypothesis showed that the hypothesis was wrong.

If we were building a product, the possibility of these sorts of failures would be the same as the possibility of the failure of the product itself. When building a product, the artifact is the desired end of the work, so when the artifact fails, the project fails. But since we are doing research, finding these sorts of failures is actually a way for the research to succeed. Even in failure (often especially in failure) research programs succeed. Because our real goal is to learn something, not to sell something.

This does not mean that the artifact will never be a product. Sun has a long history of taking things from the lab that started as research artifacts and transferring them (often with the teams that built them) to the product organizations. The first SPARC workstation was an early example of this; the Java programming language is probably the best known. I've been part of these in the past, and it appears to be the trajectory of Project Darkstar. And the fact that the software is an artifact being used to further research does not mean that the software is written to a lesser standard-- indeed, I think that we have to take more care with this kind of software than we would have to take if it were a product (but that will be the subject of a future post).

Nor does this mean that you (as a game developer) shouldn't use the artifact that is Project Darkstar. If you find it useful (and it already deals with multiple cores rather nicely), there is no reason not to use it. The community and the development team give support. The technology is advancing. And there is the possibility that in the future there will be features in this technology that are genuinely unique.

But don't expect us to take the low-risk path. And don't expect to have a list of shipping games or running worlds that are using the technology, or the standard sort of proof-points that are the hallmark of a product. We don't have them, because we aren't a product. Or if we are, we are the kind of product that was common when the computer industry was much younger, when the notion of R&D meant that what was being done was both research and development.

Comments:

Hi, Jim. I suspect you missed whitespace/paragraph formatting information in your post. As-is, it is very hard to read.

Posted by Eduardo Pelegri-Llopart on August 07, 2009 at 11:02 AM PDT #

Hi Jim.

I've been reading your blog with interest and it is giving me an excellent overview of where PDS is going and why, so thanks for that.

I do have one question about the social groups load balancing though.

In our game each player has multiple avatars that they can theoretically have active at any one time. Those avatars move independently through the game world - so in many ways the avatar is a more natural unit than the player in terms of working out connections... but in other ways the player is. (i.e. locational connections go to the avatars and are different for each. Social connections such as guild/chat channels/etc go by the player).

How would the planned system cope with that? Would I need to generate separate logins to the darkstar system for each avatar or will the social networking structure be clever enough to separate out the avatar objects?

Posted by TimB on August 10, 2009 at 03:27 AM PDT #

Hi Tim--

Well, we actually have a notion of an identity that we use to group tasks into units, and which we use to try to determine the social networks. How these identities map to actual users (or groups of users, or groups of avatars that are associated with a particular user) is, to some extent, under the control of the game server developer. Every login will be a different identity, but you can (or, maybe, will be able to-- not sure if this is in the trunk yet) create different identities (for things like NPOs) and could use the same sort of techniques to start off sets of tasks for different avatars. We (the core Darkstar code) can't figure out the right thing in those circumstances, so we do need some guidance from the programmer.

In general, we just assume that different logins are different identities, and anything more fancy than that needs to be supplied by the game developer.

Jim

Posted by Jim Waldo on August 10, 2009 at 04:41 AM PDT #

To follow up on Jim's comment, the ability to create new identities is included as part of the most recent 0.9.10 release. Have a look at the RunWithNewIdentity annotation. When the system runs any task that is annotated with this new annotation, it will generate a new identity for that task and any subsequent child tasks will also be given this new identity. So each of your avatars could potentially be branched off and given its own identity using this mechanism.

Posted by Owen Kellett on August 10, 2009 at 06:17 AM PDT #

Thanks guys, that sounds like it has potential although I'd need to look at it in more detail. (We are currently on 0.9.8 although planning to upgrade from that Soon[tm]).

It isn't something I'm going to worry about too much until we see if it actually poses a problem but it's nice to know we have some options if it does!

Posted by Tim B on August 10, 2009 at 07:12 AM PDT #


This blog copyright 2009 by Jim Waldo