Slouching towards multi-node
For some time now, we in the Project Darkstar group have been thinking about how to distribute a game server over multiple machines in a way that doesn't require the game programmer to have to know about the number of machines on which the server is running. This is really the second (and in some ways most important) step of our scaling strategy. The first step was to allow multiple threads (and therefore multiple cores) to be used without the programmer thinking about threads or cores. This is mostly done, although the programmer does have to think about how to design his or her game in a way that maximizes concurrent access to the data. Now we are trying to spread this out to multiple machines.
The core idea is really pretty simple. Everything that happens inside a game server written on top of Darkstar happens in a task, which is written in Java. When we say "written in Java," what we really mean is that the task compiles to Java bytecodes that can be run on a Java virtual machine; we don't actually care what the source code looks like. These tasks need to access data through the Darkstar data store, and do their communication using Darkstar sessions or channels. This gives us (the folks writing the base Darkstar infrastructure) the ability to move these tasks around from machine to machine. The code is easy to move (since it is Java code). We have to make sure that the data can be accessed from any machine running Darkstar, but that's part of the data service. We have to make changes so that the endpoints of the channel or session will be moved with the task, but that's a mere matter of programming. In fact, all of this has been possible for some time (about a year).
But running different tasks on different machines as part of the same game is the easy part. The hard part is deciding when to move a task or set of tasks, and making the whole thing run fast (or, more precisely, insuring that the latency between a message to the server and a response to the client is minimized, and in no case worse than some upper limit). And this turns out to be pretty interesting; enough so that we can call it research (we are, after all, a part of Sun Labs).
Keeping the latency low requires that almost all of what gets done in any task gets done on a single node. This means that all of the data that is being manipulated or read by a task needs to be cached at the node. So we need to be able to figure out how to position tasks and data so that they are co-located. This might be simple if only a single task or set of tasks was manipulating or reading any piece of data, but in games it is often the case that multiple players (and therefore multiple tasks) are interacting with some of the same data. So there are groups of players that may need to be co-located with any set of data, and we need to be able to make sure that this happens in spite of the fact that which tasks are operating on which data sets changes. We have, in effect, two unknowns (which tasks should be on a particular server, and which objects should be on that server) with no equations. Which makes the solution tricky.
Our current approach starts by making a couple of assumptions. The first is that we can group tasks by the entities on whose behalf the tasks are being performed. For example, all of the tasks that are associated with a particular player will be grouped together and run on a particular server, as will all of the tasks associated with an AI creature or other NPC. This makes our life simpler, as it greatly reduces the number of things we have to worry about (there are lots of tasks per player), but also seems like a reasonable grouping for guaranteed co-location. It also gives us a notion of history that we can use to see what is happening-- if a player is manipulating some set of objects, the tasks associated with that player need to be co-located with that player, and will probably need to be co-located for some time.
This also means that rather than moving individual tasks to balance load, we will be moving all of the tasks that are associated with a particular identity. Since channels and sessions are also associated with identities, this makes the unit of movement make sense. So our current work is based on the idea that identities will cluster tasks, channels, and sessions; and that those clusters are what get moved to balance load.
But those aren't the only things that get moved. Sometimes, we will also need to move data. It is the nature of games that sometimes one player (or set of players) will be manipulating some set of data, and then later that same data set will be manipulated by a different player or set of players. If the players haven't been moved, the set of data will need to be moved to be co-located with the players that are manipulating the data. Whether to move the data or the set of tasks will depend on which move is cheaper, among other things. So we still have two unknowns (move the data, move the player) with no equations, but the granularity is larger.
There are really four problems that need to be addressed to deal with all of this (or, perhaps more honestly, four problems that we are currently addressing in our attempt to solve these problems):
- How do we co-locate data with the tasks or players that are using that data while still keeping the data at least conceptually as a shared resource available from any node on the network?
- How do we move a set of tasks from one server to another without causing a disruption of play for the client being moved?
- How do we decide when a server is being overloaded and therefore needs to have some players (and, perhaps, data) moved from that server to another?
- When a server is overloaded, how do we decide which players (and, perhaps, data) to move to another node?