keith thompson

tales from the darkstar


« Case study: Improvin... | Main | I can write that... »
Friday Oct 03, 2008

Two wrongs don't make a right, ...

...but five rights will get you back on the highway...1 One of the givens of Project Darkstar application development that I've picked up is that if you know you are going to modify a ManagedObject you should inform the server as early as possible. A related belief is that turning off modification detection should be a good thing. (See Seth's blog for additional comments) With these two beliefs in mind we wanted to verify them using the Project Snowman game server. Seth has posted a good description of the ins and outs of managed objects and serialization, and why the two claims above should be true. What I want to cover here is how the theory was applied to a "real", OK "demo" application, how it was tested, and what were the results.

With the goal of being able to turn off Darkstar's modification detection, the first order of business is to make sure all modified managed objects (MOs) are marked for update before the task ends. This is done in one of two ways, if the MO is in hand, you can make a call to DataManaged.markForUpdate(). If you are getting a MO from a ManagedReference you can call ManageReference.getForUpdate()instead of get(). Both of these calls will inform the server that these obects will be modified and shoulw be made persistent when the task ends. Currently the best way (only way) to go about adding these calls to your application is via code inspection.  With the Snowman server this appeared easy because of the small number of classes. I suppose there are many ways to approach the source but what made some sense was to take a top down look, starting with the AppListener and ClientSessionListener classes. After a generous sprinkling of markForUpdate() calls, the next step is to test whether everything has been properly marked. Luckily, Darkstar provides a little help with this. The modification detection code makes a log entry each time the check finds an object that has been modified (but not marked). You can enable this logging by adding the following to the server's logging properties:

com.sun.sgs.impl.service.data.DataServiceImpl.detect.modifications.level = ALL

The logging messages includes the class of the object that was found to be modified buy not much more. Also, the logging is made in the modification detection code (as you would expect) which is happening well after your task has ended. So the logs tell you that an object has been modified, but not where or by whom. This made finding that one last pesky state modification a challenge.2 You also have to be careful that all code paths are executed. The snowman server is fairly simple, but in a much more complex application this could be quite difficult to verify.

With the code properly marked for update it was time to test it with the detection turned off. This is done by setting the following server property:

com.sun.sgs.impl.service.data.DataServiceImpl.detect.modifications=false

In my previous post I mentioned using the profile listener SnapshotProfileListener. Though useful, the data form this profiler is fairly course grained. To get a more detailed view into the server I used the AggregateProfileListener which produces data on task execution over the life of the server. As before the listener can is enabled by adding this to the server's properties:

com.sun.sgs.impl.kernel.profile.listeners=\
com.sun.sgs.impl.profile.listener.AggregateProfileListener:\
com.sun.sgs.impl.profile.listener.SnapshotProfileListener

Note that this property enables two listeners. The client simulator displays output from the SnapshotProfileListener in its control window. The output from AggregateProfileListener can be viewed by connecting to port 43005 via telnet:

telnet server.host 43005

A portion of its output looks like this:

TaskCounts:
TotalTasks=4000162  AvgLength=0.37ms
Transactional=2194463  AvgLength=0.60ms
Successful=3875770  AvgLength=0.38ms  AvgStartDelay=12.47ms  AvgRetries=1.03

The interesting number here are the task lengths. My test procedure was to run 1000 simulated clients against the server and wait for one million, and then four million tasks to execute. I figured that this was enough for the JVM to be "warmed-up" and any server initialization to be minimized in the averages. I ran the server with the modification detection enabled (the default) and then turned-off. At the 1 and 4 million task marks I checked the task average lengths expecting to see an improvement (shorter task times) when modification detection was disabled.3 To my surprise (the first of several) there was no significant difference. The surprise was short lived when I realized that in the snowman server it is rare for an object to be accessed but not modified. So in the default case, very few checks were wasted and removing those extra checks (by marking the objects) had very little impact.

OK, things made sense, for at least ten minutes

My original surprise had turned to disappointment and in an effort to justify my work to this point I wondered how much better the server with markForUpdate() calls and no detection preformed over the server with no calls. I grabbed the original server code, and to make sure I would see the biggest difference between the two approaches I removed all of the markForUpdate() calls from the server. I then tested the two servers and indeed found a significant performance difference - the server without the detection was... slower!? My surprise here took somewhat longer to dissipate.

To understand what was happening, lets recap (from Seth's post) how the modification detection works. When a task ends, any object accessed and not marked for update is serialized and its serialized form is compared with the original. If they are different, the object is marked for update and is written-out when the task's transaction ends. In order to write-out an object it must be serialized, so in the case of a check discovering that an object has been modified the only extra overhead was the compare. Only if the object was not changed is the serialization a wasted step. In the snowman's case all of the objects are pretty small, so the compare was cheap.

The second piece of the puzzle is what markForUpdate() was doing. It turns out that markForUpdate() causes an operation in the data store (and ultimately the DB) so is not without cost. So in the two servers we had the following series of operations for objects that are modified:

A. Server with markForUpdate() calls
  1. get managed object - DB operation (returns serialized object)
  2. markForUpdate() - DB operation
  3. serialize (marked) object
  4. write serialized object - DB operation

B. Server using only modification detection

  1. get managed object - DB operation (returns serialized object)
  2. serialize object
  3. compare with original
  4. write serialized object - DB operation
The extra operation in A is step 2, the markForUpdate() and associated DB operation. The extra operation in B is step 3, comparing two serialized objects. It seemed clear from the data that the DB operation is much more expensive than the compare. Though understandable it does fly in the face of current thinking in regard to marking objects. Could it be that marking is always a bad thing? Though I do not (yet) have data to back it up, it would seem intuitive that if the application access lots of objects for read-only, trying to skip the serialization in an unnecessary check could be of benefit. Another thing to consider is whether or not you know for sure that an object will be modified before it is accessed. In these cases calling DataManager.getForUpdate() instead of get() + markForUpdate() should be an improvement. Which brings us to my next discovery.

The ride is not quite over

When I discovered that modification detection was faster than marking objects I wanted to see how the difference scaled. The default settings of the snowman server is that two clients are needed for a game to star and each game has two robot players. At a 1000 clients that comes out to 500 games and 2000 players in all (1k clients + 1k robots). At that load the server takes up about 25% of the test host. An easy way to increase the load on the snowman server is to increase the number of robots per game. I could increase the number of clients, but that would require another test machine to run a second client simulator. Taking the path of least resistance, I set the number of robots per game to 10 and re-ran the tests. Yet again, the results were surprising. The markForUpdate() server was faster than the server doing modification detection. Completely opposite the results with two robots per game!

Each snowman game is independent and other then login, there is very little contention in the system outside of the game. Within the game, player objects can contend for each other and the flag objects. So the difference between the server with two robots per game and 10 per game is more than just load, it is also the amount of contention. Given this, the theory on the performance reversal is due to there being a lot more contention and marking for update helps because conflict is found sooner rather then later. When objects are marked, conflict can be detected at the markForUpdate() call, and if an object is marked, some other task trying to access it (for read or write) will block, preventing that task from going forward, only to fail later. Thats my theory and I'm stikcing to it. At least until I run the next test...4

Things don't always turn out like you expect

The take-away from all of this is, if there is one, is that marking for update is bad, unless it is good. Actually the real moral here is that applications are different and you will need to test with different options to find what is best. Also, sometimes code behaves different than you expect, even if you helped write it.


1"Two wrongs don't make a right, but five rights will get you back on the highway..." refers to the rule from way back, used by Sun East Coast travelers to remember how to get on Rt 101 south from the rental car lot at the San Francisco airport. I often think of it when I see something take a lot of twists and turns. Original author unknown. (Note: things have changed A LOT at SFO, so the rule is no longer valid. At least for getting onto 101)

2The challenge turned out to be one of those DOH! moments. I was certain I had marked all of the cases where the robot changed their own game state (position, hit points, etc). I was also careful to not mark the cases where the game state was not modified. No sense in writing an object that was not changed. Now the way the robot moves is that a task is run periodically that causes the robot to move, attack, etc. and the way that task is scheduled is by calling TaskManager.scheduleTask() at the end of the task. The actual scheduling code looks like this:


private void scheduleMove(int delay) {
    AppContext.getTaskManager().scheduleTask(
        new MoveTask(appContext.getDataManager().createReference((RobotImpl)this)),
        delay + random.nextInt(500));
}

The reason the task reschedules itself, instead of using TaskManager.schedulePeriodicTask() when the robot is created is so I can add some randomness into the robot's movement, i,e, the delay + random.nextInt(500) in the code above. This line causes the robots to move after a delay that varies by a half of a second. And this was my problem. Calling random.nextInt() changed the state of the random object, and therefore the state of the robot. This happens every time the robot reschedules itself, even if the robot did not change its game state!

I briefly considered using a static random number generator such as Math.random() but this would be a synchronizing point for all the robots in the server. Not a good thing. Also, it turns out that the robot task almost always changes its game state. The only cases where the game state is not modified is if the task runs while the game is being initialized and if it runs after the game ends. In that case the the task is not rescheduled. So, after all of this, what made sense was to remove all of the markForUpdate() calls in the robot and simply do a getForUpdate() when the task starts! Check out the run() method in the robot's move task:

static private class MoveTask implements Task, Serializable {
    private static final long serialVersionUID = 1L;
    final ManagedReference<RobotImpl> robotRef;
        
    MoveTask(ManagedReference<RobotImpl> robotRef) {
        this.robotRef = robotRef;
    }

    public void run() throws Exception {
        try {
            robotRef.getForUpdate().moveRobot();
        } catch (ObjectNotFoundException gameDone) {}
    }
}

3I did verify that the task length being reported does include the activity that occurs outside of the application code and includes the modification detection.

4Note that all of the tests described in this note were run with the server using Berkeley DB Java Edition. The C version behaves somewhat differently due to in the way the two version preform locking and can produce very different results when running the same application. At two robots per game, the two versions showed similar relative behavior across the different tests. At 10 robots per game the C version failed to handle the load when modification detection was enabled. For the snowman game the JE seems to scale much better than the C version. Exactly why is not completely understood, and may be worth a post just on that topic. So for now, deciding which DB is best for any given application is hard without actually testing both.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed

Today's Page Hits: 12