by Rick Palkovic
Sun Staff Writer
How do you improve the ability of your JEE web application to handle large numbers of transactions? Broadly speaking, the choices are to scale-up or scale-out.
Scaling up means expanding the resources of a fixed number of servers, by adding memory, mass storage, or increasing processor speed. Scaling out means increasing the number of commodity servers in the deployment. Each approach has its proponents, but scaling out has become best practice for most situations.
After committing to a scale-out strategy, what is the best way to execute it? Cameron Purdy tackled this problem in TS-6339, "Top 10 Patterns for Scaling Out Java Technology-Based Applications," in which he weighed the merits of several scale-out strategies and technologies. It’s definitely worth checking out the presentation slides and transcript when they become available on the JavaOne site, if only for the humor. Meanwhile, here are a few of Purdy’s thoughts.
The Problem
Scaling is an issue only for stateful data systems. For stateless data, you need do little more than add servers to linearly increase service. For stateful data, the system must be partitioned for scale-out, and the partitions must be coordinated.
Unfortunately, scaling out stateful systems adds complexity, and greater scale means greater complexity. Complexity increases the path length data ultimately has to travel to reach the user. And long data paths increase the opportunities for latency, the bane of large-scale deployments.
Multithreaded Thinking
Most of us learned to program in a single-threaded mindset, where data predictably goes where it’s sent and functions are called in order. A highly scaled deployment is by its nature multithreaded, and the timing of data migration and function calls is determined by an automated manager. Any requirement to execute function calls in a particular order creates a queue, and by definition a latency bottleneck.
Latency
Latency is at its worst when it can’t be predicted. The more complex the system, the more difficult it is to guarantee that a request will be handled in a given period of time. This issue is not particularly important for many applications, but absolutely critical for time sensitive transactions such as financial transactions.
Java has a real-time spec, but in general Java is not deterministic from a latency viewpoint, chiefly for the following reasons:
- It collects garbage at seemingly random times
- It’s a multithreaded system, and your thread could be preempted
- I/O appears nondeterministic
For these reasons, Java is rarely used for true real-time systems.
Read Consistency
Another issue is read consistency. In distributed systems only two things can be moved around: state data information and the functions that process it. When you move state data in a distributed system, whether with caching or some kind of synchronous transaction system, you have two copies of the data, at least temporarily, that must stay in sync.
Durability
When you create a system that must be continuously available, it requires some form of durability. For example, you must be able to recover servers, safeguard data from loss, and so on. Durability presents a couple of challenges. First, there is a latency cost to keeping data on a disk network. Second, you have to coordinate between readers and writers.
Architectural Limits to Scalability
High-latency operations that must occur in sequence limit scalability. Globally ordered operations limit scalability. Data hotspots limit scalability. In short, if there is some point in your system where data has to line up for processing, you will have a scalability bottleneck.
In stateful systems, the challenge is to achieve availability, reliability, performance, and predictability -- including predictable performance -- in systems that can still be easily managed and serviced.
There are costs to meeting these challenges. The goal is a system that scales linearly as you add resources and capacity. In other words, each additional server you add will support at least the same number of users as an existing server.
The Solution
The collective solution is composed from some or all of these approaches: routing, partitioning, replication for availability, coordination across multiple partitions, and messaging for asynchronous updates across partitions.
Routing
In a stateful system, the purpose of routing is to make sure you bring together processing and data at the right time, in the right place. Routing moves information to servers where there is locality of state. Locality of state lets processing occur without coordination across the network.
Routing lets you take advantage of additional servers by spreading the incoming load. Routing supports partitioning, the ability to break down processing across multiple servers.
Partitioning
Partitioning designates ownership of state and responsibilities across a distributed system. Because tasks are partitioned, a single-point failure will only affect a portion of the system. In a best-case scenario, through replication, you can eliminate the loss of any information. So, partitioning makes replication possible.
Replication for Availability
But once you have partitioning, you have a big problem, because it poses a limit to scale-out. Once you start pulling information from multiple servers and pushing information to them, you have issues of transaction that involve possible data loss and inaccuracy. Once you start coordinating across partitions and servers, things become much more complicated. Enter the issue of coordination.
Coordination Across Multiple Partitions
Coordination is what makes scalability difficult for stateful systems. But good coordination simplifies the programming and supports more synchronous behavior. The alternative to using coordination is to use messaging. So, instead of coordinating synchronous operations across multiple partitions, you can use messaging to make the processing stay synchronous.
Messaging for Asynchronous Updates Across Partitions
Messaging queues let you reliably hand off responsibilities between partitions. If one partition needs to initiate processing in another partition, it can send a high priority message. However, actually waiting for confirmation that the processing has been completed would be coordination.
Message queues are durable because they make message information survivable. If the sending partition dies, the message will not be lost, and the receiving partition will process as requested.
Message queues support ordering of operations. Sometimes it’s important to perform operations in sequence -- a challenge in multithreaded systems.
Queuing has one more important aspect that supports scalability. It lets you batch operations together.
Messaging Technologies
There are many messaging options, with varying reliability, latency, and ordering guarantees.
- UDP/IP: A low level protocol, but with low reliability -- there is no guarantee that messages will actually arrive, since they can be dropped without raising an error condition. Good for predictable latency, however, since messages are not stored.
- TCP/IP: More reliable than UDP, since the protocol establishes a two-way stream-oriented conversation. TCP/IP also has an easier programming model than UDP. One disadvantage is that latency depends on network load, and is unpredictable for practical purposes.
- HTTP: Has the advantage of simplicity because it is a slightly higher level protocol. Its request-response model makes it fairly reliable.
Conclusion
In stateful systems, the challenge is to achieve availability, reliability, performance, predictability, including predictable performance, in systems that can still provide manageability and serviceability.
The collective solution is composed from some or all of these: routing, partitioning, replication for availability, coordination across multiple partitions, and messaging for asynchronous updates across partitions.
And, keep it simple. If you can’t show how your deployment works on a white board, it probably won’t work in production.
Learn more about scale-out deployments on developers.sun.com.