Ramblings from Richard's Ranch

Big Clusters and Deferred Repair

Wednesday Feb 20, 2008

When we build large clusters, such as high performance clusters or any cluster with a large number of computing nodes, we begin to look in detail at the repair models for the system. You are probably aware of the need to study power usage, air conditioning, weight, system management, networking, and cost for such systems. So you are also aware of how multiplying the environmental needs of one computing node by the number of nodes can produce a large number. This is intuitive for most folks. But availability isn't quite so intuitive. Deferred repair models can also change the intuition behind a design. So, I thought that a picture would help show how we analyze the RAS characteristics of such systems and why we always look to deferred repair models in their design.

To begin, we have to make some assumptions:

  • The availability of the whole is not interesting.  The service provided by a big cluster is not dependent on all parts being functional. Rather, we look at it like a swarm of bees. Each bee can be busy, and the whole swarm can contribute towards making honey, but the loss of a few bees (perhaps due to a hungry bee eater) doesn't cause the whole honey producing process to stop. Sure, there may be some components of the system which are more critical than others, like the queen bee, but work can still proceed forward even if some of these systems are temporarily unavailable (the swarm will create new queens, as needed). This is a very different view than looking at the availability of a file service, for example.
  • The performability might be interesting. How many dead bees can we have before the honey production falls below our desired level? But for very, very large clusters, the performability will be generally good, so a traditional performability analysis is also not very interesting. It is more likely that a performability analysis of the critical components, such as networking and storage, will be interesting. But the performability of thousands of compute nodes will be less interesting.
  • Common root cause failures are not considered. If a node fails, the root cause of the failure is not common to other nodes. A good example of a common root cause failure is loss of power -- if we lose power to the cluster, all nodes will fail. Another example is software -- a software bug which causes the nodes to crash may be common to all nodes.
  • What we will model is a collection of independent nodes, each with their own, independent failure causes.  Or just think about bees.
For a large number of compute nodes, even using modern, reliable designs, we know that the probability of all nodes being up at the same time is quite small. This is obvious if we look at the simple availability equation:
Availability = MTBF / (MTBF + MTTR)

where MTBF (mean time between failures) for the cluster is MTBF[compute node] / N[nodes],
and MTTR (mean time to repair) is > 0

The killer here is N. As N becomes large (thousands) and MTTR is dependent on people, then the availability becomes quite small. The time required to repair a machine is included in the MTTR. So as N becomes large, there is more repair work to be done. I don't know about you, but I'd rather not spend my life in constant repair mode, so we need to look at the problem from a different angle.
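
To see how quickly N erodes the availability, here is a minimal Python sketch of the equation above. The function name and the inputs (a 100,000 hour node MTBF and a 24 hour MTTR) are assumptions made up for illustration, not measurements from a real cluster.

```python
def all_up_availability(node_mtbf_hours, n_nodes, mttr_hours):
    """Availability of the 'every node is up' state, per the equation above."""
    # Treat the cluster as a series system: any single node failure counts
    # as a failure, so the cluster MTBF is the node MTBF divided by N.
    cluster_mtbf = node_mtbf_hours / n_nodes
    return cluster_mtbf / (cluster_mtbf + mttr_hours)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the trend as N grows.
    for n in (1, 10, 100, 1000, 2000):
        a = all_up_availability(100_000.0, n, 24.0)
        print(f"N = {n:5d}  availability = {a:.4f}")
```

The only thing that changes from row to row is N, which is the point: the node itself does not get any less reliable, there are just more of them.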

If we make MTTR large, then the availability will drop to near zero. But if we have some spare compute nodes, then we might be able to maintain a specified service level. Or, from a practical perspective, we could ask the question, "how many spare compute nodes do I need to keep at least M compute nodes operational?" The next, related question is, "how often do we need to schedule service actions?" To solve this problem, we need a model.

Before I dig into the model results, I want to digress for a moment and talk about Mean Time Between Service (MTBS) and Mean Time Between System Interruption (MTBSI).  I've blogged in detail about these before, but to put their use in context here, we will actually use MTBSI and not MTBF for the model.  Why? Because if a compute node has any sort of redundancy (ECC memory, mirrored disks, etc.) then the node may still work after a component has failed. But we want to model our repair schedule based on how often we need to fix nodes, so we need to look at how often things break for the two cases. The models will show us those details, but I won't trouble you with them today.

The figure below shows a proposed 2000+ node HPC cluster with two different deferred repair models. For one solution, we use a one week (168 hour) deferred repair time. For the other solution, we use a two week deferred repair time. I could show more options, but these two will be sufficient to provide the intuition for solving such mathematical problems.

Deferred Repair Model Results 

We build a model showing the probability that some number of nodes will be down. The OK state is when all nodes are operational. It is very clear that the longer we wait to repair the nodes, the less probable it is that the cluster will be in the OK state. I would say that with a two week deferred maintenance model, there is nearly zero probability that all nodes will be operational. Looking at this another way, if you want all nodes to be available, you need to have a very, very fast repair time (MTTR approaching 0 time). Since fast MTTR is very expensive, accepting a deferred repair and using spares is usually a good cost trade-off.
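
A rough way to check the intuition about the OK state is the sketch below. It is not the model behind the figure. It assumes independent nodes with exponentially distributed times to interruption, no repairs until the end of the deferred interval, and a made-up node MTBSI of 50,000 hours. Under those assumptions, the chance that every node is still up at the end of the interval is exp(-N * T / MTBSI).

```python
import math


def prob_all_ok(n_nodes, interval_hours, mtbsi_hours):
    """P(no node has failed by the end of the deferred repair interval)."""
    # Each node survives the interval with probability exp(-T / MTBSI);
    # independence lets us multiply that across all N nodes.
    return math.exp(-n_nodes * interval_hours / mtbsi_hours)


if __name__ == "__main__":
    # 2000 nodes and a 50,000 hour MTBSI are illustrative assumptions.
    for label, hours in (("one week", 168.0), ("two weeks", 336.0)):
        p = prob_all_ok(2000, hours, 50_000.0)
        print(f"{label:9s} P(all nodes OK) = {p:.2e}")
```

Doubling the interval squares the probability, which is why the OK state all but disappears in the two week case.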

OK, so we're convinced that a deferred repair model is the way to go, so how many spare compute nodes do we need? A good way to ask that question is, "how many spares do I need to ensure that there is a 95% probability that I will have a minimum of M nodes available?" From the above graph, we would accumulate the probability until we reached the 95% threshold. Thus we see that for the one week deferred repair case, we need at least 8 spares and for the two week deferred repair case we need at least 12 spares. Now this is something we can work with.
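
The "accumulate until we reach 95%" step can be sketched in a few lines of Python. This is a simplified stand-in, not the model that produced the numbers above. It assumes independent nodes, exponentially distributed times to interruption, a binomial count of down nodes at the end of the interval, and a hypothetical node MTBSI of 50,000 hours, so do not expect it to reproduce the 8 and 12 from the graph.

```python
import math


def prob_node_down(interval_hours, mtbsi_hours):
    """Chance that one node fails before the next scheduled service visit."""
    return 1.0 - math.exp(-interval_hours / mtbsi_hours)


def spares_needed(n_nodes, p_down, target=0.95):
    """Smallest k such that P(at most k nodes down) >= target.

    Assumes 0 <= p_down < 1 and treats n_nodes as the whole population.
    """
    pmf = (1.0 - p_down) ** n_nodes  # P(exactly 0 nodes down)
    cumulative = 0.0
    for k in range(n_nodes + 1):
        cumulative += pmf
        if cumulative >= target:
            return k
        # Binomial recurrence: get P(k + 1 down) from P(k down).
        pmf *= (n_nodes - k) / (k + 1) * p_down / (1.0 - p_down)
    return n_nodes


if __name__ == "__main__":
    # 2000 nodes and a 50,000 hour MTBSI are illustrative assumptions.
    for label, hours in (("one week", 168.0), ("two weeks", 336.0)):
        p = prob_node_down(hours, 50_000.0)
        print(f"{label:9s} spares for 95%: {spares_needed(2000, p)}")
```

The shape of the answer is what matters: a longer deferred interval pushes the 95% point further out, so you trade service visits for spare nodes.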

The model results will change based on the total number of compute nodes and their MTBSI. If you have more nodes, you'll need more spares. If you have more reliable or redundant nodes, you need fewer spares. If we know the reliability of the nodes and their redundancy characteristics, we have models which can tell you how many spares you need.

This sort of analysis also lets you trade off the redundancy characteristics of the nodes to see how that affects the system, too. For example, we could look at the effect of zero, one, or two (mirrored) disks per node on the service levels. I personally like the zero disk case, where the nodes boot from the network, and we can model such complex systems quite easily, too. This point should not be underestimated: as you add redundancy to increase the MTBSI, you also add parts that can break, which decreases the MTBS and impacts your service costs.  The engineer's life is a life full of trade-offs.

 

In conclusion, building clusters with lots of nodes (red shift designs) requires additional analysis beyond what we would normally use for critical systems with few nodes (blue shift designs). We often look at service costs using a deferred service interval and how that affects the overall system service level. We also look at the trade-offs between per-node redundancy and the overall system service level. With proper analysis, we can help determine the best performance and best cost for large, red shift systems.

 

 


Freak Valentine's Day Snowstorm

Sunday Feb 17, 2008

Every once in a while, they get it wrong.  Very wrong.  As a rancher, I tend to pay attention to the weather report. Though it doesn't rain very often in Southern California, it can still ruin your day, or at least make ranch chores a messy endeavor. This week had been a more typical week, mostly sunny, highs in the 60s-70s, lows in the 40s, last week's rains a distant memory. Today's forecast was more of the same, with a slight chance of drizzle in the morning as a cold front passed. "No big deal!" claimed meteorologist John Coleman. So, when morning came with a light sprinkle, we weren't really surprised. If it drizzles down at Lindbergh Field, where the official San Diego weather is measured, it might sprinkle up here in the mountains. No big deal. 

By lunchtime I figured we had about 1/4 of an inch of rain, and I was beginning to wonder when the sun would break through the clouds and bring the promised 70 degrees of sunshine. Alas, it was still mostly cloudy. Regina went into town to run some errands, while I joined a conference call.  During the call, I noticed that the wind was picking up, mostly from the northeast.  Normally when the winds blow from the northeast, off the deserts, they are dry and will clear up any fog or drizzle rather quickly. But during the conference call, it sounded like hail was hitting the window.

Then the lightning started.  OK, that was odd.  Sure, we do get thunderstorms every once in a while, and pea-sized hail often accompanies them. The wind was blowing stronger now and I was beginning to think that the drizzle forecast was a bit optimistic. One hour on the conference call down, hopefully we'll wrap up soon.

Suddenly, Regina burst into the office amongst a flurry of snow and ice, looking like Nanook coming in from a blizzard.
What the...?

"Hi sweetie!  Is it hail?"

"No! It is snowing and icing and I had to park down at the barn and walk up the hill to the house!"

Snow?!?  Sure enough, behind Regina it looked rather... white.  How can this be?  Forecast partly sunny, 70s.

After the call, I trudged outside to see what was up.  Sure enough, snow everywhere.  The wind was howling, and more snow was coming.  Absolutely no sign of the sun.  Rats!  I don't even like snow!

A quick look towards the highway confirmed that everything was falling apart. The few intrepid travelers were trying to negotiate the curves without kissing the boulders, and I knew my plans were dashed.  I had everything worked out well in advance. Conference call after lunch. Regina off running errands.  A quick dash into town to pick up the Valentine's Day flowers and gift.  Swing by the grocery for some fresh seafood and a nice bottle of wine. Dinner was going to be awesome, followed by sweet kisses. Now this. Snow!  If I wanted to live where it snowed, I would live somewhere else.  In the eight years here at the ranch, we'd only seen a few dustings at this altitude, nothing that would stick. It was nearly 70 degrees yesterday, there is no way this would stick, or so I hoped.

Now, I had to work on plan B. As a RAS guy, I always have a plan B and plan C, just in case, with a plan D for dire emergencies. We started the evening chores early, even though it was still snowing and blowing. By dusk it had mostly stopped snowing at the ranch as the storm passed to the south. I took a picture of Swanson, our Black Swan, who was not at all happy with the weather.

Swanson and the snow

 

Well, Valentine's dinner worked out ok. The flowers were a day late, but still pretty. We received about three inches of slushy snow, most of which melted before freezing later in the evening. The surprise snowstorm caused a bunch of accidents and stranded hundreds of motorists. The really odd thing was that none of the weather forecasters saw it coming. I'm sure they will blame the forecasting models or data collection, but at the end of the day, Swanson still won't believe them... they just blew it.

 
