Ramblings from Richard's Ranch

Sample RAIDoptimizer output

Tuesday Sep 02, 2008

We often get asked, "what is the best configuration for lots of disks" on the ZFS-discuss forum. There is no one answer to this question because you are really trading off performance, RAS, and space.  For a handful of disks, the answer is usually easy to figure out in your head.  For a large number of disks, like the 48 disks found on a Sun Fire X4540 server, there are too many permutations to keep straight.  If you review a number of my blogs on this subject, you will see that we can model the various aspects of these design trade-offs and compare.

A few years ago, I wrote a tool called RAIDoptimizer, which will do the math for you for all of the possible permutations. I used the output of this tool to build many of the graphs you see in my blogs.

Today, I'm making available a spreadsheet with a sample run of the permutations of a 48-disk system using reasonable modeling defaults.  In this run, there are 339 possible permutations for ZFS.  The models described in my previous blogs are used to calculate the values.  The default values are not representative of any specific disk; they are merely ballpark figures.  The exact numbers are not as important as the relationships exposed when you compare different configurations.  Obviously, the tool allows us to change the disk parameters, which are usually available from disk data sheets.  But this will get you into the ballpark, and is a suitable starting point for making some trade-off decisions.
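To give a feel for what "permutations" means here, the following is a minimal sketch of the enumeration step only. It is not RAIDoptimizer and will not reproduce the 339 rows in the spreadsheet; the names `Layout` and `enumerate_layouts`, the allowed set widths, and the spare limit are all assumptions made for illustration, and the usable-space figure ignores metadata overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    kind: str       # "mirror", "raidz", or "raidz2"
    set_width: int  # disks per redundancy set
    sets: int       # number of sets striped together
    spares: int     # disks held back as hot spares

    @property
    def data_disks(self) -> int:
        """Disks' worth of usable space, ignoring metadata overhead."""
        if self.kind == "mirror":
            return self.sets  # one disk of space per mirror set
        parity = {"raidz": 1, "raidz2": 2}[self.kind]
        return self.sets * (self.set_width - parity)


def enumerate_layouts(total_disks, max_spares=4):
    """Yield layouts in which every disk is either a set member or a spare.

    The width limits below are illustrative assumptions, not rules taken
    from RAIDoptimizer.
    """
    min_width = {"mirror": 2, "raidz": 3, "raidz2": 4}
    for spares in range(max_spares + 1):
        usable = total_disks - spares
        for kind in ("mirror", "raidz", "raidz2"):
            widest = 4 if kind == "mirror" else usable
            for width in range(min_width[kind], widest + 1):
                if usable % width == 0:
                    yield Layout(kind, width, usable // width, spares)


if __name__ == "__main__":
    layouts = sorted(enumerate_layouts(48), key=lambda lay: -lay.data_disks)
    for lay in layouts[:10]:
        print(lay, "data disks:", lay.data_disks)
```

A real comparison would attach performance and reliability estimates to each layout, which is where the models from the earlier posts come in.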

For your convenience, I turned on the data filters for the columns so that you can easily filter the results. Many people also sort on the various columns.  StarOffice or OpenOffice will let you manipulate the data until the cows come home.  Enjoy.


Dependability Benchmarking for Computer Systems

Wednesday Aug 20, 2008

Over the past few years, a number of people have been working to develop benchmarks for dependability of computer systems. After all, why should the performance guys have all of the fun? We've collected a number of papers on the subject in a new book, Dependability Benchmarking for Computer Systems, available from the IEEE Computer Society Press and Wiley.

The table of contents includes:

  1. The Autonomic Computing Benchmark
  2. Analytical Reliability, Availability, and Serviceability Benchmarks
  3. System Recovery Benchmarks
  4. Dependability Benchmarking Using Environmental Test Tools
  5. Dependability Benchmark for OLTP Systems
  6. Dependability Benchmarking of Web Servers
  7. Dependability Benchmark of Automotive Engine Control Systems
  8. Toward Evaluating the Dependability of Anomaly Detectors
  9. Vajra: Evaluating Byzantine-Fault-Tolerant Distributed Systems
  10. User-Relevant Software Reliability Benchmarking
  11. Interface Robustness Testing: Experience and Lessons Learned from the Ballista Project
  12. Windows and Linux Robustness Benchmarks with Respect to Application Erroneous Behavior
  13. DeBERT: Dependability Benchmarking of Embedded Real-Time Off-the-Shelf Components for Space Applications
  14. Benchmarking the Impact of Faulty Drivers: Application to the Linux Kernel
  15. Benchmarking the Operating System against Faults Impacting Operating System Functions
  16. Neutron Soft Error Rate Characterization of Microprocessors

Wow, you can see that there has been a lot of work by a lot of people to measure system dependability and improve system designs.

The work described in Chapter 2,  Analytical Reliability, Availability, and Serviceability Benchmarks, can be seen as we are beginning to publish these benchmark results in various product white papers:

Performance benchmarks have proven useful in driving innovation in the computer industry, and I think dependability benchmarks can do likewise. If you feel that these benchmarks are valuable, then please drop me a note, or better yet, ask your computer vendors for some benchmark results.

I'd like to thank all of the contributors to the book, the IEEE, and Wiley. Karama Kanoun and Lisa Spainhower worked tirelessly to get all of the works compiled (herding the cats) and interfaced with the publisher, great job! Ira Pramanick, Jim Mauro, William Bryson, and Dong Tang collaborated with me on Chapters 2 & 3, thanks team!


Smartphones will rule the earth!

Thursday Aug 07, 2008

A few years ago, I put up the first, freely available WiFi hotspot in Ramona at the Ramona Cafe. I hope you think this was an altruistic move, but in reality it often saved me a commute into La Jolla, so it was well worth the investment. "How will we pay for it?" some asked, to which I often reply, "buy some pie!"

Naturally, I have the DHCP logs sent to me so that I can keep track of how it is being used.  At first, there were only a few, regular customers.  Now there are many regular customers.  For the most part, the type of machine being used is relatively obvious.  People have a tendency to name their laptops based on the vendor logo.  So I would see a number of devices named something like "Richard's Dell" or "Richard's Mac."  Others have boring names like "ASSIGNED LAPTOP 573" or some such... boring!

Today, however, I'm seeing a significant change in the machines connecting to the net at the cafe. I'm seeing a large number of "iPhone" and even a few "HTC-8900" devices stopping by.  This is very cool and reinforces the notion that once we can shrink computers to fit in your pocket, then everyone will want one.  Smart phones will rule the earth!

As an MBA who didn't do particularly well in marketing class, even I can say it is very obvious who does better branding between Apple (iPhone) and AT&T (tilt, aka HTC-8900).  Let's face it, something as intimate as a phone is going to get a human-like name.  "iPhone" works as an excellent brand.  "HTC-8900" sucks as a brand and there is absolutely no connection between "tilt" and "HTC-8900."  I had to do a Google search just to know that the "HTC-8900" is also known as a "tilt."  Why would someone even name a phone "tilt," which has the connotations of draining out of the pinball game?  Go figure. My marketing advice to AT&T: get with the program and work your brands!


More Enterprise-class SSDs Coming Soon

Tuesday Aug 05, 2008

Sun has been talking more and more about enterprise-class solid-state disks (SSDs) lately. Even Jonathan blogged about it. Now we are starting to see some interesting articles hitting the press as various companies prepare to release interesting products for this market.

Today, CNET posted an interesting article by Brooke Crothers that offers some insight into how the consumer and enterprise class devices are diverging in their designs.  My favorite quote is, "One of the things that SSD manufacturers have been slow to learn (is that) you can't just take a compact flash controller, throw some NAND on there and call it an SSD," said Dean Klein, vice president of memory system development at Micron. Yes, absolutely correct.  Though Sun makes several products which offer compact flash (CF) for storage, the future of enterprise class SSDs is not re-badged CFs.  There are many more clever tricks that can be used to provide highly reliable, fast, and reasonably priced SSDs.


ZFS Workshop at LISA'08

Monday Jul 14, 2008

We have organized a ZFS Workshop for the USENIX Large Installation Systems Administration (LISA'08) conference in San Diego this November. I hope you can attend.

The call for papers describes workshops as:
One-day workshops are hands-on, participatory, interactive sessions where small groups of system administrators have an opportunity to discuss a topic of common interest. Workshops are not intended as tutorials, and participants normally have significant experience in the appropriate area, enabling discussions at a peer level. However, attendees with less experience often find workshops useful and are encouraged to discuss attendance with the workshop organizer.

There is an opportunity to seed the discussions, so be sure to let me know if there is an interesting topic to be explored. 

The LISA conference is always one of the more interesting conferences for people who must deal with large sites as their day job. Many of the more difficult scalability problems are discussed in the sessions and hallways. If you are directly involved with the design or management of a large computer site, then it is an excellent conference to attend.

My first LISA was LISA-VI in 1992 where I presented a paper that Matt Long and I wrote, User-setup: A System for Custom Configuration of User Environments, or Helping Users Help Themselves, now hanging out on SourceForge. The original source was published on usenet -- which is how we did such things at the time. I suppose I could search around and find it archived somewhere...

Much has changed from the environments we had in 1992, but the problem of managing complex application environments continues to live on. I think that the more modern approach to this problem, as clearly demonstrated by connected devices like the iPhone, is to leverage the internet and browser-like interfaces to hide much of the complexity behind the scenes.  In a sense, this is the approach ZFS takes to managing disks -- hide some of the mundane trivia and provide a view of storage that is more intuitive to the users of storage. The more things change, the more the problems stay the same.

Please attend LISA'08 and join the ZFS workshop.


Nice Article on Understanding Disk Reliability

Monday Jul 14, 2008

David Stone of eWeek published a nice article this week on understanding disk reliability. It covers in a short, concise manner many of the aspects of disks that I've blogged about here.  Enjoy.

Examining Disk Reliability Specs


This Ain't Your Daddy's JBOD

Wednesday Jul 09, 2008

This morning, we announced the newest Just a Bunch of Disks (JBOD) storage arrays. These are actually very interesting products from many perspectives. One thing for sure, these are not like any previous JBOD arrays we've ever made. The simple, elegant design and high reliability, availability, and serviceability (RAS) features are truly innovative. Let's take a closer look...

Your Daddy's JBOD

In the bad old days, JBOD arrays were designed around bus or loop architectures. Sun has been selling JBOD enclosures using parallel SCSI busses for more than 20 years. There were even a few years when fibre channel JBOD enclosures were sold. In many first generation systems, they were not what I would call high-RAS designs.  Often there was only one power supply or fan set, but that really wasn't what caused many outages. Placing a number of devices on a bus or loop exposes them to bus or loop failures. If you hang a parallel SCSI bus, you stop access to all devices on the bus.  The vast majority of parallel SCSI bus implementations used single-port disks. The A5000 fibre channel JBOD attempted to fix some of these deficiencies: redundant power supplies, dual-port disks, two fibre channel loops, and two fibre channel hubs. When introduced, it was expected that fibre channel disks would rule the enterprise. The reality was not quite so rosy. Loops still represent a shared, single point of failure in the system. A misbehaving host or disk could still lock up both loops, thus rendering this "redundant" system useless. Fortunately, the power of the market drove the system designs to the point where the fibre channel loops containing disks are usually hidden behind a special-purpose controller. By doing this, the design space can be limited and thus the failure modes can be more easily managed.  In other words, instead of allowing a large variety of hosts to be directly connected to a large variety of disks in a system where a host or disk could hang the shared bus or loop, array controllers divide and conquer this problem by reducing the possible permutations.

This is basically where the market was, yesterday.  Array controllers are very common, but they represent a significant cost. The costs increase for high-RAS designs because you need redundant controllers with multiple host ports, driving the costs up.

Suppose we could revisit the venerable JBOD, but using modern technology?  What would it look like?

SCSI, FC, ATA, SATA, and SAS

When we use the term SCSI, most people think of the old, parallel Small Computer System Interface bus. This was a parallel bus implemented in the bad old days with the technology available then. Wiggling wire speed was relatively slow, so bandwidth increases were achieved using more wires. It is generally faster to push photons through an optical fiber than to wiggle a wire, thus fibre channel (FC) was born. In order to leverage some of the previous software work, FC decided to expand the SCSI protocol to include a serial interface (and they tried to add a bunch of other stuff too, but that is another blog). When people say "fibre channel" they actually mean "serial SCSI protocol over optical fiber transport." Another development around this time was that the venerable Advanced Technology Attachment (ATA) disk interface used in many PCs was also feeling the strain of performance improvement and cost reductions.

Cost reductions?  Well, if you have more wires, then the costs will go up.  Connectors and cables get larger. From a RAS perspective, the number of failure opportunities increases. A (parallel) UltraSCSI implementation needs 68-pin connectors. Bigger connectors with big cables must be stronger and are often designed with lots of structural metal. Using fewer wires means that the connector and cables get smaller, reducing the opportunities for failure, reducing the strength requirements, and thus reducing costs.

Back to the story, the clever engineers said, well if we can use a protocol like SCSI or ATA over a fast, serial link, then we can improve performance, improve RAS, and reduce costs -- a good thing. A better thing was the realization that the same, low-cost physical interface (PHY) can be used for both serial SCSI and serial ATA (SATA). Today, you will find many host bus adapters (HBAs) which will support both SAS and SATA disks. After all, the physical connections are the same, it is just a difference in the protocol running over the wires.

One of the more interesting differences between SAS and SATA is that the SAS guys spend more effort on making the disks dual-ported.  If you look around, you will find single and dual-port SAS disks for sale, but rarely will you see a dual-port SATA disk (let me know if you find one).  More on that later... 

Switches, Yay!

Now that we've got a serial protocol, we can begin to think of implementing switches. In the bad old days, Ethernet was often implemented using coaxial cable, basically a 2-wire shared bus. All Ethernet nodes shared the same coax, and if there was a failure in the coax, everybody was affected. The next Ethernet evolution replaced the coax with point-to-point wires and hubs to act as a collection point for the point-to-point connections. From a RAS perspective, hubs acted similar to the old coax in that a misbehaving node on a hub could interfere with or take down all of the nodes connected to the hub. With the improvements in IC technology over time, hubs were replaced with more intelligent switches. Today, almost all Gigabit Ethernet implementations use switches -- I doubt you could find a Gigabit Ethernet hub for sale. Switches provide fault isolation and allow traffic to flow only between interested parties. SCSI and SATA followed a similar evolution. Once it became serial, like Ethernet, then it was feasible to implement switching.  RAS guys really like switches because in addition to the point-to-point isolation features, smart switches can manage access and diagnose connection faults.

J4200 and J4400 JBOD Arrays

Fast forward to 2008. We now have fast, switchable, redundant host-disk interconnect technology.  Let's build a modern JBOD. The usual stuff is already taken care of: redundant power supplies, redundant fans, hot-pluggable disks, rack-mount enclosure... done. The connection magic is implemented by a pair of redundant SAS switches. These switches contain an ARM processor and have intelligent management. They also permit the SATA Tunneling Protocol (STP) to move SATA protocol over SAS connections. These are often called SAS Expanders, and the LSISASx36 provides 36 ports for us to play with. SAS connections can be aggregated to increase effective bandwidth. For the J4200 and J4400, we pass 4 ports each to two hosts and 4 ports "downstream" to another J4200 or J4400. For all intents and purposes, this works like most other network switches. The result is that each host has a logically direct connection to each disk. Each disk is dual ported, so each disk connects to both SAS expanders. We can remotely manage the switches and disks, so replacing failed components is easy, even while maintaining service.  RAS goodness.

Dual-port SATA?

As I mentioned above, SATA disks tend to be single ported.  How do we connect to two different expanders? The answer is shown here. In the bill-of-materials (BOM) for SATA disks, you will notice a SATA Interposer card. This fits in the carrier for the disk and provides a multiplexor which will connect one SATA disk to two SAS ports. This is, in effect, what is built into a dual-port SAS disk. From a RAS perspective, this has little impact on the overall system RAS because the field replaceable unit (FRU) is the disk+carrier.  We don't really care if that FRU is a single-port SATA disk with an interposer or a dual-port SAS disk.  If it breaks, the corrective action is to replace the disk+carrier. Since each disk slot has point-to-point connections to the two expanders, replacing a disk+carrier is electrically isolated from the other disks.

What About Performance?

Another reason that array controllers are more popular than JBODs is that they often contain some non-volatile memory used for a write cache. This can significantly improve write-latency-sensitive applications. When Sun attempted to replace the venerable SPARCStorage Array 100 (SSA-100) with the A5000 JBOD, one of the biggest complaints was that performance was reduced. This is because the SSA-100 had a non-volatile write cache while the A5000 did not. The per-write latency difference was an order of magnitude. OK, so this time around does anyone really expect that we can replace an array controller with a JBOD?  The answer is yes, but I'll let you read about that plan from the big-wigs...


Postscript

I had intended to show some block diagrams here, but couldn't find any I liked that weren't tagged for internal use only.  If I find something later, I'll blog about it.



Awesome disk AFR! Or, is it...

Thursday Jun 26, 2008

I was hanging out in the ZFS chat room when someone said they were using a new Seagate Barracuda 7200.11 SATA 3Gb/s 1-TB Hard Drive. A quick glance at the technical specs revealed a reliability claim of 0.34% Annualized Failure Rate (AFR).  Holy smokes!  This is well beyond what we typically expect from disks.  Doubling the reliability would really make my day. My feet started doing a happy dance.

So I downloaded the product manual to get all of the gritty details. It looks a lot like most of the other large, 3.5" SATA drive specs out there, so far so good. I get to the Reliability Section (section 2.11, page 18) to look for more nuggets.

Immediately, the following raised red flags with me and my happy feet stubbed a toe.

The product shall achieve an Annualized Failure Rate (AFR) of 0.34% (MTBF of 0.7 million hours) when operated in an environment of ambient air temperatures of 25°C. Operation at temperatures outside the specifications in Section 2.8 may increase the product AFR (decrease MTBF). AFR and MTBF are population statistics that are not relevant to individual units.


AFR and MTBF specifications are based on the following assumptions for desktop personal computer environments:
• 2400 power-on-hours per year.
...


Argv! OK, here's what happened. When we design enterprise systems, we use AFR with a 24x7x365 hour year (8760 operation hours/year). A 0.34% AFR using an 8760-hour year is equivalent to an MTBF of roughly 2.6 million hours (really good for a disk). But the disk is spec'ed at 0.7 million hours, which, in my mind, is an AFR of 1.25%, or about half as reliable as enterprise disks. The way they get to the notion that an AFR of 0.34% equates to an MTBF of 0.7 million hours is by changing the definition of operation to 2,400 hours per year (300 8-hour days). The math looks like this:

    24x7x365 operation = 8760 hours/year (also called power-on-hours, POH)

    AFR = 100% * (POH / MTBF)

For an MTBF of 700,000 hours,

    AFR = 100% * (8760 / 700,000) = 1.25%

or, as Seagate specifies for this disk:

    AFR = 100% * (2400 / 700,000) = 0.34%
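For anyone who wants to plug in their own data-sheet numbers, here is a minimal sketch of the same arithmetic. The function names are mine, not from any tool, and the formula is the simple linear one used above.

```python
HOURS_PER_YEAR_24X7 = 24 * 365  # 8760 power-on hours per year


def afr_percent(mtbf_hours, poh_per_year=HOURS_PER_YEAR_24X7):
    """AFR (%) for a given MTBF and yearly power-on hours."""
    return 100.0 * poh_per_year / mtbf_hours


def mtbf_hours(afr_pct, poh_per_year=HOURS_PER_YEAR_24X7):
    """Invert the same formula: MTBF implied by an AFR (%)."""
    return 100.0 * poh_per_year / afr_pct


if __name__ == "__main__":
    print("Enterprise view:", round(afr_percent(700_000), 2), "%")
    print("Desktop view:   ", round(afr_percent(700_000, 2400), 2), "%")
    print("MTBF for 0.34% at 24x7:", round(mtbf_hours(0.34)), "hours")
```

The only thing that changes between the two views is the power-on-hours assumption, which is exactly why you have to read the fine print before comparing AFR claims.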

The RAS community has better luck explaining failure rates using AFR rather than MTBF. With AFR you can expect the failures to be a percentage of the population per year. The math is simple and intuitive.  MTBF is not very intuitive and causes all sorts of misconceptions. The lesson here is that AFR can mean different things to different people and can be part of the marketing games people play. For a desktop environment, a large population might see 0.34% AFR with this product (and be happy).  You just need to know the details when you try to compare with the enterprise environments.

Unrecoverable Error on Read (UER) rate is 1e-14 errors/bits read, which is a bit of a disappointment, but consistent with consumer disks.  Enterprise disks usually claim 1e-15 errors/bits read, by comparison. This worries me as the disks are getting bigger because of what it implies.  The product manual says that there is guaranteed to be at least 1,953,525,168 512 byte sectors available.

    Total bits = 1,953,525,168 sectors/disk * 512 bytes/sector * 8 bits/byte = 8e12 bits/disk

If the UER is 1e-14 errors/bits read then you can expect an unrecoverable read error once every 12.5 times you read the entire disk. Not a very pleasant thought, even if you are using a file system which can detect such errors, like ZFS.  Fortunately, field failure data tends to see a better UER than the manufacturers claim.  If you are worried about this sort of thing, I recommend using ZFS.
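The same calculation as a minimal sketch, so you can try other disk sizes or the enterprise 1e-15 figure. The function names are mine. The last function goes one step beyond the text: it assumes bit errors are independent and uses a Poisson approximation to estimate the chance of hitting at least one error in a single full read. That independence is an assumption, not something the data sheet promises.

```python
import math

SECTORS_PER_DISK = 1_953_525_168  # guaranteed sectors from the product manual
BYTES_PER_SECTOR = 512


def bits_per_disk(sectors=SECTORS_PER_DISK, bytes_per_sector=BYTES_PER_SECTOR):
    """Total number of bits read when the whole disk is read once."""
    return sectors * bytes_per_sector * 8


def full_reads_per_error(uer, bits):
    """Average number of whole-disk reads per unrecoverable read error."""
    return 1.0 / (uer * bits)


def p_error_in_one_full_read(uer, bits):
    """Chance of at least one error in one full read (Poisson assumption)."""
    return -math.expm1(-uer * bits)


if __name__ == "__main__":
    bits = bits_per_disk()
    for uer in (1e-14, 1e-15):
        print(uer, full_reads_per_error(uer, bits),
              p_error_in_one_full_read(uer, bits))
```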

All-in-all, this looks like a nice disk for desktop use. But you should know that in enterprise environments we expect much better reliability specifications.


on /var/mail and quotas

Friday Jun 06, 2008

About every other month or so, someone comes onto the ZFS forum, complains about quotas, and holds up the shared /var/mail directory as an example of where UFS quotas are superior to ZFS quotas. This is becoming very irritating as it makes an assumption about /var/mail which we proved doesn't scale decades ago.  Rather than trying to respond explaining this again and again, I'm blogging about it.  Enjoy.

When we started building large e-mail servers using sendmail in the late 1980s, we ran right into the problem of scaling the mail delivery directory.  Recall that back then relatively few people were on the internet or using e-mail, a 40MHz processor was leading edge, a 200 MByte hard disk was just becoming affordable, RAID was mostly a white paper, and e-mail attachments were very uncommon. It is often limited resources which cause scaling problems, and putting thousands of users into a single /var/mail quickly exposes issues.

Many sites implemented quotas during that era, largely because of the high cost and relative size of hard disks.  The computing models were derived from the timeshare systems (eg UNIX) and that model was being stretched as network computing was evolving (qv rquotad).  A common practice for Sun sites was to mount /var/mail on the NFS clients so that the mail clients didn't have to know anything about the network.

As we scaled, the first, obvious change was to centralize the /var/mail directory. This allowed you to implement a site-wide mail delivery where you could send mail to user@some.place.edu instead of user@workstation2.lab301.some.place.edu.   This is a cool idea and worked very well for many years. But it wasn't the best solution.

As we scaled some more, and the "administration" demanded quotas, we found that the very nature of distributed systems didn't match the quota model very well. For example, the "administration's" view was that a user may be given a quota of Q for the site. But the site now had many different file systems and a quota only really works on a single file system. We had already centralized everyone onto a single mail store and you needed some quota for the home directory and another subset of Q for the mail store. You also had to try and limit the quota on other home directories because the clever users would discover where the quotas weren't and use all of the space. Back at the mail store, it became increasingly more difficult to manage the space because, as everybody knows, the managers never delete e-mail and they complain loudly when they run out of space. So, quotas in a large, shared directory don't work very well.

The next move was to deliver mail into the user's home directory. This is trivially easy to set up in sendmail (now). In this model, the quota only needs to be set by the user in their home directory and when they run out, you can do work. This solution bought another few years of scalability, but still has its limitations. A particularly annoying limitation is that sending mail to someone who is over quota is not handled very well. And if the sys-admins use mail to tell people they are near quota, then it might not be deliverable (recall, managers don't delete e-mail :-)

There is also a potential problem with mail bombs.  In the sendmail model, each message was copied to each user's mailbox. In the old days, you could implement a policy where sendmail would reject mail messages of a large size. You can still do that today, but before attachments you could put the limit at something small, say 100 kBytes.  There is no way you can do that today. So a mischievous user could send a large mail message to everyone and blow out the /var/mail directory or the quotas.

A better model is to have only one copy of an e-mail message and just use pointers for each of the recipients. But while this model can save large amounts of disk space, it is not compatible with quotas because there is no good way to assign the space to a given user.
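To make the accounting problem concrete, here is a minimal sketch of a single-copy message store. It is a toy, not how any particular mail server works: the class name, the methods, and the use of a content hash as the message identifier are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict


class SingleInstanceStore:
    """Toy mail store: one copy of each body, per-user lists of pointers."""

    def __init__(self):
        self._bodies = {}                    # message id -> message bytes
        self._mailboxes = defaultdict(list)  # user -> list of message ids

    def deliver(self, body, recipients):
        """Store the body once and add a pointer to each recipient's mailbox."""
        msg_id = hashlib.sha256(body).hexdigest()
        self._bodies.setdefault(msg_id, body)
        for user in recipients:
            self._mailboxes[user].append(msg_id)
        return msg_id

    def bytes_stored(self):
        """Space actually consumed by message bodies."""
        return sum(len(body) for body in self._bodies.values())

    def bytes_visible_to(self, user):
        """Space the user would be charged if each pointer counted in full."""
        return sum(len(self._bodies[m]) for m in self._mailboxes[user])


if __name__ == "__main__":
    store = SingleInstanceStore()
    users = [f"user{i}" for i in range(1000)]
    store.deliver(b"x" * 1_000_000, users)
    print("stored: ", store.bytes_stored())
    print("charged:", sum(store.bytes_visible_to(u) for u in users))
```

One message sent to 1000 users takes the space of one message, but charging every recipient for the full size accounts for 1000 times that. Charge nobody, charge the sender, or split it evenly: none of those choices maps onto a per-user file system quota.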

The next problem to be solved was the clients. Using an NFS mounted /var/mail worked great for UNIX users, but didn't work very well for PCs (which were now becoming network citizens). The POP and IMAP protocols fixed this problem.

Today mail systems can scale to millions of users, but not by using a shared file system or file system quotas. In most cases, there is a database which contains info on the user and their messages. The messages themselves are placed in a database of sorts and there is usually only one copy of the message. Mail quotas can be easily implemented and the mailer can reply to a sender explaining that the recipient is over mail quota, or whatever.  Automation sends a user a near-quota warning message.  But this is not implemented via file system quotas.

So, please, if you want to describe shared space and file system quotas, find some other example than mail. If you can't find an example, then perhaps we can drop the whole quota argument altogether.

If your "administration" demands that you implement quotas, then you have my sympathy.  Just remind them that you probably have more space in your pocket than quota on the system...


RAS in the T5140 and T5240

Wednesday Apr 09, 2008

Today, Sun introduced two new CMT servers, the Sun SPARC Enterprise T5140 and T5240 servers.

I'm really excited about this next stage of server development. Not only have we effectively doubled the performance capacity of the system, we did so without significantly decreasing the reliability. When we try to predict reliability of products which are being designed, we make those predictions based on previous generation systems. At Sun, we make these predictions at the component level. Over the years we have collected detailed failure rate data for a large variety of electronic components as used in the environments often found at our customer sites. We use these component failure rates to determine the failure rate of collections of components. For example, a motherboard may have more than 2,000 components: capacitors, resistors, integrated circuits, etc. The key to improving motherboard reliability is, quite simply, to reduce the number of components. There is some practical limit, though, because we could remove many of the capacitors, but that would compromise signal integrity and performance -- not a good trade-off. The big difference in the open source UltraSPARC T2 and UltraSPARC T2plus processors is the high level of integration onto the chip. They really are systems on a chip, which means that we need very few additional components to complete a server design. Fewer components means better reliability, a win-win situation. On average, the T5140 and T5240 only add about 12% more components over the T5120 and T5220 designs. But considering that you get two or four times as many disks, twice as many DIMM slots, and twice the computing power, this is a very reasonable trade-off.
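The idea that fewer components means a lower failure rate can be shown with a minimal sketch. It assumes a simple series model with constant failure rates, where any one component failure fails the board, so the rates simply add. The part counts and FIT values (failures per billion device-hours) below are made up for illustration and are not Sun's data, and the function names are mine.

```python
HOURS_PER_YEAR = 8760  # 24x7 operation


def board_fit(parts):
    """Sum component failure rates for a series model.

    parts maps a part name to (count, FIT per part).
    """
    return sum(count * fit for count, fit in parts.values())


def afr_percent_from_fit(fit):
    """Convert a FIT rate (failures per 1e9 hours) to an AFR in percent."""
    return 100.0 * fit * HOURS_PER_YEAR / 1e9


if __name__ == "__main__":
    # Hypothetical bill of materials, for illustration only.
    baseline = {
        "capacitors": (1200, 0.5),
        "resistors": (700, 0.2),
        "integrated circuits": (100, 10.0),
    }
    # The same board with about 12% more of every part type.
    bigger = {name: (round(count * 1.12), fit)
              for name, (count, fit) in baseline.items()}
    for label, parts in (("baseline", baseline), ("+12% parts", bigger)):
        fit = board_fit(parts)
        print(label, fit, "FIT,", afr_percent_from_fit(fit), "% AFR")
```

In this model the predicted failure rate grows roughly in step with the part count, which is why a 12% increase in components for twice the computing power is a good trade.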

Let's take a look at the system block diagram to see where all of the major components live.



You will notice that the two PCI-e switches are peers and not cascaded. This allows good flexibility and fault isolation. Compared to the cascaded switches in the T5120 and T5220 servers, this is a simpler design. Simple is good for RAS.

You will also notice that we use the same LSI1068E SAS/SATA controller with onboard RAID. The T5140 is limited to 4 disk bays, but the T5240 can accommodate 16 disk bays. This gives plenty of disk targets for implementing a number of different RAID schemes. I recommend at least some redundancy, dual parity if possible.

Some people have commented that the Neptune Ethernet chip, which provides dual-10Gb Ethernet or quad-1Gb Ethernet interfaces is a single point of failure. There is also one quad GbE PHY chip. The reason the Neptune is there to begin with is because when we implemented the coherency links in the UltraSPARC T2plus processor we had to sacrifice the builtin Neptune interface which is available in the UltraSPARC T2 processor. Moore's Law assures us that this is a somewhat temporary condition and soon we'll be able to cram even more transistors onto a chip. This is a case where high integration is apparent in the packaging. Even though all four GbE ports connect to a single package, the electronics inside the package are still isolated. In other words, we don't consider the PHY to be a single point of failure because the failure modes do not cross the isolation boundaries. Of course, if your Ethernet gets struck by lightning, there may be a lot of damage to the server, so there is always the possibility that a single event will create massive damage. But for the more common cabling problems, the system offers suitable isolation. If you are really paranoid about this, then you can purchase a PCI-e card version of the Neptune and put it in PCI-e slot 1, 2, or 3 to ensure that it uses the other PCI-e switch.

The ILOM service processor is the same as we use in most of our other small servers and has been a very reliable part of our systems. It is connected to the rest of the system through an FPGA which manages all of the service bus connections. This allows the service processor to be the serviceability interface for the entire server.

The server also uses ECC FB-DIMMs with Extended ECC, which is another common theme in Sun servers. We have recently been studying the effects of Solaris Fault Management Architecture and Extended ECC on systems in the field and I am happy to report that this combination provides much better system resiliency than is possible through the individual features. In RAS, the whole can be much better than the sum of the parts.

For more information on the RAS features of the new T5140 and T5240 servers, see the white paper, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140 and T5240 Servers. The white paper has results of our RAS benchmarks as well as some performability calculations.




more on holey files

Tuesday Apr 08, 2008

My colleague Christine asked me some questions about my holey files posts. These are really good questions, and I'm just a little surprised that more people didn't ask them... hey, that is what the comments section is for!  So, I thought I would reply publicly, to help stimulate some conversation.

Q1. How could you have a degraded pool and data corruption w/o a repair?  I assume this pool must be raidz or mirror.

A1. No, this was a simple pool, not protected at the pool level. I used the ZFS copies parameter to set the number of redundant data copies to 2. For more information on how copies works, see my post with pictures.

There is another, hidden question here.  How did I install Indiana such that it uses copies=2? By opening a shell and becoming root prior to beginning the install, I was able to set the copies=2 property just after the storage pool was created. By default, it gets inherited by any subsequent file system creation.  Simple as that.  OK, so it isn't that simple.  I've also experimented with better ways to intercept the zpool create, but am not really happy with my hacks thus far.  A better solution is for the installer to pick up a set of properties, but it doesn't, at least for now.

Q2.  Can a striped pool be in a degraded state?  Wouldn't a device faulting in that pool render it unusable and therefore faulted?

A2. Yes, a striped storage pool can be in a degraded state. To understand this, you need to know the definitions of DEGRADED and FAULTED.  Fortunately, they are right there in the zpool manual page.

 

DEGRADED

One or more top-level vdevs is in the degraded state because one or more component devices are offline. Sufficient replicas exist to continue functioning.

...

FAULTED

One or more top-level vdevs is in the faulted state because one or more component devices are offline. Insufficient replicas exist to continue functioning.

...

By default, there are multiple replicas, so for a striped volume it is possible to be in a DEGRADED state. However, I expect that the more common case will be a FAULTED state. In other words, I do tend to recommend a more redundant storage pool: mirror, raidz, raidz2. 

Q3. What does filling the corrupted part with zero do for me?  It doesn't fix it, those bits weren't zero to begin with.

A3. Filling with zeros will just make sure that the size of the "recovered" file is the same as the original. Some applications get to data in a file via a seek to an offset (random access), so this is how you would want to recover the file.  For applications which process files sequentially, it might not matter.



dd tricks for holey files

Thursday Mar 13, 2008

Bob Netherton took a look at my last post on corrupted file recovery (?) and asked whether I had considered using the noerror option to dd. Yes, I did experiment with dd and the noerror option.

The noerror option is described in dd(1) as:

    noerror Does not stop processing on an input error.
            When an input error occurs, a diagnostic message
            is written on standard error, followed by the
            current input and output block counts in the
            same format as used at completion. If the sync
            conversion is specified, the missing input is
            replaced with null bytes and processed normally.
            Otherwise, the input block will be omitted from
            the output.

This looks like the perfect solution, rather than my dd and iseek script. But I didn't post this because, quite simply, I don't really understand what I get out of it.

Recall that I had a corrupted file which is about 2.8 MBytes (2,984,368 bytes) in size. Somewhere around 1.1 MBytes into the file, the data is corrupted and fails the ZFS checksum test.

# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:
           /mnt/root/lib/amd64/libc.so.1
# ls -ls /mnt/root/lib/amd64/libc.so.1
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/libc.so.1

I attempted to use dd with the noerror flag using several different block sizes to see what I could come up with. Here are those results:

# for i in 1k 8k 16k 32k 128k 256k 512k
> do
> dd if=libc.so.1 of=/tmp/whii.$i bs=$i conv=noerror
> done
read: I/O error
1152+0 records in
1152+0 records out
...
grond# ls -ls /tmp/whii*
3584 -rw-r--r-- 1 root root 1835008 Mar 13 11:27 /tmp/whii.128k
2464 -rw-r--r-- 1 root root 1261568 Mar 13 11:27 /tmp/whii.16k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:27 /tmp/whii.1k
4608 -rw-r--r-- 1 root root 2359296 Mar 13 11:27 /tmp/whii.256k
2624 -rw-r--r-- 1 root root 1343488 Mar 13 11:27 /tmp/whii.32k
7168 -rw-r--r-- 1 root root 3670016 Mar 13 11:27 /tmp/whii.512k
2384 -rw-r--r-- 1 root root 1220608 Mar 13 11:27 /tmp/whii.8k

hmmm... all of these files are of different sizes, so I'm really unsure what I've ended up with. None of them are the same size as the original file, which is a bit unexpected.

# dd if=libc.so.1 of=/tmp/whaa.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
read: I/O error
1153+0 records in
1153+0 records out
read: I/O error
1154+0 records in
1154+0 records out
read: I/O error
1155+0 records in
1155+0 records out
read: I/O error
1156+0 records in
1156+0 records out
read: I/O error
1157+0 records in
1157+0 records out
# ls -ls /tmp/whaa.1k
2320 -rw-r--r-- 1 root root 1184768 Mar 13 11:12 /tmp/whaa.1k

hmmm... well, dd did copy some of the file, but seemed to give up after around 5 attempts and I only seemed to get the first 1.1 MBytes of the file. What is going on here? A quick look at the dd source (open source is a good thing) shows that there is a definition of BADLIMIT which is how many times dd will try before giving up. The default compilation sets BADLIMIT to 5. Aha! A quick download of the dd code and I set BADLIMIT to be really huge and tried again.

# bigbaddd if=libc.so.1 of=/tmp/whbb.1k bs=1k conv=noerror
read: I/O error
1152+0 records in
1152+0 records out
...
read: I/O error
3458+0 records in
3458+0 records out
^C I give up
# ls -ls /tmp/whbb.1k
6920 -rw-r--r-- 1 root root 3543040 Mar 13 11:47 /tmp/whbb.1k

As dd processes the input file, it doesn't really do a seek, so it can't really get past the corruption. It is getting something, because od shows that the end of the whbb.1k file is not full of nulls. But I really don't believe this is the data in a form which could be useful. And I really can't explain why the new file is much larger than the original. I suspect that dd gets stuck at the corrupted area and does not seek beyond it. In any case, it appears that letting dd do the dirty work by itself will not achieve the desired results. This is, of course, yet another opportunity...


Holy smokes! A holey file!

Wednesday Mar 12, 2008

I was RASing around with ZFS the other day, and managed to find a file which was corrupted.

# zpool scrub zpl_slim
# zpool status -v zpl_slim
  pool: zpl_slim
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h2m with 1 errors on Tue Mar 11 13:12:42 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zpl_slim    DEGRADED     0     0     9
          c2t0d0s0  DEGRADED     0     0     9

errors: Permanent errors have been detected in the following files:
                /mnt/root/lib/amd64/libc.so.1

# ls -ls /mnt/root/lib/amd64/libc.so.1
4667 -rwxr-xr-x 1 root bin 2984368 Oct 31 18:04 /mnt/root/lib/amd64/libc.so.1

argv! Of course, this particular file is easily extracted from the original media; it doesn't contain anything unique. For those who might be concerned that it is the C runtime library, and thus very critical to running Solaris, the machine in use is only 32-bit, so the 64-bit (amd64) version of this file is never used. But suppose this were an important file for me and I wanted to recover something from it? This is a more interesting challenge...

First, let's review a little bit about how ZFS works. By default, when ZFS writes anything, it generates a checksum which is recorded someplace else, presumably safe. Actually, the checksum is recorded at least twice, just to be doubly sure it is correct. And that record is also checksummed. Back to the story, the checksum is computed on a block, not for the whole file. This is an important distinction which will come into play later. If we perform a storage pool scrub, ZFS will find the broken file and report it to you (see above), which is a good thing -- much better than simply ignoring it, like many other file systems will do.
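To make the per-block idea concrete, here is a minimal sketch in plain shell. It is only an illustration, not how ZFS does it: block_sums is a made-up helper name, cksum (a CRC) stands in for the checksums ZFS really uses, and the 128 kByte default block size is an assumption. The point is simply that a checksum per block tells you which block is bad, where a single checksum for the whole file could only tell you that something, somewhere, is wrong.

```shell
# block_sums FILE [BLOCKSIZE]
# Hypothetical helper: print "block-index checksum" for each block of FILE.
# cksum is used purely for illustration; it is not the ZFS checksum.
block_sums() {
    file=$1; bs=${2:-131072}
    # Field 5 of "ls -ln" is the file size in bytes.
    size=$(ls -ln "$file" | awk '{print $5}')
    nblocks=$(( (size + bs - 1) / bs ))
    i=0
    while [ "$i" -lt "$nblocks" ]; do
        # Read exactly one block at offset i*bs and checksum it.
        sum=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null |
              cksum | awk '{print $1}')
        echo "$i $sum"
        i=$((i + 1))
    done
}
```

Running block_sums on two copies of a file and passing the two listings to diff would show only the lines for the blocks that differ.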

OK, so we know that somewhere in the midst of this 2.8 MByte file, we have some corruption. But can we at least recover the bits that aren't corrupted? The answer is yes. But if you try a copy, then it bails with an error.

# cp /mnt/root/lib/amd64/libc.so.1 /tmp
/mnt/root/lib/amd64/libc.so.1: I/O error

Since the copy was not successful, there is no destination file, not even a partial file. It turns out that cp uses mmap(2) to map the input file and copies it to the output file with a big write(2). Since the write doesn't complete correctly, it complains and removes the output file. What we need is something less clever, dd.

# dd if=/mnt/root/lib/amd64/libc.so.1 of=/tmp/whee
read: I/O error
2304+0 records in
2304+0 records out
# ls -ls /tmp/whee
2304 -rw-r--r-- 1 root root 1179648 Mar 12 18:53 /tmp/whee

OK, from this experiment we know that we can get about 1.1 MBytes (1,179,648 bytes) by directly copying with dd. But this isn't all, or even half, of the file. We can get a little more clever than that. To make it simpler, I wrote a little ksh script:

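#!/bin/ksh
# Copy each of the first 23 128-kByte blocks of file $1 to its own
# output file ($2.00, $2.01, ...), so that one unreadable block does
# not stop the rest of the copy.
integer i=0
while ((i < 23))
do
    # j is i as a two-digit, zero-padded number (00, 01, ...)
    typeset -RZ2 j=$i
    # iseek skips $i input blocks before reading one block
    dd if="$1" of="$2.$j" bs=128k iseek=$i count=1
    i=i+1
done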

This script will write each of the first 23 128kByte blocks from the first argument (a file) to a unique filename as a number appended to the second argument. dd is really dumb and doesn't offer much error handling, which is why I hardwired the count into the script. An enterprising soul with a little bit of C programming skill could do something more complex which handles the more general case. OK, that was difficult to understand, and I wrote it. To demonstrate, I first apologize for the redundant verbosity:

# ./getaround.ksh libc.so.1 /tmp/zz
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
read: I/O error
0+0 records in
0+0 records out

1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
1+0 records in
1+0 records out
0+1 records in
0+1 records out
# ls -ls /tmp/zz.*
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.00
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.01
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.02
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.03
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.04
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.05
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.06
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.07
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.08
   0 -rw-r--r-- 1 root root      0 Mar 12 19:00 /tmp/zz.09
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.10
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.11
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.12
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.13
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.14
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.15
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.16
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.17
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.18
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.19
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.20
 256 -rw-r--r-- 1 root root 131072 Mar 12 19:00 /tmp/zz.21
 200 -rw-r--r-- 1 root root 100784 Mar 12 19:00 /tmp/zz.22

So we can clearly see that the 10th (128kByte) block is corrupted, but the rest of the blocks are ok. We can now reassemble the file with a zero-filled block.

# dd if=/dev/zero of=/tmp/zz.09 bs=128k count=1
1+0 records in
1+0 records out
# cat /tmp/zz.* > /tmp/zz
# ls -ls /tmp/zz
5832 -rw-r--r-- 1 root root 2984368 Mar 12 19:03 /tmp/zz

Now I have recreated the file with a zero-filled hole where the data corruption was. Just for grins, if you try to compare with the previous file, you should get what you expect.

# cmp libc.so.1 /tmp/zz
cmp: EOF on libc.so.1
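
The steps above (split into blocks, find the bad one, zero-fill, reassemble) can be rolled into one pass. What follows is a rough sketch under stated assumptions, not the script I actually ran: salvage is a made-up name, the block count is worked out from the file size instead of being hardwired, and it uses the portable skip= and seek= operands in place of the Solaris iseek= spelling. It also assumes that dd exits non-zero when a block cannot be read.

```shell
# salvage SRC DST [BLOCKSIZE]
# Hypothetical helper: copy SRC to DST one block at a time, writing zeros
# in place of any block that dd cannot read.
salvage() {
    src=$1; dst=$2; bs=${3:-131072}
    # Take the size from the directory entry; reading the file could fail.
    size=$(ls -ln "$src" | awk '{print $5}')
    nblocks=$(( (size + bs - 1) / bs ))
    : > "$dst"
    i=0
    while [ "$i" -lt "$nblocks" ]; do
        # conv=notrunc keeps the blocks already written to DST.
        if ! dd if="$src" of="$dst" bs="$bs" skip="$i" seek="$i" \
                count=1 conv=notrunc 2>/dev/null; then
            # Unreadable block: fill with zeros, trimmed for a short last block.
            len=$(( size - i * bs ))
            [ "$len" -gt "$bs" ] && len=$bs
            dd if=/dev/zero bs="$len" count=1 2>/dev/null |
                dd of="$dst" bs="$bs" seek="$i" conv=notrunc 2>/dev/null
            echo "block $i unreadable, zero-filled" >&2
        fi
        i=$((i + 1))
    done
}
```

Something like `salvage libc.so.1 /tmp/zz` would then stand in for the script, the zero-fill dd, and the cat. The block size should match the size of the blocks being checksummed (128 kBytes here), otherwise more data than necessary gets zeroed.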

How is this useful?

Personally, I'm not sure this will be very useful for many corruption cases. As a RAS guy, I advocate many verified copies of important data placed on diverse systems and media. But most folks aren't so inclined. Every time we talk about this on the zfs-discuss alias, somebody will say that they don't care about corruption in the middle of their mp3 files. I'm no audiophile, but I prefer my mp3s to be hole-less. So I did this little exercise to show how you can regain full access to the non-corrupted bits of a corrupted file in a more-or-less easy way. Consider this a proof of concept. There are many possible variations, such as filling with spaces instead of nulls when you are missing parts of a text file -- opportunities abound.


Big Clusters and Deferred Repair

Wednesday Feb 20, 2008

When we build large clusters, such as high performance clusters or any cluster with a large number of computing nodes, we begin to look in detail at the repair models for the system. You are probably aware of the need to study power usage, air conditioning, weight, system management, networking, and cost for such systems. So you are also aware of how multiplying the environmental needs of one computing node times the number of nodes can become a large number. This can be very intuitive for most folks. But availability isn't quite so intuitive. Deferred repair models can also affect the intuition of the design. So, I thought that a picture would help show how we analyze the RAS characteristics of such systems and why we always look to deferred repair models in their design.

To begin, we have to make some assumptions:

  • The availability of the whole is not interesting.  The service provided by a big cluster is not dependent on all parts being functional. Rather, we look at it like a swarm of bees. Each bee can be busy, and the whole swarm can contribute towards making honey, but the loss of a few bees (perhaps due to a hungry bee eater) doesn't cause the whole honey producing process to stop. Sure, there may be some components of the system which are more critical than others, like the queen bee, but work can still proceed forward even if some of these systems are temporarily unavailable (the swarm will create new queens, as needed). This is a very different view than looking at the availability of a file service, for example.
  • The performability might be interesting. How many dead bees can we have before the honey production falls below our desired level? But for very, very large clusters, the performability will be generally good, so a traditional performability analysis is also not very interesting. It is more likely that a performability analysis of the critical components, such as networking and storage, will be interesting. But the performability of thousands of compute nodes will be less interesting.
  • Common root cause failures are not considered. If a node fails, the root cause of the failure is not common to other nodes. A good example of a common root cause failure is loss of power -- if we lose power to the cluster, all nodes will fail. Another example is software -- a software bug which causes the nodes to crash may be common to all nodes.
  • What we will model is a collection of independent nodes, each with their own, independent failure causes.  Or just think about bees.
For a large number of compute nodes, even using modern, reliable designs, we know that the probability of all nodes being up at the same time is quite small. This is obvious if we look at the simple availability equation:
Availability = MTBF / (MTBF + MTTR)

where, MTBF (mean time between failure) is MTBF[compute node]/N[nodes]
and, MTTR (mean time to repair) is > 0

The killer here is N. As N becomes large (thousands) and MTTR is dependent on people, then the availability becomes quite small. The time required to repair a machine is included in the MTTR. So as N becomes large, there is more repair work to be done. I don't know about you, but I'd rather not spend my life in constant repair mode, so we need to look at the problem from a different angle.
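
To put some numbers on this, here is a small sketch of the equation above as a shell function that hands the arithmetic to awk. The function name and all of the input values are invented for illustration (a 50,000 hour node MTBF is an assumption, not a measured figure); only the formula comes from the text.

```shell
# availability NODE_MTBF_HOURS N_NODES MTTR_HOURS
# Prints MTBF / (MTBF + MTTR), where MTBF = node MTBF / number of nodes,
# i.e. the probability that all N nodes are up at the same time.
availability() {
    awk -v mtbf="$1" -v n="$2" -v mttr="$3" 'BEGIN {
        sys_mtbf = mtbf / n                      # MTBF of the collection
        printf "%.4f\n", sys_mtbf / (sys_mtbf + mttr)
    }'
}

# Example calls (inputs are made up):
#   availability 50000 1 24      # a single node, repaired within a day
#   availability 50000 2000 24   # 2000 such nodes, same repair time
#   availability 50000 2000 168  # 2000 nodes, repairs once a week
```

With these assumed inputs the single node is up well over 99% of the time, while the 2000-node collection is all-up only about half the time with one-day repairs, and far less with weekly repairs.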

If we make MTTR large, then the availability will drop to near zero. But if we have some spare compute nodes, then we might be able to maintain a specified service level. Or, from a practical perspective, we could ask the question, "how many spare compute nodes do I need to keep at least M compute nodes operational?" The next, related question is, "how often do we need to schedule service actions?" To solve this problem, we need a model.

Before I dig into the model results, I want to digress for a moment and talk about Mean Time Between Service (MTBS) and Mean Time Between System Interruption (MTBSI).  I've blogged in detail about these before, but to put their use in context here, we will actually use MTBSI and not MTBF for the model.  Why? Because if a compute node has any sort of redundancy (ECC memory, mirrored disks, etc.) then the node may still work after a component has failed. But we want to model our repair schedule based on how often we need to fix nodes, so we need to look at how often things break for two cases. The models will show us those details, but I won't trouble you with them today.

The figure below shows a proposed 2000+ node HPC cluster with two different deferred repair models. For one solution, we use a one week (168 hour) deferred repair time. For the other solution, we use a two week deferred repair time. I could show more options, but these two will be sufficient to provide the intuition for solving such mathematical problems.

Deferred Repair Model Results 

We build a model showing the probability that some number of nodes will be down. The OK state is when all nodes are operational. It is very clear that the longer we wait to repair the nodes, the less probable it is that the cluster will be in the OK state. I would say that with a two week deferred maintenance model, there is nearly zero probability that all nodes will be operational. Looking at this another way, if you want all nodes to be available, you need to have a very, very fast repair time (MTTR approaching 0 time). Since fast MTTR is very expensive, accepting a deferred repair and using spares is usually a good cost trade-off.

OK, so we're convinced that a deferred repair model is the way to go, so how many spare compute nodes do we need? A good way to ask that question is, "how many spares do I need to ensure that there is a 95% probability that I will have a minimum of M nodes available?" From the above graph, we would accumulate the probability until we reached the 95% threshold. Thus we see that for the one week deferred repair case, we need at least 8 spares and for the two week deferred repair case we need at least 12 spares. Now this is something we can work with.
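
The "accumulate the probability until we reach the threshold" step can be sketched in a few lines. To be clear about the assumptions: this is not the model behind the figure. It treats node interruptions during one repair interval as a Poisson process with rate N x interval / MTBSI, which is my simplification for illustration, and the function name and example inputs are made up, so it should not be expected to reproduce the 8 and 12 spares quoted above.

```shell
# spares_needed N_NODES NODE_MTBSI_HOURS INTERVAL_HOURS TARGET_PROBABILITY
# Prints the smallest k such that P(failures in one interval <= k) >= target,
# assuming (for illustration only) Poisson-distributed failures.
spares_needed() {
    awk -v n="$1" -v mtbsi="$2" -v t="$3" -v target="$4" 'BEGIN {
        lambda = n * t / mtbsi     # expected failures per repair interval
        p = exp(-lambda)           # probability of exactly 0 failures
        cum = p                    # cumulative probability so far
        k = 0
        while (cum < target && k < n) {
            k++
            p = p * lambda / k     # Poisson recurrence: P(k) from P(k-1)
            cum += p
        }
        print k
    }'
}

# Example calls (inputs are made up):
#   spares_needed 2000 84000 168 0.95   # one week between service visits
#   spares_needed 2000 84000 336 0.95   # two weeks between service visits
```

Even this crude version shows the shape of the trade-off: doubling the repair interval raises the number of spares needed, but by less than a factor of two.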

The model results will change based on the total number of compute nodes and their MTBSI. If you have more nodes, you'll need more spares. If you have more reliable or redundant nodes, you need fewer spares. If we know the reliability of the nodes and their redundancy characteristics, we have models which can tell you how many spares you need.

This sort of analysis also lets you trade off the redundancy characteristics of the nodes to see how that affects the system, too. For example, we could look at the effect of zero, one, or two disks (mirrored) per node on the service levels. I personally like the zero disk case, where the nodes boot from the network, and we can model such complex systems quite easily, too. This point should not be underestimated: as you add redundancy to increase the MTBSI, you also add components that can fail, which decreases the MTBS and impacts your service costs.  The engineer's life is a life full of trade-offs.

 

In conclusion, building clusters with lots of nodes (red shift designs) requires additional analysis beyond what we would normally use for critical systems with few nodes (blue shift designs). We often look at service costs using a deferred service interval and how that affects the overall system service level. We also look at the trade-offs between per-node redundancy and the overall system service level. With proper analysis, we can help determine the best performance and best cost for large, red shift systems.

 

 


Freak Valentine's Day Snowstorm

Sunday Feb 17, 2008

Every once in a while, they get it wrong.  Very wrong.  As a rancher, I tend to pay attention to the weather report. Though it doesn't rain very often in Southern California, it can still ruin your day, or at least make ranch chores a messy endeavor. This week had been a more typical week, mostly sunny, highs in the 60s-70s, lows in the 40s, last week's rains a distant memory. Today's forecast was more of the same, with a slight chance of drizzle in the morning as a cold front passed. "No big deal!" claimed meteorologist John Coleman. So, when morning came with a light sprinkle, we weren't really surprised. If it drizzles down at Lindbergh Field, where the official San Diego weather is measured, it might sprinkle up here in the mountains. No big deal. 

By lunchtime I figured we'd had about 1/4 of an inch of rain, and I was beginning to wonder when the sun would break through the clouds and bring the promised 70 degrees of sunshine. Alas, it was still mostly cloudy. Regina went into town to run some errands, while I joined a conference call.  During the call, I noticed that the wind was picking up, mostly from the northeast.  Normally when the winds blow from the northeast, off the deserts, they are dry and will clear up any fog or drizzle rather quickly. But I noticed that during the conference call, it sounded like hail was hitting the window.

Then the lightning started.  OK, that was odd.  Sure we do get a thunderstorm every once in a while, and pea-sized hail often accompanies them. The wind was blowing stronger now and I was beginning to think that the drizzle forecast was a bit optimistic. One hour on the conference call down, hopefully we'll wrap up soon.

Suddenly, Regina burst into the office amongst a flurry of snow and ice, looking like Nanook coming in from a blizzard.
What the...?

"Hi sweetie!  Is it hail?"

"No! It is snowing and icing and I had to park down at the barn and walk up the hill to the house!"

Snow?!?  Sure enough, behind Regina it looked rather... white.  How can this be?  Forecast partly sunny, 70s.

After the call, I trudged outside to see what was up.  Sure enough, snow everywhere.  The wind was howling, and more snow was coming.  Absolutely no sign of the sun.  Rats!  I don't even like snow!

A quick look towards the highway confirmed that everything was falling apart. The few intrepid travelers were trying to negotiate the curves without kissing the boulders, and I knew my plans were dashed.  I had everything worked out well in advance. Conference call after lunch. Regina off running errands.  A quick dash into town to pick up the Valentine's Day flowers and gift.  Swing by the grocery for some fresh seafood and a nice bottle of wine. Dinner was going to be awesome, followed by sweet kisses. Now this. Snow!  If I wanted to live where it snowed, I would live somewhere else.  In the eight years here at the ranch, we'd only seen a few dustings at this altitude, nothing that would stick. It was nearly 70 degrees yesterday, there is no way this would stick, or so I hoped.

Now, I had to work on plan B. As a RAS guy, I always have a plan B and plan C, just in case, with a plan D for dire emergencies. We started the evening chores early, even though it was still snowing and blowing. By dusk it had mostly stopped snowing at the ranch as the storm passed to the south. I took a picture of Swanson, our Black Swan, who was not at all happy with the weather.

Swanson and the snow

 

Well, Valentine's dinner worked out ok. The flowers were a day late, but still pretty. We received about three inches of slushy snow, most of which melted before freezing later in the evening. The surprise snowstorm caused a bunch of accidents and stranded hundreds of motorists. The really odd thing was that none of the weather forecasters saw it coming. I'm sure they will blame the forecasting models or data collection, but at the end of the day, Swanson still won't believe them... they just blew it.

 
