SysBlog
Notes from Storage R&D

Friday February 11, 2005

OOO Well I have no hair left. OpenOffice paragraph numbering is completely frustrating. I have one document where no combination of numbering features will provide a basic nested numbered outline (1., 1.1., 1.1.1., etc). In fact there are 3 different ways to set numbering (format an outline, change a paragraph style, format paragraph) and they all seem completely disconnected.

I have a second document where everything behaves *perfectly*; even the paragraph styles change appropriately as things are promoted/demoted. But for this second document, all of the numbering features in all 3 places are completely turned *off*. I have no idea how they got this all to work. Grrrrrrr. Then there's the whole issue of a mysterious numbering toolbar that clearly exists but can't be displayed. Nice. ( Feb 11 2005, 09:06:41 AM PST ) Permalink Comments [0]

Wednesday February 09, 2005

What's Up With EMC Centera?? A couple interesting articles just came out...
EMC dodges question on Centera performance
Security flaw could put EMC Centera users at risk
Scalability Hampers Large Email Archives

1) We've known from the beginning that Centera was pretty slow, but that was typically ok for the archival-oriented applications it was serving. We think this has mostly to do with the asymmetry of the clustered architecture, and something that they can only incrementally improve through the use of faster Linux boxes.

2) Then we started hearing about hash collisions in the field causing data loss. The statisticians have always reported that the odds of a hash collision are extremely small (like 1/(2^big number)), but that assumes that data is random. In fact the applications are extremely homogeneous (e.g. 100 million 10KB check images) and now collisions are happening. Hash collisions can affect the unwary in different ways, so it's not clear what mechanism is causing the data loss. Perhaps it's the de-duplication process that deletes redundant versions of what are perceived to be identical files. Perhaps it's a read operation that opens the wrong file. Either way, this is a black eye for EMC that will continue to get blacker, and it is causing many vendors to scramble back to their designs. Fortunately our product is built assuming there will be collisions, and thus handles them nicely.
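To make the de-duplication failure mode concrete, here is a minimal sketch of a store that treats a matching hash only as a hint and compares the actual bytes before declaring two objects identical. This is not how Centera or any shipping product works; the `DedupStore` class, its `digest_fn` parameter, and the (digest, index) handle are all hypothetical names made up for illustration.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Default digest function: SHA-1 as a hex string."""
    return hashlib.sha1(data).hexdigest()


class DedupStore:
    """Toy in-memory store: de-duplicates by hash, but verifies the bytes."""

    def __init__(self, digest_fn=sha1_hex):
        self._digest_fn = digest_fn
        # digest -> list of distinct payloads that happen to share that digest
        self._buckets = {}

    def put(self, data: bytes):
        digest = self._digest_fn(data)
        bucket = self._buckets.setdefault(digest, [])
        for index, existing in enumerate(bucket):
            if existing == data:
                # True duplicate: same hash AND same bytes, safe to share.
                return (digest, index)
        # Either a new digest or a collision: keep the payload separately.
        bucket.append(data)
        return (digest, len(bucket) - 1)

    def get(self, handle) -> bytes:
        digest, index = handle
        return self._buckets[digest][index]
```

Passing a deliberately weak `digest_fn` (say, one that returns only the length of the data) forces collisions and shows that two different payloads with the same digest still come back intact, whereas a store that trusted the hash alone would have silently dropped one of them.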

3) Now they admit that the namespace only supports up to 400M objects. Boy, two black eyes in one day. Anthony had a good calculation: "240,000 emails/hour or 5.76M emails/day or 11.52M Objects/day will exceed Centera 400M limit in 34.72 days". Now it should be clear why EMC is asking customers to containerize their email records to reduce file count.
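Anthony's numbers are easy to re-derive. The quick check below assumes two stored objects per email, which is what the quoted ratio of 11.52M objects to 5.76M emails implies; the constant names are mine.

```python
EMAILS_PER_HOUR = 240_000
OBJECTS_PER_EMAIL = 2          # assumption implied by the quote (11.52M / 5.76M)
NAMESPACE_LIMIT = 400_000_000  # the 400M object limit

emails_per_day = EMAILS_PER_HOUR * 24
objects_per_day = emails_per_day * OBJECTS_PER_EMAIL
days_to_limit = NAMESPACE_LIMIT / objects_per_day

print(f"emails/day:  {emails_per_day:,}")
print(f"objects/day: {objects_per_day:,}")
print(f"days until the limit is hit: {days_to_limit:.2f}")
```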

If you really want a glimpse into the day-to-day issues of Centera, check out the CAS Yahoo group at http://groups.yahoo.com/group/CASTechGroup/.

( Feb 09 2005, 03:05:59 PM PST ) Permalink Comments [1]

Content Addressable Storage (CAS) CAS is an interesting new approach to storage [well actually it's been around a while, just not broadly commercialized] where a crypto hash "checksum" is calculated for every file that's stored. The hash algorithm might be the lately maligned 128-bit MD5 or a 160-bit SHA-1 or something different. The adjective "Content Addressable" is somewhat misleading since it implies that the hash is used as a file handle. Yet there are no systems that work like that today, including Centera, for which the CAS acronym was invented. Filepool, Scale8, and a variety of other researchy SSP-oriented storage designs did use hashes for internally-facing file handles, but none of those things are commercially deployed today afaik.
For file handles...

  • Centera uses something called a "C-clip", which combines the object hash with a metadata identifier.
  • Sun's CIS uses MD5 hashes for auditing and compliance, but a conventional file system and filename scheme for naming files.
  • Honeycomb uses yet something else.

So CAS is a loose term for things that
a) are WORM systems and
b) calculate a hash somewhere along the way.
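Those two properties fit in a few lines. The sketch below is a toy, not any vendor's design: it uses a conventional name as the handle, refuses overwrites (the WORM part), and records a SHA-1 along the way purely for auditing. `WormStore` and its method names are hypothetical.

```python
import hashlib


class WormStore:
    """Toy write-once store that records a hash for each object."""

    def __init__(self):
        # name -> (data, SHA-1 hex digest recorded at write time)
        self._objects = {}

    def write(self, name: str, data: bytes) -> str:
        if name in self._objects:
            # Write once, read many: an existing object is never replaced.
            raise PermissionError(f"{name!r} has already been written")
        digest = hashlib.sha1(data).hexdigest()
        self._objects[name] = (data, digest)
        return digest

    def read(self, name: str) -> bytes:
        data, _ = self._objects[name]
        return data

    def audit(self, name: str) -> bool:
        """Recompute the hash and compare it with the one recorded at write time."""
        data, digest = self._objects[name]
        return hashlib.sha1(data).hexdigest() == digest
```

Note that the hash never appears in the read path here: the object is found by name, and the digest only serves to show later that the bytes have not changed.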

Why use hashes? Well the real reason is that it simplifies the design for a storage system intended to be scalable by eliminating the need for distributed lock management across a clustered system. The serendipitous benefit is that it typically means objects are immutable, something useful for SEC-regulated applications.
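One way to read the lock-management point: if an object's key is derived from its content and the object never changes, any node can work out where the object belongs from the key alone, and two writers storing the same bytes end up at the same place with the same data. The sketch below is my own illustration under that assumption (and it ignores collisions entirely); the node names and the modulo placement rule are hypothetical, not taken from any product.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def object_key(data: bytes) -> str:
    """Key derived purely from content, so every writer computes the same one."""
    return hashlib.sha1(data).hexdigest()


def node_for(key: str, nodes=NODES) -> str:
    """Pick a node from the key alone; no coordination with other nodes needed."""
    return nodes[int(key, 16) % len(nodes)]


def write(cluster: dict, data: bytes) -> str:
    """Store data on the node its key maps to; repeating the write is a no-op."""
    key = object_key(data)
    node_objects = cluster.setdefault(node_for(key), {})
    # setdefault never replaces an existing entry, so objects stay immutable.
    node_objects.setdefault(key, data)
    return key
```

Writing the same bytes twice returns the same key and leaves exactly one copy in the cluster, which is the property that lets a design like this skip distributed locking for writes.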

So the term CAS is not particularly accurate, is aligned pretty strongly with EMC, and is not particularly benefits-oriented, so the term will die a slow death over the next couple of years. In fact SNIA has already changed the name of the relevant committee away from CAS.

( Feb 09 2005, 02:48:43 PM PST ) Permalink Comments [0]

Monday February 07, 2005

Airplane wrecks Spent some time researching local aviation disasters. There's something very emotional about standing on the site of an aviation disaster. It's amazing how little can be left of a large airliner that hits the ground at 200 mph. The fragility of those craft is something to respect. The biggest Bay Area disaster was the United DC-6 that hit Tolman Peak near Fremont in 1951. I was up there but couldn't find any trace. I have 2 friends that have crashed their aircraft, and a third who died. Dave used his parachute over Altamont and broke a few vertebrae. Lynn stall-spun on the test flight of an experimental racer. Wayne was filming a movie and turned his Ag-Cat into the wrong canyon. Several other friends have landing-light souvenirs in their hangars. Bill cartwheeled spectacularly at the Moffett show last year and walked away from it (god bless Curtis Pitts). My closest call was aileron flutter over TCY that broke the wing-attach, bent 2 pushrods, and cracked 2 spars. Good thing the runway was 3000 ft away ;-) ( Feb 07 2005, 09:00:54 AM PST ) Permalink Comments [2]

In the beginning So here we go with our first shot at blogging... Hmmm, I don't feel any different. ( Feb 07 2005, 08:56:31 AM PST ) Permalink Comments [0]

