Wednesday February 09, 2005 | SysBlog Notes from Storage R&D
What's Up With EMC Centera?

A couple of interesting articles just came out...

1) We've known from the beginning that Centera was pretty slow, but that was typically OK for the archival-oriented applications it was serving. We think this mostly has to do with the asymmetry of the clustered architecture, and it's something they can only incrementally improve through the use of faster Linux boxes.

2) Then we started hearing about hash collisions in the field causing data loss. The statisticians have always reported that the odds of a hash collision are extremely small (like 1/(2^big number)), but that assumes the data is random. In fact the applications are extremely homogeneous (e.g. 100 million 10KB check images), and now collisions are happening. Hash collisions can affect the unwary in different ways, so it's not clear what mechanism is causing the data loss. Perhaps it's the de-duplication process that deletes redundant versions of what are perceived to be identical files. Perhaps it's a read operation that opens the wrong file. Either way, this is a black eye for EMC that will continue to get blacker, and it is causing many vendors to scramble back to their designs. Fortunately our product is built assuming there will be collisions, and thus handles them nicely.

3) Now they admit that the namespace only supports up to 400M objects. Boy, two black eyes in one day. Anthony had a good calculation: "240,000 emails/hour or 5.76M emails/day or 11.52M Objects/day will exceed Centera 400M limit in 34.72 days". Now it should be clear why EMC is asking customers to containerize their email records to reduce file count.

If you really want a glimpse into the day-to-day issues of Centera, check out the CAS Yahoo group at http://groups.yahoo.com/group/CASTechGroup/.

( Feb 09 2005, 03:05:59 PM PST )
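For anyone who wants to check the numbers in the Centera post, here is a rough Python sketch. The first function just redoes Anthony's arithmetic; the factor of two objects per email is an assumption inferred from the quoted figures (5.76M emails/day becoming 11.52M objects/day). The second is the textbook birthday approximation behind the "1/(2^big number)" style of estimate. It assumes hash outputs are uniformly distributed, so it says nothing either way about the field reports mentioned above. The function names are made up for illustration.

```python
import math


def days_until_limit(objects_per_day: float, object_limit: float) -> float:
    """Days of steady ingest before the namespace limit is reached."""
    return object_limit / objects_per_day


def birthday_collision_probability(n_objects: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_objects,
    assuming uniformly distributed hash outputs:
    p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_objects * (n_objects - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny probabilities


emails_per_hour = 240_000
emails_per_day = emails_per_hour * 24   # 5,760,000
objects_per_day = emails_per_day * 2    # assumed 2 objects per email: 11,520,000

print(round(days_until_limit(objects_per_day, 400_000_000), 2))  # 34.72
print(birthday_collision_probability(100_000_000, 128))
```

Under the uniformity assumption, 100 million objects against a 128-bit hash gives a vanishingly small number, which is exactly why the reports of collisions in the field are surprising.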
Content Addressable Storage (CAS)

CAS is an interesting new approach to storage [well actually it's been around a while, just not broadly commercialized] where a crypto hash "checksum" is calculated for every file that's stored. The hash algorithm might be the lately maligned 128-bit MD5 or a 160-bit SHA1 or something different. The adjective "Content Addressable" is somewhat misleading since it implies that the hash is used as a file handle. Yet there are no systems that work like that today, including Centera, for which the CAS acronym was invented. Filepool, Scale8, and a variety of other researchy SSP-oriented storage designs did use hashes for internally-facing file handles, but none of those things are commercially deployed today afaik.
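To make the "hash as an internal handle" idea concrete, here is a toy in-memory sketch in Python. It is not how Centera, Filepool, Scale8, or any other product named here works; the class, the `(digest, index)` handle shape, and the pluggable `hash_fn` parameter are all hypothetical. The one design point it illustrates is that a store can compare the actual bytes when two digests match, so a collision neither de-duplicates away a distinct object nor returns the wrong one on read.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Handle = Tuple[str, int]  # (digest, position within that digest's bucket)


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class CollisionAwareStore:
    """Toy content-addressed store that assumes collisions can happen."""

    def __init__(self, hash_fn: Callable[[bytes], str] = sha1_hex) -> None:
        self._hash_fn = hash_fn
        self._buckets: Dict[str, List[bytes]] = {}

    def put(self, data: bytes) -> Handle:
        digest = self._hash_fn(data)
        bucket = self._buckets.setdefault(digest, [])
        for index, existing in enumerate(bucket):
            if existing == data:       # byte-for-byte duplicate: safe to de-dup
                return digest, index
        bucket.append(data)            # same digest, different bytes: keep both
        return digest, len(bucket) - 1

    def get(self, handle: Handle) -> bytes:
        digest, index = handle
        return self._buckets[digest][index]


# Force a collision with a deliberately useless hash to show the behavior.
store = CollisionAwareStore(hash_fn=lambda data: "same-digest")
first = store.put(b"check image A")
second = store.put(b"check image B")
print(first, second, store.get(second))
```

The byte comparison costs an extra read per matching digest, which is the price of not trusting the hash alone.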
Why use hashes? Well the real reason is that it simplifies the design for a storage system intended to be scalable by eliminating the need for distributed lock management across a clustered system. The serendipitous benefit is that it typically means objects are immutable, something useful for SEC-regulated applications.
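One way to read the lock-management point, sketched in Python under some loud assumptions: if both the address and the placement of an object are derived purely from its bytes, and stored objects are never modified, then two writers storing the same content end up with the same result and there is nothing to serialize with a cluster-wide lock. The modulo placement scheme and the names `node_for` and `WriteOnceNode` are hypothetical, and this toy ignores hash collisions entirely.

```python
import hashlib


def node_for(data: bytes, node_count: int) -> int:
    """Pick a node from the content alone; any client computes the same answer."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class WriteOnceNode:
    """Toy node: objects are stored under their digest and never overwritten."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha1(data).hexdigest()
        self._objects.setdefault(address, data)  # a repeat write is a no-op
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


nodes = [WriteOnceNode() for _ in range(4)]
record = b"trade confirmation 0001"
target = nodes[node_for(record, len(nodes))]
address = target.put(record)
assert target.put(record) == address  # second writer changes nothing
print(address, target.get(address))
```

Because an address can only ever refer to one piece of content, "updating" an object means writing a new one under a new address, which is where the immutability comes from.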
So the term CAS is not particularly accurate, is aligned pretty strongly with EMC, and is not particularly benefits-oriented, so the term will die a slow death over the next couple of years. In fact SNIA has already renamed its corresponding committee to move away from CAS.