Monday March 14, 2005 | SysBlog Notes from Storage R&D
CAS is dead, long live CAS!
There's a lot of interest among customers and suppliers in CAS and in how to address the CAS market. That interest is not misdirected: the CAS systems marketed to date have interesting properties and solve real customer problems. However, those properties have nothing to do with the fact that the systems calculate crypto-hash checksums for the files being stored.
There is nothing in these valuable benefits about hashing or hash algorithms. These are the properties of Object Archival Storage Systems, which is a far more appropriate way of describing the breed. Dare I propose a new acronym, OASS? Well, the SNIA committee charged with standardizing these things is working hard on its own answer, and I'll defer to them. They used to call themselves the CAS Solutions Initiative (CASSI), but they too have seen the light.

*As a point of fact, none of the CAS systems on the market today are actually CAS. CAS implies that stored objects are accessed using a hash value computed from the file's contents, but no commercially available system does this today. For example, EMC's Centera uses a "C-clip" as the object handle, which is an amalgamation of a metadata record and the object hash. Other CAS/OS systems may use other, more reliable ways of creating unique object identifiers that have nothing to do with the hash value. So it would seem that the term "CAS" is meaningless, and we all hope it dies. But Object Storage is here to stay because it solves previously unsolved problems of scale, reliability, and TCO. Somewhere in there is a role for computing hash values, but that feature will become less and less visible to customers, especially as Object Storage moves into a primary storage role.
Different Models for strategic asset archival
I spent last Tuesday near London participating in a very interesting meeting of the PrestoSpace organization. PrestoSpace is organized mainly by the good folks at the BBC, but it is intended as a knowledge-sharing forum for a variety of government and commercial owners of "legacy" media collections. They are putting their heads together to solve some of the challenges of taking old collections of media (film canisters, for example) and converting them into archives, both for preservation and to improve access.

What should be made clear is that not all archive applications are the same. In fact, I describe the various implementations as belonging to 4 categories, each of which imposes slightly different requirements on the storage and serving infrastructure:
1) Heritage archives:
2) Compliance archives:
3) Repurposing archives:
4) Digital Distribution repositories:

If we are to derive a specification for a storage system that is going to serve all of these applications, we need to tackle the following problems:

So isn't the answer obvious?
Hash cracking and storage
Announcement of SHA-1 crypto hash cracking here. Ancient Chinese proverb: "It's not about the algorithm, it's about how you use it." Storage vendors continue to discount market concerns about hash collisions by saying "the odds of a hash collision are infinitesimal". Well, I know a customer with 2B objects in storage. Is 1/2,000,000,000 small enough? Yeah, I know, 1/(2^80) or something of that form is the statistical answer. The point is that storage systems have to do better. If there is a non-zero probability of hash collision, then the system must accept and welcome hash collisions! Hashes cannot be used exclusively to validate the uniqueness of a data object.
( Feb 23 2005, 09:03:35 AM PST )
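To make the hash-collision post above concrete, here is a minimal Python sketch of a store that "accepts and welcomes" collisions: a digest match is treated as a hint, and a byte-for-byte comparison decides whether two objects really are the same. The class name, the (digest, slot) object ID, and the deliberately weak `weak_hash` are all assumptions made for illustration; this is not how any particular product works.

```python
import hashlib


class CollisionTolerantStore:
    """Toy in-memory object store: a hash match is a hint, never proof."""

    def __init__(self, hash_fn=None):
        # hash_fn is injectable so a deliberately weak hash can force collisions.
        self._hash_fn = hash_fn or (lambda data: hashlib.sha1(data).hexdigest())
        # digest -> list of distinct payloads that share that digest
        self._buckets = {}

    def put(self, data: bytes):
        """Store data and return a hypothetical object ID: (digest, slot)."""
        digest = self._hash_fn(data)
        bucket = self._buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == data:  # byte-for-byte check confirms a true duplicate
                return (digest, slot)
        bucket.append(data)  # same digest, different bytes: keep both objects
        return (digest, len(bucket) - 1)

    def get(self, object_id):
        """Return the exact bytes stored under object_id."""
        digest, slot = object_id
        return self._buckets[digest][slot]


def weak_hash(data: bytes) -> str:
    """Deliberately terrible 1-byte hash, used only to provoke collisions."""
    return format(sum(data) % 256, "02x")


store = CollisionTolerantStore(hash_fn=weak_hash)
id_ab = store.put(b"ab")
id_ba = store.put(b"ba")  # same byte sum as b"ab", so the weak hash collides
print(id_ab, id_ba, store.get(id_ab), store.get(id_ba))
```

The design choice is that the digest only narrows the search. Identity is settled by comparing content, so a collision costs an extra slot instead of silently overwriting or discarding data.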
Messaging to the world on Honeycomb
Honeycomb is a storage project that started under the management of Bill Joy and Greg Papadopoulos in the CTO organization. It started with the assumption that even the so-called next-generation storage systems being proposed still don't solve the underlying problems in the large-scale file-storage marketplace.

Honeycomb is a collection of hardware and software technologies that solve problems around next-generation, large-scale "data hungry" applications. That includes better methods for reliability, availability, scaling, and even searching and organizing data. Honeycomb's features were explicitly designed to address the customer problems articulated above. Currently, no technology solution on offer addresses the following customer pain points effectively. Honeycomb is being designed to fill this void in the market. Honeycomb can be deployed as technology components that complement existing NAS products, or as a standalone storage system.

Honeycomb demonstrates Sun's dedication to solving next-generation data storage and management problems. It's not about simply beating the competition, it's about giving customers strategically powerful data management solutions. LAN-attached storage, including NAS, CAS, HSM, and other file-based services, calls upon the ultimate convergence of CPUs, OS, protocols, and networks. All of these things are core competencies of Sun. If we think towards next-generation devices, we look to clustering, cryptography, consolidation and grid capabilities, load balancing, database, utility models, and a host of other areas, again all core competencies of Sun. The challenge is to make them all work together to solve unsolved customer problems. From Jonathan down, we are committed to making that happen, and that's why I work here.

I know what you're thinking... "the devil is in the details". Well, the details above are all I can provide until later this year, when the NDA covers can be lifted a bit.
Stay tuned for more.
What's Up With EMC Centera??
A couple of interesting articles just came out...

1) We've known from the beginning that Centera was pretty slow, but that was typically OK for the archival-oriented applications it was serving. We think this has mostly to do with the asymmetry of the clustered architecture, and it is something they can only incrementally improve through the use of faster Linux boxes.

2) Then we started hearing about hash collisions in the field causing data loss. The statisticians have always reported that the odds of a hash collision are extremely small (like 1/(2^big number)), but that assumes that the data is random. In fact the applications are extremely homogeneous (e.g. 100 million 10KB check images), and now collisions are happening. Hash collisions can affect the unwary in different ways, so it's not clear what mechanism is causing the data loss. Perhaps it's the de-duplication process, which deletes redundant versions of what are perceived to be identical files. Perhaps it's a read operation that opens the wrong file. Either way, this is a black eye for EMC that will continue to get blacker, and it is causing many vendors to scramble back to their designs. Fortunately, our product is built assuming there will be collisions, and thus handles them nicely.

3) Now they admit that the namespace only supports up to 400M objects. Boy, two black eyes in one day. Anthony had a good calculation... "240,000 emails/hour or 5.76M emails/day or 11.52M Objects/day will exceed Centera 400M limit in 34.72 days". Now it should be clear why EMC is asking customers to containerize their email records to reduce file count.

If you really want a glimpse into the day-to-day issues of Centera, check out the CAS Yahoo group at http://groups.yahoo.com/group/CASTechGroup/.
( Feb 09 2005, 03:05:59 PM PST )
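Anthony's calculation in the Centera post above can be reproduced in a few lines of Python. The function and parameter names are made up for this sketch, and the factor of two objects per email is an inference from the quoted figures (5.76M emails/day becoming 11.52M objects/day), not something the quote states outright.

```python
def days_until_full(emails_per_hour: int, objects_per_email: float, limit: int) -> float:
    """Days until a namespace capped at `limit` objects fills up."""
    emails_per_day = emails_per_hour * 24
    objects_per_day = emails_per_day * objects_per_email
    return limit / objects_per_day


EMAILS_PER_HOUR = 240_000
OBJECTS_PER_EMAIL = 2          # assumption inferred from the quoted numbers
NAMESPACE_LIMIT = 400_000_000  # the 400M-object limit cited in the post

print(f"{EMAILS_PER_HOUR * 24:,} emails/day")                       # 5,760,000
print(f"{EMAILS_PER_HOUR * 24 * OBJECTS_PER_EMAIL:,} objects/day")  # 11,520,000
days = days_until_full(EMAILS_PER_HOUR, OBJECTS_PER_EMAIL, NAMESPACE_LIMIT)
print(f"{days:.2f} days until the limit is reached")                # 34.72
```

Lowering `objects_per_email` below 1 models packing several emails into one stored object, which is the effect containerizing is meant to have on the file count.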
Content Addressable Storage (CAS)
CAS is an interesting new approach to storage [well, actually it's been around a while, just not broadly commercialized] where a crypto hash "checksum" is calculated for every file that's stored. The hash algorithm might be the lately maligned 128-bit MD5, or a 160-bit SHA1, or something different. The adjective "Content Addressable" is somewhat misleading, since it implies that the hash is used as a file handle. Yet there are no systems that work like that today, including Centera, for which the CAS acronym was invented. Filepool, Scale8, and a variety of other researchy SSP-oriented storage designs did use hashes for internally-facing file handles, but none of those things are commercially deployed today afaik.
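As a small illustration of the per-file "checksum" described above, the following Python sketch computes MD5 and SHA-1 digests with the standard `hashlib` module and prints the digest width of each algorithm. The helper name `content_digest` and the sample payload are assumptions for the example only.

```python
import hashlib


def content_digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of `data` under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


payload = b"example check image bytes"  # stand-in for a stored file's contents

for name in ("md5", "sha1"):
    bits = hashlib.new(name).digest_size * 8  # digest_size is in bytes
    print(f"{name}: {bits}-bit digest {content_digest(payload, name)}")

# Identical content always produces the identical digest, which is what
# would let the digest act as a handle in a truly content-addressed system.
assert content_digest(payload) == content_digest(bytes(payload))
```

The digest depends only on the bytes, so two files with the same contents map to the same value no matter what they are named or where they came from.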
Why use hashes? Well, the real reason is that it simplifies the design of a storage system intended to be scalable, by eliminating the need for distributed lock management across a clustered system. The serendipitous benefit is that it typically means objects are immutable, something useful for SEC-regulated applications.
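The post does not say how hashing removes the need for distributed locks, so the following Python sketch shows only one plausible reading, under stated assumptions: every node derives an object's owner from its content hash, so no coordination is needed to agree on placement, and writes are write-once, so there is nothing to lock for updates. The node names, `owner_for`, and `WriteOnceNode` are all hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster members


def owner_for(data: bytes, nodes=NODES) -> str:
    """Pick the owning node from the content hash alone (no shared state)."""
    digest = hashlib.sha1(data).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class WriteOnceNode:
    """Toy node: objects are immutable, so a repeated put is a no-op."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # setdefault never overwrites, so concurrent writers of the same
        # content cannot conflict. A real system would still need to verify
        # the bytes to guard against hash collisions.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


cluster = {name: WriteOnceNode() for name in NODES}
blob = b"scanned check image"
node = owner_for(blob)
key = cluster[node].put(blob)
print(node, key)
```

Any client or node that runs `owner_for` on the same bytes reaches the same answer independently, which is the property that stands in for a lock manager in this sketch.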
So the term CAS is not particularly accurate, is aligned pretty strongly with EMC, and is not particularly benefits-oriented, so the term will die a slow death over the next couple of years. In fact, SNIA has already renamed its corresponding committee to move away from CAS.