SysBlog
Notes from Storage R&D

Wednesday February 09, 2005

What's Up With EMC Centera?

A couple of interesting articles just came out:
EMC dodges question on Centera performance
Security flaw could put EMC Centera users at risk
Scalability Hampers Large Email Archives

1) We've known from the beginning that Centera was pretty slow, but that was typically OK for the archival-oriented applications it was serving. We think the slowness has mostly to do with the asymmetry of the clustered architecture, which is something they can only incrementally improve by using faster Linux boxes.

2) Then we started hearing about hash collisions in the field causing data loss. The statisticians have always reported that the odds of a hash collision are extremely small (like 1/(2^big number)), but that assumes the data is random. In fact the applications are extremely homogeneous (e.g. 100 million 10KB check images), and now collisions are happening. Hash collisions can affect the unwary in different ways, so it's not clear what mechanism is causing the data loss. Perhaps it's the de-duplication process that deletes redundant versions of what are perceived to be identical files. Perhaps it's a read operation that opens the wrong file. Either way, this is a black eye for EMC that will continue to get blacker, and it is causing many vendors to scramble back to their designs. Fortunately our product is built assuming there will be collisions, and thus handles them nicely.
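To make the two failure modes above concrete, here is a minimal Python sketch of a content-addressed store that never treats a matching digest as proof of identical content. It is an illustration only, not our product's implementation and not Centera's: the names `weak_digest`, `CollisionAwareStore`, and `find_collision` are hypothetical, and the digest is deliberately truncated to 8 bits so that a collision is easy to produce in a demo.

```python
import hashlib


def weak_digest(data: bytes) -> str:
    """Deliberately weak 8-bit digest (assumption for the demo) so collisions are easy to find."""
    return hashlib.md5(data).hexdigest()[:2]


class CollisionAwareStore:
    """Toy content-addressed store: a digest match alone never counts as 'same file'."""

    def __init__(self, digest=weak_digest):
        self._digest = digest
        self._buckets = {}  # digest -> list of distinct payloads sharing that digest

    def put(self, data: bytes) -> str:
        key = self._digest(data)
        bucket = self._buckets.setdefault(key, [])
        for index, existing in enumerate(bucket):
            if existing == data:
                # Bytes really are identical: safe to de-duplicate.
                return f"{key}-{index}"
        # Same digest but different bytes: keep both under distinct addresses.
        bucket.append(data)
        return f"{key}-{len(bucket) - 1}"

    def get(self, address: str) -> bytes:
        key, index = address.rsplit("-", 1)
        return self._buckets[key][int(index)]


def find_collision(digest=weak_digest):
    """Brute-force two different payloads that share the same weak digest."""
    seen = {}
    counter = 0
    while True:
        data = f"check-image-{counter}".encode()
        key = digest(data)
        if key in seen:
            return seen[key], data
        seen[key] = data
        counter += 1


if __name__ == "__main__":
    a, b = find_collision()
    store = CollisionAwareStore()
    addr_a, addr_b = store.put(a), store.put(b)
    print(addr_a, addr_b, store.get(addr_a) == a, store.get(addr_b) == b)
```

The byte comparison in `put` is what keeps de-duplication from discarding a colliding object, and the per-digest index in the address is what keeps a read from returning the wrong one. A real system would pay for that comparison with extra I/O, which is the trade-off a design has to accept if it assumes collisions will happen.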

3) Now they admit that the namespace only supports up to 400M objects. Boy, two black eyes in one day. Anthony had a good calculation: "240,000 emails/hour or 5.76M emails/day or 11.52M Objects/day will exceed Centera 400M limit in 34.72 days". Now it should be clear why EMC is asking customers to containerize their email records to reduce the file count.
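Anthony's numbers are easy to check. The sketch below just redoes the arithmetic in Python; the function name is made up for illustration, and the factor of two objects per email is an assumption inferred from the quoted 5.76M emails/day becoming 11.52M objects/day.

```python
EMAILS_PER_HOUR = 240_000
OBJECTS_PER_EMAIL = 2        # assumption: inferred from 5.76M emails -> 11.52M objects
OBJECT_LIMIT = 400_000_000   # the 400M object limit quoted above


def days_until_limit(emails_per_hour, objects_per_email, object_limit):
    """Return (emails/day, objects/day, days until the object limit is reached)."""
    emails_per_day = emails_per_hour * 24
    objects_per_day = emails_per_day * objects_per_email
    return emails_per_day, objects_per_day, object_limit / objects_per_day


if __name__ == "__main__":
    emails, objects, days = days_until_limit(
        EMAILS_PER_HOUR, OBJECTS_PER_EMAIL, OBJECT_LIMIT
    )
    print(f"{emails:,} emails/day, {objects:,} objects/day, {days:.2f} days")
```

Packing many emails into one stored object lowers `objects_per_email` below one, which is why containerizing stretches the time before the limit is hit.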

If you really want a glimpse into the day-to-day issues of Centera, check out the CAS Yahoo group at http://groups.yahoo.com/group/CASTechGroup/.

( Feb 09 2005, 03:05:59 PM PST )

Comments:

Pure FUD! I work with Centera all the time (yes, I do work for EMC, but I'm not their sock puppet; all my opinions are mine), and yeah, it can take around half a second to get to your data on a large CPP install, but the collision issues are BS. Anyone doing tons of tiny objects uses GM naming, which adds a timestamp down to the second to the ID. It's tons faster and eliminates any namespace collision issues. I'm not sure where you got that the namespace only supports 400M objects. That's BS also. A tiny Centera has a limit of that many objects, but that's because of a per-node object count limit. The MD5 namespace is wildly huge, and even with classic naming the odds of getting a collision on a 400M object Centera are wildly lower than the odds of getting a flipped bit on a disk that goes undetected by CRCs in a classic filesystem. Centeras have lost data, but never due to namespace collisions as far as I know. Centera is cool stuff, but it isn't a filesystem and you can't think of it like one. It's a storage system that handles everything (server load balancing/failover, redundant storage, redundant retrieval) from the application layer down, automatically, from just about any platform. I've never been a fan of EMC's products (ironic since I now work for them), but this is good stuff. It provides a fundamentally better interface to storage than a standard OS's filesystem API for applications storing fixed content.

Posted by Eric G on March 22, 2005 at 06:49 PM PST #
