Thursday May 24, 2007
Encryption and de-duplication: can they co-exist?
Two technologies, encryption and de-duplication, have been worrying me for a while. Individually they sound good; combined, they do not seem to complement each other. Can they work together, or do they negate each other?
For example, if we encrypt data being sent to the storage device, will de-duplication still work? The de-dupe device receives encrypted data, which almost never repeats. Blocks of data that were originally the same will rarely be the same once encrypted, so pattern matching will not detect duplicate blocks, reducing any de-duplication savings. Could we also get de-dupe collisions, where two different blocks of data are the same once encrypted? The de-dupe device would then store one of them and lose the other block(s) of data. You get duped.
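To make the first worry concrete, here is a minimal sketch of a hash-based de-dupe count before and after encryption. Everything in it is an assumption for illustration: the 4 KB block size, the helper names, and above all the cipher, which is a toy keystream built from SHA-256 with a random nonce per block, standing in for a real cipher and not suitable for actual use.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). A stand-in for a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_block(key: bytes, block: bytes) -> bytes:
    """Encrypt one block with a fresh random nonce, prepended to the output."""
    nonce = os.urandom(16)
    stream = keystream(key, nonce, len(block))
    return nonce + bytes(a ^ b for a, b in zip(block, stream))


def unique_blocks(blocks) -> int:
    """Count the blocks a hash-indexed de-dupe store would actually keep."""
    return len({hashlib.sha256(b).digest() for b in blocks})


key = os.urandom(32)
# 100 blocks, but only 4 distinct contents: ideal de-dupe material.
plain = [bytes([i % 4]) * BLOCK_SIZE for i in range(100)]
encrypted = [encrypt_block(key, b) for b in plain]

print(unique_blocks(plain))      # 4
print(unique_blocks(encrypted))  # 100, barring an astronomically unlikely nonce repeat
```

With a fresh nonce per block, every ciphertext is different, so the de-dupe store keeps all 100 blocks instead of 4.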
Various scenarios
1) Encrypt at source, de-dupe fails
If we encrypt at the source or in the network, before the de-dupe device, then de-dupe fails. It cannot detect duplicate source data because the data is encrypted.
2) De-dupe first, encrypt afterwards
If we do not encrypt at source, data is sent in the clear and is not secure. If we put de-dupe before the encryption device, do we need a de-dupe unit for every server (initiator)? That leads to de-dupe sprawl.
3) Hybrid, use what works and is available
Encrypting at source requires a lot of server CPU resource, taking it away from applications; there may be a solution in the near future. So de-duplicate on a VTL, then store on tape that is encrypted and managed securely with a Key Management Station (KMS).
I think the real answer is to manage your storage more efficiently, so that we do not need de-duplication, i.e. prevent the problem rather than cure it once you are suffering. Have storage management policies and an ILM strategy. However, human nature does not work like this, so I believe de-duplication will be popular where people do not have time to prevent problems but would rather solve them.
I am sure that the computer industry will solve this dichotomy, but at the moment there is a schism. For early adopters: caveat emptor, buyer beware.
If you want to save energy, do not use it. If you want to save on storage costs, delete the data or store it on tape.
For practical purposes, encrypted storage needs to be at the block level - not the entire file level. That is, if I have a 1 TB file I should be able to seek to a block in the middle and use it without having to decrypt the first 500 GB.
In the case where I have two copies of a database (e.g. a production copy and a QA copy), most of the contents will likely be the same, but certain blocks will differ. If the same encryption key, algorithm, etc. are used on both files, identical blocks from the two files should have the same ciphertext and dedup should work just fine. Rather fortunately, if you first compressed and then encrypted each block, they would both have the same ciphertext as well.
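A rough sketch of that production/QA case, under stated assumptions: the cipher is a toy keystream derived from SHA-256, deterministic and with no per-block nonce or position tweak (a stand-in only, not secure), and the block size, key, and helper names are made up for illustration. The point holds only when encryption is deterministic in this way.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def keystream(key: bytes, length: int) -> bytes:
    """Toy deterministic keystream: same key always gives the same stream."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def compress_then_encrypt(key: bytes, block: bytes) -> bytes:
    """Compress a block, then encrypt it with no per-block randomness."""
    packed = zlib.compress(block)
    stream = keystream(key, len(packed))
    return bytes(a ^ b for a, b in zip(packed, stream))


def split_blocks(data: bytes):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


key = b"shared-key-for-both-copies"  # hypothetical key used for both files

# "Production" file: 8 blocks with distinct contents.
production = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
# "QA" copy: identical except for one changed byte in block 3.
qa_buf = bytearray(production)
qa_buf[3 * BLOCK_SIZE] ^= 0xFF
qa = bytes(qa_buf)

prod_ct = [compress_then_encrypt(key, b) for b in split_blocks(production)]
qa_ct = [compress_then_encrypt(key, b) for b in split_blocks(qa)]

shared = sum(1 for p, q in zip(prod_ct, qa_ct) if p == q)
stored = len(set(prod_ct + qa_ct))
print(shared)  # 7 of the 8 block pairs have identical ciphertext
print(stored)  # 9 unique blocks kept instead of 16
```

If the cipher mixed a random nonce or a per-file value into each block, the seven matching pairs would no longer match and the dedup savings would disappear.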
I would love to see some block-level dedup logic in ZFS. See the relevant paragraph at http://mail.opensolaris.org/pipermail/storage-discuss/2007-May/001189.html for details on how I think this could work.
Posted by Mike Gerdts on May 24, 2007 at 04:20 PM CEST #
Posted by Valdis on May 24, 2007 at 05:44 PM CEST #
Posted by selim on August 02, 2007 at 11:55 AM CEST #
I think that dedup collisions at the target shouldn't be an issue. If you are doing target dedup, and two different sources, encrypted with their respective keys, wind up having the same output and are subsequently deduped and indexed, that is fine. (Insert side discussion about source key preservation, needed for recovery, here.) The big question now becomes: how effective will the dedup be when it is looking for patterns (the bigger the better) in what should effectively be white noise?
Yes, this is an old thread, but thanks for inspiring a spirited discussion here in the office about encryption and deduplication and pure-geek math ;-)
Posted by Colin on June 06, 2008 at 09:27 AM CEST #