Data Processing
Valdis's Weblog
Thursday May 24, 2007
The dupe in de-duplication

Encryption and de-duplication: can they co-exist?

Encryption and de-duplication, two techniques that each exploit new technologies and algorithms, have been worrying me for a while. Individually they sound good; combined, they do not seem to complement each other. Can they work together, or do they negate each other?

For example, if we encrypt data being sent to the storage device, will de-duplication still work? The de-dupe device receives encrypted data, which rarely contains duplicates. Blocks of data that were originally the same will rarely be the same once encrypted, so pattern matching will not detect duplicate blocks, reducing any de-duplication savings. Could we also get de-dupe collisions, where two different blocks of data are the same when encrypted? The de-dupe device would then store one of these and delete or lose the other block(s) of data. You get duped.

Various scenarios

1) Encrypt at source, de-dupe fails

If we encrypt at the source or in the network, before the de-dupe device, then de-dupe fails. It cannot detect duplicate source data, because that data is encrypted.
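To make scenario 1 concrete, here is a minimal sketch in Python. It assumes the source encrypts every block under a fresh random nonce and that the de-dupe device indexes blocks by SHA-256 fingerprint. The 4 KB block size, the function names and the toy SHA-256 keystream are all illustrative assumptions, not how any particular product works, and the keystream is not a vetted cipher.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed de-dupe block size, for illustration only


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter. Not a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_block(key: bytes, block: bytes) -> bytes:
    """Encrypt one block under a fresh random nonce, stored in front of the ciphertext."""
    nonce = os.urandom(16)
    stream = keystream(key, nonce, len(block))
    return nonce + bytes(a ^ b for a, b in zip(block, stream))


def unique_blocks(blocks) -> int:
    """Count distinct blocks by SHA-256 fingerprint, as a de-dupe index might."""
    return len({hashlib.sha256(b).digest() for b in blocks})


# 100 blocks with only 4 distinct contents: highly redundant source data.
plain = [bytes([i % 4]) * BLOCK_SIZE for i in range(100)]
key = os.urandom(32)
encrypted = [encrypt_block(key, b) for b in plain]

plain_unique = unique_blocks(plain)       # the device would store 4 blocks
cipher_unique = unique_blocks(encrypted)  # every block now looks unique
print(plain_unique, cipher_unique)
```

In the clear, the device needs to keep 4 of the 100 blocks. After encryption with per-block random nonces it sees 100 distinct blocks, so the de-dupe saving disappears.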

2) De-dupe first, encrypt afterwards

If we do not encrypt at source, data is sent in the clear and is not secure. If we put de-dupe before the encryption device, do we need a de-dupe unit for every server (initiator)? If so, we get de-dupe sprawl.

3) Hybrid, use what works and is available

Encrypting at source requires a lot of server CPU resource, taking it away from applications, although there may be a solution in the near future. So de-duplicate on a VTL, then store on tape that is encrypted and managed securely with a Key Management Station (KMS).

I think the real answer is to manage your storage more efficiently, so that we do not need de-duplication, i.e. prevent the problem rather than cure it once you are suffering. Have storage management policies and an ILM strategy. However, human nature does not work like this, so I believe de-duplication will be popular where people do not have time to prevent problems but would rather solve them.

I am sure that the computer industry will solve this dichotomy, but at the moment there is a schism. For early adopters: caveat emptor, buyer beware.

If you want to save energy, do not use it. If you want to save on storage costs, delete it, or store on tape.

Posted at 02:56PM May 24, 2007 by Valdis Filks in Technical  |  Comments[4]

Comments:

For practical purposes, encrypted storage needs to be at the block level - not the entire file level. That is, if I have a 1 TB file I should be able to seek to a block in the middle and use it without having to decrypt the first 500 GB.

In the case where I have two copies of the database (e.g. a production copy and a QA copy), most of the contents will likely be the same, but certain blocks will differ. If the same encryption key, algorithm, etc. are used on the two files, identical blocks from the two files should have the same ciphertext, and dedup should work just fine. Rather fortunately, if you first compressed and then encrypted each block, they would both have the same ciphertext as well.
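A rough sketch of that production/QA case, in Python. It assumes a deterministic per-block scheme in which the keystream depends only on the shared key and the block's position in the file. The cipher here is a toy built from SHA-256 purely for illustration, and all names and sizes are made up. Real block-level encryption would use a vetted tweakable mode such as XTS-AES, and XOR-ing two different plaintexts with the same keystream, as this toy does for the one differing block, is not safe in practice.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def keystream(key: bytes, block_index: int, length: int) -> bytes:
    """Toy keystream derived from the key and the block position only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        seed = key + block_index.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(seed).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_file(key: bytes, data: bytes) -> list:
    """Split data into blocks and encrypt each one deterministically by position."""
    blocks = []
    for index in range(0, len(data), BLOCK_SIZE):
        block = data[index:index + BLOCK_SIZE]
        stream = keystream(key, index // BLOCK_SIZE, len(block))
        blocks.append(bytes(a ^ b for a, b in zip(block, stream)))
    return blocks


key = hashlib.sha256(b"shared demo key").digest()  # hypothetical shared key

# Production copy: 10 blocks. QA copy: identical except for block 3.
production = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
qa = bytearray(production)
qa[3 * BLOCK_SIZE:4 * BLOCK_SIZE] = b"\xff" * BLOCK_SIZE

prod_ct = encrypt_file(key, production)
qa_ct = encrypt_file(key, bytes(qa))

# The dedup target fingerprints ciphertext blocks and stores each one once.
stored = {hashlib.sha256(block).digest() for block in prod_ct + qa_ct}
matching = sum(1 for p, q in zip(prod_ct, qa_ct) if p == q)
print(len(stored), matching)
```

Of the 20 ciphertext blocks written, 9 pairs match across the two copies, so the target only needs to store 11. The saving depends entirely on encryption being deterministic for a given key and block position.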

I would love to see some block-level dedup logic in ZFS. See the relevant paragraph at http://mail.opensolaris.org/pipermail/storage-discuss/2007-May/001189.html for details on how I think this could work.

Posted by Mike Gerdts on May 24, 2007 at 04:20 PM CEST #

I assume lots of things that I should not, but spelling them out would make these blogs even longer. Anyway, I agree encryption needs to be at the block level. Now, from my security days: if you encrypt two identical blocks and they are again exactly the same after encryption, you do not have a strong encryption technique, because you are creating a pattern, which makes it easier to crack. Bottom line: if after encryption you have identical blocks, how does the de-dupe device know which block was originally a duplicate of another block? Any identifier to signify this weakens the encryption.
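The pattern concern can be illustrated with a small Python sketch. It assumes a deterministic scheme where a given plaintext block always encrypts to the same ciphertext, wherever it appears. The one-line "cipher" is a toy stand-in and the block contents are invented. Someone without the key can then recover exactly which blocks repeat, which is the same information a de-dupe device relies on.

```python
import hashlib

BLOCK_SIZE = 16  # small blocks to keep the example readable


def encrypt_block_deterministic(key: bytes, block: bytes) -> bytes:
    """Toy deterministic encryption: same key and same block give the same output."""
    stream = hashlib.sha256(key).digest()[:len(block)]
    return bytes(a ^ b for a, b in zip(block, stream))


def repeat_pattern(blocks) -> list:
    """For each block, the index where that exact block was first seen."""
    first_seen = {}
    return [first_seen.setdefault(block, i) for i, block in enumerate(blocks)]


key = hashlib.sha256(b"demo key").digest()  # hypothetical key
plain = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE,
         b"A" * BLOCK_SIZE, b"C" * BLOCK_SIZE]
cipher = [encrypt_block_deterministic(key, b) for b in plain]

# An observer who only sees ciphertext still learns the repeat structure.
plain_pattern = repeat_pattern(plain)
cipher_pattern = repeat_pattern(cipher)
print(plain_pattern, cipher_pattern)
```

Both patterns come out identical, so the trade-off is direct: whatever lets the de-dupe device spot duplicates also tells an eavesdropper which blocks are the same.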

Posted by Valdis on May 24, 2007 at 05:44 PM CEST #

Don't forget there are two types of deduplication. The block-level one might coexist with encryption; check this link out: http://www.backupcentral.com/content/view/129/47/

Posted by selim on August 02, 2007 at 11:55 AM CEST #

I think that dedup collisions at the target shouldn't be an issue. If you are doing target dedup, and two different sources, encrypted by their respective keys, wind up having the same output, and are subsequently deduped and indexed, this is not an issue. (insert side discussion about source key preservation, needed for recovery, here) The big question now becomes - how effective will the dedup be, looking for patterns (the bigger the better) in what effectively should be white noise.

Yes, this is an old thread, but thanks for inspiring a spirited discussion here in the office about encryption and deduplication and pure-geek math ;-)

Posted by Colin on June 06, 2008 at 09:27 AM CEST #
