De-duplication is not compression

There's a lot of confusion about de-duplication, starting with the thought that it's just compression.  It's not.  Read on to see why not. 

The idea that de-duplication is really just compression re-branded is actually quite common.  For example, Robin Harris says in his StorageMojo blog that the industry needlessly invented a new term:

I still don’t get why the industry refers to “de-duplication” rather than compression – why use a well-understood term when you can invent a new one? – but they did make the point that compression rates depend on your data types, backup policies and retention policies. Basically the more stuff stays the same the higher your back up compression rate.

While I think Robin is a very bright guy and I enjoy reading StorageMojo, I'm going to have to differ here.  De-dupe is not compression because it works completely differently.  If you're curious about how compression works, read this article.  The basic idea is that we identify the most-repeated blocks of data, then substitute each of those blocks with a symbol that takes up less space than the repeated block.  For example, the letter e appears in this article dozens of times, and each letter e takes up 8 bits to store.  If we can replace every letter e with something that uses fewer than 8 bits, we have compressed the stream.  AND compression only works on a file-by-file, backup-by-backup basis.  Each file or stream is examined to identify the common elements within that stream, and those common elements are replaced with smaller symbols that represent them.  You do NOT compare the elements in this file with the elements in any other file.  If you did 10 full backups of the same 10 GB of data, and compressed each of them to 5 GB, you'd have 50 GB of data when you were done.
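
To make that concrete, here's a tiny Python sketch of the file-by-file behavior.  The stream and the sizes are made up for illustration; this isn't any backup product's actual code.

```python
import zlib

# Stand-in for one full backup stream (scaled way down from 10 GB).
backup_stream = b"mailbox record, attachment blob, mailbox record " * 20_000

# Compression treats each stream in isolation: all 10 "full backups" are
# compressed separately, so the savings never cross from one backup to another.
compressed_backups = [zlib.compress(backup_stream) for _ in range(10)]

original_total = len(backup_stream) * 10
compressed_total = sum(len(c) for c in compressed_backups)

print(f"original total:   {original_total:,} bytes")
print(f"compressed total: {compressed_total:,} bytes")
# Each copy shrinks nicely (the text inside the stream is repetitive), but we
# still store 10 compressed copies: the 10 x 5 GB = 50 GB of the example.
```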

De-duplication, on the other hand, looks at every block of data and tries to identify whether it's a block of data that's been seen before.  If it hasn't been seen before, it stores the entire block of data.  If it has been seen before, it replaces that block of data with a link to the first copy of that block.  If, like the previous example, you did 10 full backups of the same data, you would have one copy and 9 sets of links, resulting in 10 GB of disk, not 50 GB of disk.  (Many de-dupe products would then compress that 10 GB to 5 GB as well.)
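
Here's a minimal sketch of that block-level lookup, assuming fixed-size 64K blocks and a SHA-256 fingerprint.  Real products use their own (usually smarter, often variable-size) chunking and indexing; this is only meant to show the "store once, then link" idea.

```python
import hashlib
import os

BLOCK_SIZE = 64 * 1024  # illustrative; vendors pick their own chunking

def ingest(stream, store):
    """Split a backup stream into blocks, keep each unseen block once, and
    return a 'recipe' of fingerprints (links) that can rebuild the stream."""
    recipe = []
    for i in range(0, len(stream), BLOCK_SIZE):
        block = stream[i:i + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:      # never seen before: store the block
            store[fingerprint] = block
        recipe.append(fingerprint)        # already seen: just record a link
    return recipe

store = {}
full_backup = os.urandom(10_000_000)      # stand-in for the 10 GB full

recipes = [ingest(full_backup, store) for _ in range(10)]

unique_bytes = sum(len(b) for b in store.values())
print(f"backups ingested: {len(recipes)}, unique data stored: {unique_bytes:,} bytes")
# Only the first backup adds blocks; the other nine are just lists of links,
# which is why 10 fulls of the same data need roughly 10 GB, not 50 GB.
```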

Compression ratios differ based on the type of data you're backing up, and so do de-dupe ratios.  But de-dupe ratios change based on the type of data AND HOW you back up the data.  For example, repeated incremental backups of brand-new data (e.g. seismic data) would not de-dupe at all.  They would compress (a little bit), but they wouldn't de-dupe, because very little is common between each backup.  In comparison, if you back up an Exchange database, it's got TONS of duplicated data even within the first full backup, so even one full backup would get de-duped against itself.  (And the resulting backup would be smaller than a compressed backup of the same Exchange database.)  THEN if you did an incremental against the same Exchange database, it would likely contain more copies of the same attachments that were found in the full, and THOSE would get de-duped out.  That wouldn't happen with compression, but it does happen with de-dupe.
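
A quick illustration of why the HOW matters, again with made-up data and simplistic fixed-size blocks:

```python
import hashlib
import os

BLOCK = 64 * 1024

def fingerprints(stream):
    """Set of block fingerprints for a stream (fixed-size blocks for simplicity)."""
    return {hashlib.sha256(stream[i:i + BLOCK]).hexdigest()
            for i in range(0, len(stream), BLOCK)}

# Brand-new data in every backup (think fresh seismic data): nothing in common.
monday  = os.urandom(1_000_000)
tuesday = os.urandom(1_000_000)
print("new-data blocks shared:", len(fingerprints(monday) & fingerprints(tuesday)))

# The same attachment shows up in both the full and the incremental.
# (Offsets here are block-aligned on purpose; real products use variable-size
# chunking so duplicates are found even when they shift around in the stream.)
attachment  = os.urandom(256 * 1024)
full        = os.urandom(512 * 1024) + attachment + os.urandom(512 * 1024)
incremental = os.urandom(128 * 1024) + attachment
print("attachment blocks shared:", len(fingerprints(full) & fingerprints(incremental)))
```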

This brings to light another misconception: that a de-dupe ratio of 20:1 will allow you to back up a 20 TB filesystem to 1 TB of disk.  NO.  It will allow you to do 20 backups of your 20 TB filesystem to 20 TB of disk (perhaps less if the 20 TB filesystem has duplicated data inside it).   The first time you do a backup of the 20 TB filesystem, you're likely going to need about 10 TB of disk.  (This takes into account some duplicated data within the filesystem and/or some amount of compression.)  Then each additional full backup would need almost no disk, depending on how many new BLOCKS of data are found in each new backup.
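
The back-of-the-envelope arithmetic looks like this; the per-backup numbers are purely illustrative assumptions, not measurements from any product.

```python
# Rough math for a 20 TB filesystem backed up 20 times.
filesystem_tb  = 20
first_full_tb  = 10     # assumes ~2:1 from internal duplication/compression
later_full_tb  = 0.5    # assumes only a small number of new blocks per backup
backups        = 20

logical_tb = filesystem_tb * backups                     # 400 TB protected
disk_tb    = first_full_tb + later_full_tb * (backups - 1)

print(f"logical data protected: {logical_tb} TB")
print(f"disk actually consumed: {disk_tb} TB")
print(f"effective ratio:        {logical_tb / disk_tb:.1f}:1")
# Roughly 20:1 overall, yet the very first backup still needed ~10 TB of disk.
```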

Another question I hear is: if de-dupe has already eliminated the redundant blocks of data, how can you still compress what's left?  It's simple.  De-dupe looks at a very large block level, perhaps as big as 64K or 128K.  The bigger the block size, the better the performance.  The smaller the block size, the better the de-dupe ratio.  The de-dupe vendors attempt to strike a balance.  Then they often pass the remaining data to compression, which looks at it at a much finer level, finds duplicated elements within that 64K or 128K block, and replaces them with a symbol that takes up less space than the common element.
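
A toy sketch of that layering, de-dupe first and compression second.  The 64K block size, SHA-256 fingerprint, and zlib are stand-ins for illustration, not any vendor's actual pipeline.

```python
import hashlib
import zlib

BLOCK = 64 * 1024   # de-dupe at the large block level

def ingest(stream, store):
    """De-dupe first at the 64K block level, then compress whatever survives."""
    for i in range(0, len(stream), BLOCK):
        block = stream[i:i + BLOCK]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            # Only brand-new blocks reach the compressor, which hunts for
            # repeated elements *within* the 64K block.
            store[fp] = zlib.compress(block)

store = {}
data = (b"Subject: quarterly report, see attached spreadsheet. " * 6_000)[:BLOCK * 4]

ingest(data, store)
ingest(data, store)   # second pass: every block is a duplicate, nothing new stored

on_disk = sum(len(v) for v in store.values())
print(f"unique blocks stored: {len(store)}, bytes on disk: {on_disk:,}")
```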

Does this help?  Hopefully you understand now how compression is different than de-dupe.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

3 comments
  • Curtis,

    Good stuff! I realize that the de-dupe ship has sailed and no one is going to call de-dupe compression. My interest is the marketing of new technology: how do you communicate to maximize uptake? My point is that by inventing the term de-dup, the companies hurt themselves.

    Other markets aren’t such purists. MPEG-4 is my favorite example, since it is popularly known as compression, and it is a toolbox of compression techniques, not a single algorithm, and those techniques share a lot of similarities with de-dupe technology. De-dupe has more in common with image compression than text compression.

    Nor is de-dupe implemented the same way by the vendors, so it isn’t a single algorithm either. Data Domain has a patent on a technique for figuring out how to split the data into the chunks they use. Diligent does it differently, and if it figures a block is similar enough they’ll delta the two and store the differences. In either case, both techniques look like out-of-order MPEG-4 compression.

    The technology aside, I believe the de-dupe folks set themselves back 12-18 months by inventing a new term for buyers to learn. De-dupe has some wrinkles that you’ve ably pointed out, yet from the perspective of accelerating product uptake, they were hardly worth the confusion the industry created for itself.

    Great technology, lousy marketing. I’ll link to you from my post on StorageMojo.

    Robin

  • I decided to write another blog entry in response. I love a good discussion.

  • I apologize for replying to an old post.

    However, since the IT field has some relation to computer science, it seems counterproductive to say de-dup is NOT compression when most science textbooks would tell you it is compression.

    I understand that from a marketing or IT professional perspective the terms imply different practices; however, that does not mean one is not part of the other.

    Just because most people think car when you say vehicle, that does not make it correct to say a truck is NOT a vehicle.

    Please feel no anger or offense from me by the use of caps, it is meant for emphasis not attitude.

    Best regards-