What is deduplication? (updated 6-08)

I was surprised to learn that a lot of people responded to a 10-07 de-duplication survey to say that they didn't even know what de-dupe was.  So I thought I'd write some blog entries on that.  This first one is a basic primer on the subject.

A de-duplication system identifies and eliminates redundant blocks of data, significantly reducing the amount of disk needed to store said data. It looks at the data on a sub-file (i.e., block) level, and attempts to determine if it’s seen the data before. If it hasn’t, it stores it. If it has seen it before, it ensures that it is stored only once, and all other references to that data are merely pointers. (Content Addressable Storage, or CAS, is basically file-level de-duplication.  True de-duplication goes to the sub-file level, noticing blocks in common between different versions of the same file.)
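
To make the "store it once, point to it afterwards" idea concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any particular product works: the fixed 4 KB block size, the use of SHA-256 as the block fingerprint, and the `DedupStore` class are all hypothetical choices for illustration. Real systems often use variable-sized blocks and their own methods of recognizing data they have seen before.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size; real products vary


class DedupStore:
    """Toy block-level de-duplication store (illustration only)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes, each unique block kept once
        self.files = {}   # file name -> list of fingerprints (the "pointers")

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Store the block only if it has never been seen before.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            # Either way, the file just records a pointer to the block.
            pointers.append(fingerprint)
        self.files[name] = pointers

    def read(self, name):
        # Rebuild the file by following its pointers.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# Two versions of the same file: only the last of four blocks differs.
monday = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE + b"D" * BLOCK_SIZE
tuesday = monday[:-BLOCK_SIZE] + b"E" * BLOCK_SIZE

store = DedupStore()
store.write("report-monday", monday)
store.write("report-tuesday", tuesday)

logical = len(monday) + len(tuesday)
print(f"backed up {logical} bytes, stored {store.stored_bytes()} bytes")
```

Both versions can be read back in full, but the three blocks they have in common are held on disk only once.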

Examples of obvious duplicate data that a de-duplication system would store only once are listed below.

  1. The same file backed up from five different servers
  2. A weekly full backup when only 5% has changed (95% would be duplicate blocks from last week)
  3. A daily full backup of a database that doesn’t support incremental backups (most of it would be duplicate blocks from the day before)
  4. Incremental backups of files that change every day, such as a spreadsheet that gets updated every day. Only the changes would get stored every day.

Consider a typical data center with a mix of database and filesystem data that performs weekly full backups and daily incremental backups. A de-duplication system could reduce the amount of storage needed to store its backups by 20:1 or even more. (Those performing monthly full backups or using TSM's progressive incremental features will see a lower de-duplication ratio — but will still get plenty of benefit from a de-duplication system.  Please note that only two of the four examples of duplicated data above refer to full backups.)
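
To see why the ratio depends so heavily on how many full backups you keep, here is a back-of-the-envelope sketch in Python. The model is deliberately simplistic and its numbers are assumptions for illustration: it counts only weekly full backups, assumes a fixed fraction of new data each week, and ignores daily incrementals, compression, and duplicates across servers. It does not try to reproduce the 20:1 figure above.

```python
def dedup_ratio(full_size_gb, weekly_change, weeks_retained):
    """Ratio of data backed up to data actually stored, for weekly fulls only.

    Hypothetical model: the first full is stored whole, and every later
    full adds only its changed fraction as new blocks.
    """
    logical = full_size_gb * weeks_retained
    stored = full_size_gb + full_size_gb * weekly_change * (weeks_retained - 1)
    return logical / stored


# Assumed example: a 1000 GB full backup with 5% new data each week.
for weeks in (4, 12, 52):
    print(f"{weeks} weeks retained: {dedup_ratio(1000, 0.05, weeks):.1f}:1")
```

In this model the ratio climbs as retention grows, because each additional full backup adds a lot of backed-up data but very little newly stored data.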

An effective price of less than $1/GB can make a de-duplication disk system around the same price as a similarly sized tape library. This enables customers to store data on disk that they previously would have stored on tape, while still experiencing all of the features that drew them to disk in the first place.

Data can be de-duplicated at the target or source. A system that de-duplicates at the target, such as a VTL or NAS disk target, allows you to continue using your current backup software and still benefit from de-dupe. The backup system continues to operate as it always has, and the target identifies and eliminates redundant data sent to it by the backup system. This saves storage space, but not bandwidth, as duplicated data is still being sent across the network.

To use de-duplication at the source, you must install backup client software from a de-duplication software vendor. That client then communicates with a backup server running the same software. If the client and server determine that data on the client has already been stored in the backup server, that data isn’t even sent to the backup server, saving both disk space and network bandwidth.
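
The client/server exchange described above can be sketched roughly as follows. This is an assumption-laden toy in Python, not any vendor's protocol: the `BackupServer` class, the `backup` function, the fixed 4 KB blocks, and the SHA-256 fingerprints are all hypothetical, and the "network" is just a function call. It also counts only block payloads as traffic and ignores the small cost of sending the fingerprints themselves.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class BackupServer:
    """Toy backup server that remembers every block it has been sent."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes

    def missing(self, fingerprints):
        # Tell the client which fingerprints the server has never seen.
        return {fp for fp in fingerprints if fp not in self.blocks}

    def store(self, fingerprint, block):
        self.blocks[fingerprint] = block


def backup(server, data):
    """Client side: send only unseen blocks; return payload bytes sent."""
    blocks = {hashlib.sha256(b).hexdigest(): b for b in split_blocks(data)}
    needed = server.missing(blocks.keys())
    sent = 0
    for fingerprint in needed:
        server.store(fingerprint, blocks[fingerprint])
        sent += len(blocks[fingerprint])
    return sent


server = BackupServer()
day_one = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
day_two = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"D" * BLOCK_SIZE

first_sent = backup(server, day_one)   # nothing is known yet, so all blocks go
second_sent = backup(server, day_two)  # only the one changed block goes
print(first_sent, second_sent)
```

The second backup sends a single block across the wire, which is where the bandwidth savings of source de-duplication come from.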

I'll make some more blog entries about de-dupe, but this one should give you an overview if you've never been sure exactly what it is.

 

This blog entry is in a series.  The next entry in the series is Two different types of de-duplication.


Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

2 comments
  • I manage backups in a datacenter and we are looking at the de-dupe technology right now. A big concern we have is around the creating of offsite tapes. We have customers that have regulatory and compliance reasons for creating offsite tapes and storing them. We just reviewed EMC’s Avamar and they really do not have a good answer for creating offsite tape copies. The Avamar Utility station does not control the tape library and still requires another backup application to be involved to create the tape and then to recover from the external tape. We thought this to be very strange and something that we would have to seriously factor into our decision.

    Oh, the claims of de-dupe for Windows are around 688:1!!

  • De-dupe comes in many flavors, shapes and sizes. Don’t discount the concept based on one implementation that didn’t do what you wanted it to. Somebody else may really like how it works.

    The source de-dupe products (such as EMC/Avamar, Symantec PureDisk, Asigra) were designed to provide onsite AND offsite backups without creating tape. You back up using de-dupe backup software to a de-dupe backup server. Offsite copy is provided under one of two scenarios: back up a remote office server to a central datacenter, or replicate the de-dupe backup server to another location. In the first scenario, your first (and only) copy is on disk, but it’s in the central datacenter, not the remote office, so it’s already offsite. In the second scenario, you use the built-in capabilities of the de-dupe backup software to replicate the de-duped backups to another location. Now you have an onsite and an offsite backup and you haven’t touched a tape.

    NOW, if you don’t like that idea, and you want to create an onsite backup using de-dupe, and create a real tape to hand to your offsite vaulting vendor, then I’d recommend investigating target de-dupe systems, such as an intelligent disk target or virtual tape library. See my later blog entry called “Two different types of de-duplication” to see what I’m talking about.