I was surprised to learn that a lot of people responded to a 10-07 de-duplication survey to say that they didn't even know what de-dupe was. So I thought I'd write some blog entries on that. This first one is a basic primer on the subject.
A de-duplication system identifies and eliminates redundant blocks of data, significantly reducing the amount of disk needed to store that data. It looks at the data at the sub-file (i.e., block) level and tries to determine whether it has seen each block before. If it hasn’t, it stores the block. If it has, it keeps only the one copy already stored and records every other reference to that block as a pointer to it. (Content Addressable Storage, or CAS, is basically file-level de-duplication. True de-duplication goes to the sub-file level, noticing blocks in common between different versions of the same file.)
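To make the "store it once, point to it everywhere else" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular product works: the `DedupeStore` class, the fixed 4 KB block size, and the use of SHA-256 as the block fingerprint are all assumptions of mine, and real systems often use variable-size blocks and far more careful indexing.

```python
import hashlib

# Hypothetical fixed block size; real products often use variable-size blocks.
BLOCK_SIZE = 4096


class DedupeStore:
    """Toy block-level de-duplication store (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes, each unique block stored once
        self.files = {}   # file name -> ordered list of fingerprints (the "pointers")

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Store the block only if it has never been seen before.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            # Either way, the file just records a pointer to the block.
            pointers.append(fingerprint)
        self.files[name] = pointers

    def read(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())
```

If you write two versions of a three-block file that differ in only one block, the store ends up holding four unique blocks rather than six, and both versions can still be read back intact.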
Examples of obvious duplicate data that a de-duplication system would store only once are listed below.
- The same file backed up from five different servers
- A weekly full backup when only 5% has changed (95% would be duplicate blocks from last week)
- A daily full backup of a database that doesn’t support incremental backups (most of it would be duplicate blocks from the day before)
- Incremental backups of files that change every day, such as a spreadsheet that gets updated daily. Only each day's changes would get stored.
Consider a typical data center with a mix of database and filesystem data that performs weekly full backups and daily incremental backups. A de-duplication system could reduce the amount of storage needed to store its backups by 20:1 or even more. (Those performing monthly full backups or using TSM's progressive incremental features will see a lower de-duplication ratio, but will still get plenty of benefit from a de-duplication system. Note that only two of the four examples of duplicated data above refer to full backups.)
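For a rough feel of where a de-duplication ratio comes from, here is a back-of-the-envelope sketch in Python. It is a deliberately simplified model that I made up for illustration: it counts full backups only, assumes the first full is stored whole, and assumes every later full adds just its changed fraction as new blocks. The function name `dedupe_ratio` and its parameters are hypothetical, and real-world ratios also depend on incrementals, retention, and the kind of data involved.

```python
def dedupe_ratio(full_gb, num_fulls, changed_fraction):
    """Ratio of data sent by the backup system to data actually stored.

    Simplified model (an assumption, not a vendor formula):
    - the first full backup is stored in its entirety
    - each later full stores only the fraction of blocks that changed
    """
    logical_gb = full_gb * num_fulls
    stored_gb = full_gb + (num_fulls - 1) * full_gb * changed_fraction
    return logical_gb / stored_gb
```

Under this model, 12 weekly fulls with 5% change each (the second example in the list above) work out to roughly 7.7:1, and the ratio keeps climbing the more fulls you retain. Three monthly fulls over the same period, assuming 20% change between them, come out to roughly 2.1:1, which is the pattern described above: fewer fulls means a lower ratio, but still a real saving.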
An effective price (that is, the cost per GB of backup data stored once de-duplication is taken into account) of less than $1/GB can make a de-duplication disk system around the same price as a similarly sized tape library. This enables customers to store data on disk that they previously would have stored on tape, while still getting all of the features that drew them to disk in the first place.
Data can be de-duplicated at the target or at the source. A system that de-duplicates at the target, such as a VTL or NAS disk target, allows you to continue using your current backup software and still benefit from de-dupe. The backup system continues to operate as it always has, and the target identifies and eliminates redundant data sent to it by the backup system. This saves storage space, but not bandwidth, as duplicated data is still being sent across the network.
To use de-duplication at the source, you must install backup client software from a de-duplication software vendor. That client then communicates with a backup server running the same software. If the client and server determine that data on the client has already been stored in the backup server, that data isn’t even sent to the backup server, saving both disk space and network bandwidth.
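The client/server exchange in source de-duplication can be sketched like this in Python. Again, this is an assumption-laden toy rather than any vendor's protocol: `BackupServer`, `missing`, `upload`, and `backup` are hypothetical names, blocks are a fixed 4 KB, SHA-256 stands in for the fingerprint, and the small amount of traffic needed to send the fingerprints themselves is ignored.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size


class BackupServer:
    """Toy backup server that remembers every block it has stored."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes

    def missing(self, fingerprints):
        # Tell the client which fingerprints the server has never seen.
        return {fp for fp in fingerprints if fp not in self.blocks}

    def upload(self, fingerprint, block):
        self.blocks[fingerprint] = block


def backup(server, data):
    """Back up `data` from the client; return the block bytes sent."""
    # The client fingerprints its own blocks before sending anything.
    chunks = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        chunks[hashlib.sha256(block).hexdigest()] = block

    # Only blocks the server does not already hold cross the network.
    sent = 0
    for fp in server.missing(chunks.keys()):
        server.upload(fp, chunks[fp])
        sent += len(chunks[fp])
    return sent
```

Back up the same data twice and the second run sends no block data at all, which is where the bandwidth saving comes from. A real product would also record which blocks make up each file so it can restore them; that bookkeeping is left out here to keep the sketch short.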
I'll make some more blog entries about de-dupe, but this one should give you an overview if you've never been sure exactly what it is.
This blog entry is in a series. The next entry in the series is "Two different types of de-duplication."
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technologist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.