Why good dedupe is important — and hard to do

Good dedupe creates real savings in disk and bandwidth requirements. It also makes the impossible possible, such as replicating even full backups offsite. Dedupe is behind many of the advancements in backup and replication technology over the last decade or so.

What is dedupe?

Dedupe is the practice of identifying and eliminating duplicate data. It’s most common in backup technology but is now being implemented in primary storage as well.  Just as dedupe made disk more affordable as a backup target, dedupe makes flash more affordable as a primary storage target.  It also changes the economics of secondary storage for reference data.

The most common method of accomplishing this is to first chop up the data into chunks, which are analogous to blocks. We use the term chunks because blocks typically imply a fixed size, such as 8K. Dedupe systems often use pieces of variable size, so the term chunk was coined to refer to a piece of data to be compared.

Each chunk is then run through a cryptographic hashing algorithm such as SHA-256. Originally designed for security purposes, such algorithms produce a value that is, for all practical purposes, unique to each chunk of data. In the case of SHA-256 (part of the SHA-2 family), that value is 256 bits long and is called a hash. If two chunks of data have the same hash, they are considered identical and one is discarded.
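
Here's a minimal sketch of that idea in Python, using fixed-size chunks and SHA-256. It's illustrative only: real dedupe systems typically use variable-size chunks and a far more sophisticated chunk store than an in-memory dictionary.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # illustrative; real systems often use variable-size chunks

def dedupe(paths):
    """Store each unique chunk once, keyed by its SHA-256 hash."""
    store = {}                 # hash -> chunk bytes (stand-in for a chunk repository)
    logical = physical = 0
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                digest = hashlib.sha256(chunk).hexdigest()
                logical += len(chunk)
                if digest not in store:    # new chunk: keep it
                    store[digest] = chunk
                    physical += len(chunk)
                # duplicate chunk: discard it; only the hash reference is kept
    return store, logical, physical

# logical / physical is the dedupe ratio achieved for the files passed in
```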

The more redundant data you can identify, the more money you can save and the faster you can replicate data across the network. So what kinds of things make dedupe effective?

True global dedupe

The more data you can compare, the more duplicate data you are likely to find. Many dedupe systems create data pools that do not talk to each other and thus significantly reduce their effectiveness.

Some dedupe systems only look for duplicate data within the backups of a single system, for example. They do not compare the files backed up from Apollo to the files backed up from Elvis. If your company has multiple email servers, there is a very high chance of duplicate data across them, as many people will send the same attachment to recipients hosted on different email systems. If you're backing up endpoints such as laptops, the chances of duplicate data are significant.

On the opposite end of the backup equation are backup appliances. Target dedupe appliances — even the most well-known ones — typically compare data stored on an individual appliance. The dedupe is not global across all appliances.  Each target dedupe appliance is a dedupe silo.

This is also true when using different backup systems. If you are using one backup system for your laptops, another to back up Office 365, and another to back up your servers, you are definitely creating dedupe silos as well.

A truly global dedupe system would compare all data to all other data. It would compare files on a mobile phone to attachments in emails. It would compare files on the corporate file server to files stored on every laptop.  It would identify a single copy of the Windows or Linux operating system and ignore all other copies.
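
To make the silo point concrete, here's a toy comparison in Python with made-up sources: dedupe each source against its own index and the shared attachment is stored once per source; dedupe everything against one global index and it is stored exactly once. All names and data below are hypothetical.

```python
import hashlib

CHUNK = 8192  # illustrative fixed chunk size

def chunk_hashes(data):
    """Hashes of the fixed-size chunks in a byte string."""
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

# Three hypothetical sources that all contain the same large attachment
attachment = b"quarterly-report" * 50_000
sources = {
    "apollo":    attachment + b"apollo mailbox data" * 10_000,
    "elvis":     attachment + b"elvis mailbox data" * 10_000,
    "laptop-42": attachment + b"local user files" * 10_000,
}

# Dedupe silos: each source keeps its own index, so shared chunks are stored per source
siloed = sum(len(chunk_hashes(data)) for data in sources.values())

# Global dedupe: one index for everything, so shared chunks are stored only once
global_index = set()
for data in sources.values():
    global_index |= chunk_hashes(data)

print(f"chunks stored in silos: {siloed}, with a global index: {len(global_index)}")
```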

Dedupe before backup

The most common type of dedupe today is target appliance dedupe, and it's absolutely less effective than deduping at the source. The first reason it's less effective is that it requires a significant amount of horsepower to crack open the backup format and look at the actual data being backed up. Even then, it's deduping chunks of backup streams instead of chunks of actual files; it's deducing the underlying data rather than actually looking at it. The closer you get to the actual files, the better dedupe you're going to get.

The second reason it's less effective is that you spend a lot of CPU time, I/O resources, and network bandwidth transferring data that will eventually be discarded. Some dedupe appliances have recognized this issue and created specialized drivers that try to dedupe the data before it's sent to the appliance, which validates the idea that the backup client is the best place to dedupe data.
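
Here's a rough sketch, under assumed names, of what source-side dedupe buys you: the client hashes its chunks locally, asks the backup service which hashes are new, and transfers only those chunks. Everything else never touches the network.

```python
import hashlib

CHUNK = 8192  # illustrative chunk size

class BackupService:
    """Stand-in for the dedupe store on the backup side."""
    def __init__(self):
        self.chunks = {}                              # hash -> chunk bytes

    def missing(self, hashes):
        return [h for h in hashes if h not in self.chunks]

    def upload(self, chunk):
        self.chunks[hashlib.sha256(chunk).hexdigest()] = chunk

def backup(path, service):
    """Source-side dedupe: hash locally, send only chunks the service doesn't have."""
    by_hash, total = {}, 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            by_hash[hashlib.sha256(chunk).hexdigest()] = chunk
            total += len(chunk)
    sent = 0
    for h in service.missing(list(by_hash)):          # one exchange for the hash list
        service.upload(by_hash[h])                    # only new chunks cross the network
        sent += len(by_hash[h])
    print(f"{total} bytes read, {sent} bytes actually transferred")
```

Compare that with a pure target appliance, which has to receive every byte before it can decide which ones to throw away.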

The final reason why dedupe should be done before it reaches an appliance is that when you buy dedupe appliances, you pay for the dedupe multiple times. You pay for it in the initial dedupe appliance, and you may pay extra for the ability to dedupe before the data gets to the appliance. If you replicate the deduped data, you have to replicate it to another dedupe appliance that costs as much as the initial one.

Application-aware dedupe

Another reason to dedupe before you back up is that at the filesystem layer the backup software can actually understand the files it's backing up. It can tell that it's looking at a Microsoft Word document, or a SQL Server backup stream. Knowing that, it can slice and dice the data differently based on its data type.

For example, did you know that Microsoft Office documents are actually ZIP files?  Change a .docx extension to .zip and double-click it.  It will open up as a zip file. A dedupe process running at the filesystem layer can do just that and can look at the actual contents of the zip file, rather than looking at a jumble of chunks of data at the block layer.
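
As a quick illustration (the file name here is made up), Python's standard zipfile module will happily open a .docx and show you the parts inside, which is exactly the kind of visibility an application-aware, filesystem-level dedupe process can exploit:

```python
import zipfile

# A .docx is really a ZIP archive of XML parts and embedded media.
with zipfile.ZipFile("report.docx") as doc:     # hypothetical file name
    for member in doc.infolist():
        print(member.filename, member.file_size)
    body = doc.read("word/document.xml")        # the actual document text lives here
```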

How much can you actually save?

I try to keep my blogs on backupcentral.com relatively agnostic, but in this one I feel compelled to use my employer (Druva) as an example of what I’m talking about. I remember seven years ago watching Jaspreet Singh, the CEO of Druva, introduce Druva’s first product to the US.  He talked about how good their dedupe was, and I remember thinking “Yea, yea… everybody says they have the best dedupe.”  Now that I’ve seen things on the inside, I see what he was talking about.

I’ve designed and implemented many dedupe systems throughout the years. Based on that experience, I’m comfortable using a 2X rule of thumb, meaning that if you have a 100 TB datacenter, your dedupe system is going to need at least 200 TB of disk capacity to back it up with any kind of retention.

For clarification, when I say 100 TB, I’m talking about the size of a single full backup, not the size of all the backups.  A typical environment might create 4000 TB of backup data from a 100 TB datacenter, which gets deduped to 200 TB.  That’s why a good rule of thumb is to start with 2X the size of your original environment.
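
If you want to see where numbers like that come from, here's the back-of-the-envelope version. The schedule, change rate, and dedupe ratio below are assumptions for illustration, but they land close to the 4000 TB and 200 TB figures above.

```python
# Back-of-the-envelope math behind 100 TB -> ~4,000 TB streamed -> ~200 TB stored.
# Every number below is an assumption used for illustration only.
primary_tb     = 100        # size of one full backup of the datacenter
daily_change   = 0.10       # ~10% of the data changes each day
weeks_retained = 26         # roughly six months of retention

weekly_stream = primary_tb + 6 * daily_change * primary_tb   # one full + six incrementals
streamed_tb   = weeks_retained * weekly_stream               # ~4,160 TB sent to the backup target

dedupe_ratio = 20                                            # a commonly cited target-dedupe ratio
stored_tb    = streamed_tb / dedupe_ratio                    # ~208 TB on disk, i.e. the 2X rule of thumb

print(f"streamed: {streamed_tb:,.0f} TB, stored after dedupe: {stored_tb:,.0f} TB")
```

Change any of those assumptions and the totals move, but the shape of the math is why target dedupe systems end up sized at multiples of the primary data.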

Imagine my surprise when I was told that the Druva rule of thumb was .75X, meaning that in order to back up 100 TB of data with a year of retention, Druva would need only 75 TB of disk capacity. That’s less than the size of a single full backup!

Since Druva customers only pay each month for the amount of deduped data the product stores, their monthly bill is reduced by more than half (about 62%). Instead of paying for 200 TB, they’re paying for 75 TB. Like I said, good dedupe saves a lot of money and bandwidth.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

5 comments
  • Can you expand on this, as it doesn’t make sense to me at the moment? Where do you get 4000 TB from?

    “For clarification, when I say 100 TB, I’m talking about the size of a single full backup, not the size of all the backups. A typical environment might create 4000 TB of backup data from a 100 TB datacenter, which gets deduped to 200 TB. That’s why a good rule of thumb is to start with 2X the size of your original environment.”

    • You can get to 4000 TB when you take into account the “streamed” data to the backup system. If there is 100 TB of protected data, and there’s a 10% daily change rate, each day you’re backing up another 10 TB of changed information. You then keep that for about 15 days, and then every week you take another Full (another 100 TB), of which 90% is deduped, so you’re still only doing about 10 TB for that backup. You keep that Weekly Full for 5 weeks and once a month you take a Monthly Full, again 10 TB. So, that’s 100 TB on day 1 and 10 TB per day for the next 30 days == 400 TB. Repeat that monthly for a single year and now you have 4,800 TB of streamed data.

      • That’s what I was talking about. My math’s a bit different than yours, though. Most people store their backups far longer than 15 days – way longer. Usually three months or more at least. So that’s 12 weekly fulls w/90 days of incrementals. You easily get to 20X the original amount of data protected. I’ve used 20X as my rule of thumb for many years.

        And, of course, I believe that dedupe should happen at the source, so that 4800 is never replicated in the first place. You never again take a full, and your incremental backups are block-level unique-only data.