Why good dedupe is important — and hard to do

Good dedupe creates real savings in disk and bandwidth requirements. It also makes previously impossible things possible, such as replicating even full backups offsite. Dedupe is behind many of the advances in backup and replication technology over the last decade or so.


What is dedupe?

Dedupe is the practice of identifying and eliminating duplicate data. It’s most common in backup technology but is now being implemented in primary storage as well.  Just as dedupe made disk more affordable as a backup target, dedupe makes flash more affordable as a primary storage target.  It also changes the economics of secondary storage for reference data.

The most common method of accomplishing this is to first chop up the data into chunks, which are analogous to blocks. We use the term chunk because block typically implies a fixed size, such as 8K. Dedupe systems often use pieces of variable size, so the term chunk was coined to refer to a piece of data to be compared.
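The chunking step can be sketched in a few lines of Python. This is a toy illustration only: the rolling-sum boundary test and the size parameters are made up for the example, and production systems use stronger rolling hashes such as Rabin fingerprints.

```python
# Toy content-defined chunking: cut wherever a rolling value hits a
# boundary condition, so chunk boundaries follow the content rather
# than fixed offsets. All parameters here are illustrative.

def chunk(data: bytes, mask: int = 0x3FF, min_size: int = 64, max_size: int = 4096):
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the rolling value matches the mask (content-defined
        # boundary) or when the chunk hits the maximum allowed size.
        if (length >= min_size and (rolling & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks

pieces = chunk(b"some example data " * 500)
assert b"".join(pieces) == b"some example data " * 500   # split is lossless
```

Because boundaries depend on content rather than offsets, inserting a byte near the front of a file only changes the chunks around the insertion point, which is what makes variable-size chunking dedupe-friendly.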

The chunk is then run through a cryptographic hashing algorithm such as SHA-256. Originally designed for security purposes, such algorithms produce an effectively unique value for each chunk of data. In the case of SHA-256 (part of the SHA-2 family), that value is 256 bits long, and we call it a hash. If two chunks of data have the same hash, they are considered identical and one is discarded.
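The hash-and-compare step can be sketched in a few lines of Python (the chunk values are made up for illustration):

```python
import hashlib

def dedupe(chunks):
    """Store each unique chunk once, keyed by its SHA-256 hash."""
    store = {}    # hash -> chunk (the single stored copy)
    recipe = []   # ordered list of hashes needed to rebuild the data
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        store.setdefault(h, c)   # duplicate chunks are discarded here
        recipe.append(h)
    return store, recipe

store, recipe = dedupe([b"alpha", b"beta", b"alpha", b"alpha"])
# Only two unique chunks are stored, yet the recipe rebuilds all four.
assert len(store) == 2
assert b"".join(store[h] for h in recipe) == b"alphabetaalphaalpha"
```

The `recipe` list is why dedupe is lossless: the original stream can always be reassembled from the stored chunks, no matter how many duplicates were discarded.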

The more redundant data you can identify, the more money you can save and the faster you can replicate data across the network.  So what kinds of things make effective dedupe?

True global dedupe

The more data you can compare, the more duplicate data you are likely to find. Many dedupe systems create data pools that do not talk to each other and thus significantly reduce their effectiveness.

Some dedupe systems only look for duplicate data within the backups of a single system, for example. They do not compare the files backed up from Apollo to the files backed up from Elvis. If your company has multiple email servers, there is a very high chance of duplicate data across them, as many people will send the same attachment to several recipients who may be hosted on different email systems. If you're backing up endpoints such as laptops, the chances of duplicate data are significant.

On the opposite end of the backup equation are backup appliances. Target dedupe appliances — even the most well-known ones — typically compare data stored on an individual appliance. The dedupe is not global across all appliances.  Each target dedupe appliance is a dedupe silo.

This is also true when using different backup systems. If you are using one backup system for your laptops, another to back up Office365, and another to back up your servers, you are definitely creating dedupe silos as well.

A truly global dedupe system would compare all data to all other data. It would compare files on a mobile phone to attachments in emails. It would compare files on the corporate file server to files stored on every laptop.  It would identify a single copy of the Windows or Linux operating system and ignore all other copies.

Dedupe before backup

The most common type of dedupe today is target appliance dedupe, and it's decidedly less effective than deduping at the source. The first reason it's less effective is that it requires a significant amount of horsepower to crack the backup format and look at the actual data being backed up. Even then, it's deduping chunks of backup strings instead of chunks of actual files; it's inferring the underlying data rather than actually looking at it. The closer you get to the actual files, the better dedupe you're going to get.

The second reason it's less effective is that you spend a lot of CPU time, I/O resources, and network bandwidth transferring data that will eventually be discarded. Some dedupe appliance vendors have recognized this issue and created specialized drivers that try to dedupe the data before it's sent to the appliance, which validates the idea that the backup client is the best place to dedupe data.

The final reason why dedupe should happen before the data reaches an appliance is that when you buy dedupe appliances, you pay for the dedupe multiple times. You pay for it in the initial dedupe appliance, and you may pay extra for the ability to dedupe before the data gets to the appliance. If you replicate the deduped data, you have to replicate it to another dedupe appliance that costs as much as the first one.

Application-aware dedupe

Another reason to dedupe before you back up is that at the filesystem layer the backup software can actually understand the files it's backing up. It can understand that it's looking at a Microsoft Word document, or a SQL Server backup string. If it knows that, it can slice and dice the data differently based on its data type.

For example, did you know that Microsoft Office documents are actually ZIP files? Change a .docx extension to .zip and double-click it; it will open as a ZIP file. A dedupe process running at the filesystem layer can do just that, looking at the actual contents of the ZIP file rather than at a jumble of chunks of data at the block layer.
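You can see the same thing from a script using Python's standard zipfile module. To keep this sketch self-contained it builds a tiny stand-in archive in memory, but a real .docx opens exactly the same way:

```python
import io
import zipfile

# Build a stand-in "docx" in memory so the example needs no real file.
# A genuine .docx has the same ZIP structure with many more parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:document/>")

# Listing the archive is all it takes to see inside an Office file.
with zipfile.ZipFile(buf) as z:
    print(z.namelist())   # ['[Content_Types].xml', 'word/document.xml']
```

Point `zipfile.ZipFile()` at an actual .docx path and `namelist()` will show entries like `word/document.xml` and any embedded images under `word/media/`.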

How much can you actually save?


I try to keep my blogs on backupcentral.com relatively agnostic, but in this one I feel compelled to use my employer (Druva) as an example of what I’m talking about. I remember seven years ago watching Jaspreet Singh, the CEO of Druva, introduce Druva’s first product to the US.  He talked about how good their dedupe was, and I remember thinking “Yea, yea… everybody says they have the best dedupe.”  Now that I’ve seen things on the inside, I see what he was talking about.

I've designed and implemented many dedupe systems over the years. Based on that experience, I'm comfortable using the 2X rule of thumb: if you have a 100 TB datacenter, your dedupe system is going to need at least 200 TB of disk capacity to back it up with any kind of retention.

For clarification, when I say 100 TB, I’m talking about the size of a single full backup, not the size of all the backups.  A typical environment might create 4000 TB of backup data from a 100 TB datacenter, which gets deduped to 200 TB.  That’s why a good rule of thumb is to start with 2X the size of your original environment.
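The arithmetic behind that rule of thumb, using the numbers above:

```python
# Numbers from the example: a 100 TB datacenter producing 4000 TB of
# backups over the retention period, which dedupes down to 200 TB.
full_backup_tb = 100
total_backups_tb = 4000
stored_tb = 200

dedupe_ratio = total_backups_tb / stored_tb      # 20:1 dedupe ratio
capacity_multiple = stored_tb / full_backup_tb   # the "2X" rule of thumb

print(dedupe_ratio, capacity_multiple)   # 20.0 2.0
```

In other words, a 2X capacity rule already assumes a 20:1 dedupe ratio against the raw backup stream.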

Imagine my surprise when I was told that the Druva rule of thumb was .75X, meaning that in order to back up 100 TB of data with a year of retention, Druva would need only 75 TB of disk capacity. That's less than the size of a single full backup!

Since Druva customers only pay each month for the amount of deduped data that the product stores, this means their monthly bill is reduced by more than half (about 62%). Instead of paying for 200 TB, they're paying for 75 TB. Like I said, good dedupe saves a lot of money and bandwidth.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Evangelist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

No such thing as a “Pay as you go” appliance

I've never seen an appliance solution that I would call "pay as you go." I might call it "pay as you grow," but never "pay as you go." There is a distinct difference between the two.

What is “pay as you go?”

I’ll give you a perfect example.  BackupCentral.com runs on a Cpanel-based VM. Cpanel can automatically copy the backups of my account to an S3 account.   I blogged about how to do that here.

I tell Cpanel to keep a week of daily backups, four weeks of weekly backups, and three months of monthly backups. A backup of backupcentral.com is about 20 GB, and the way I store those backups in S3, I have about fifteen copies. That's a total of about 300 GB of data stored in Amazon S3 at any given time.

Last time I checked, Amazon bills me about $0.38/month. If I change my mind and decrease my retention, my bill drops. If I told Cpanel not to store the three monthly backups, my monthly bill would decrease by about 20%. If I increased retention to six months, my monthly bill would increase by about 20%.
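Since object storage bills linearly on bytes stored, you can sanity-check that "about 20%" figure from the retention schedule alone (assuming roughly equal-sized backups):

```python
# Retention schedule from above: 7 dailies, 4 weeklies, 3 monthlies.
daily, weekly, monthly = 7, 4, 3
copies = daily + weekly + monthly   # 14 retained copies

# Dropping the monthlies removes 3 of the 14 copies from the bill;
# adding three more months of retention adds the same fraction back.
print(round(monthly / copies * 100))   # 21 -> roughly the "about 20%"
```

That linearity is the whole point of pay as you go: the bill moves with retention the month you change it, in either direction.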

What is “pay as you grow?”


Instead of using S3 — which automatically ensures my data is copied to three locations — I could buy three FTP servers and tell Cpanel to back up to them. I would buy the smallest servers I could find. Each server would need to be capable of storing 300 GB of data.  So let’s say I buy three servers with 500 GB hard drives, to allow for some growth.

Time will pass and backupcentral.com will grow.  That is the nature of things, right?  At some point, I will need more than 500 GB of storage to hold backupcentral.com.  I’ll need to buy another hard drive to go into each server and install that hard drive.

Pay as you grow always starts with a purchase of hardware, and always more than you need at the time. This is done to allow for growth; typically you buy enough hardware to hold three years of it. Then a few years later, when you outgrow that hardware, you either replace it with a bigger system (if it's fully depreciated) or you grow it by adding more nodes/blocks/chunks/bricks/whatever.

Every time you do this, you are buying more than you need at that moment, because you don't want to keep buying and installing new hardware every month. Even if the hardware you're buying is the easiest hardware in the world to buy and install, pay as you grow is still a pain, so you minimize the number of times you have to do it. And that means you always buy more than you need.

What’s your point, Curtis?

The company I work for (Druva) has competitors that sell "pay as you grow" appliances, but they often refer to them as "pay as you go." I think the distinction is important. All of them start by selling you a multi-node solution for onsite storage, and (usually) another multi-node solution for offsite storage. These things cost hundreds of thousands of dollars just to start backing up a few terabytes.

It is in their best interests (for multiple reasons) to over-provision and over-sell their appliance configuration. If they do oversize it, nobody's going to refund your money when that appliance is fully depreciated and you find out you bought way more than you needed for the last three or five years.

What if you under-provision it?  Then you’d have to deal with whatever the upgrade process is sooner than you’d like.  Let’s say you only buy enough to handle one year of growth.  The problem with that is now you’re dealing with the capital process every year for a very crucial part of your infrastructure.  Yuck.

In contrast, Druva customers never buy any appliances from us. They simply install our software client and start backing up to our cloud-based system that runs in AWS. There's no onsite appliance to buy, nor do they need a second appliance to get the data offsite. (There is an appliance we can rent them to help seed their data, but they do not have to buy it.) In our design, data is already offsite. Meanwhile, the customer only pays for the amount of storage they consume after their data has been globally deduplicated and compressed.

In a true pay as you go system, no customer ever pays for anything they don't consume. Customers often pay up front for future consumption just to make the purchasing process easier, but if they buy too much capacity, anything they paid for in advance simply gets applied to the next renewal. There is no wasted capacity and no wasted compute.

In one model (pay as you grow), you have wasted money and wasted power and cooling while your over-provisioned system sits there waiting for future data. In the other model (pay as you go), you pay only for what you consume, and you have no wasted power and cooling.

What do you think?  Is this an important difference?



Bandwidth: Backup Design Problem #2

Getting enough bandwidth to get the job done is the second challenge of designing and maintaining a traditional backup system. (Tape is the first challenge, and it is solved by not using tape for operational backups.)


This is a major problem with any backup software product that does occasional full backups, which is most of the products running in today’s datacenter. Products that do full-file incremental backups also have this problem, although to a lesser degree.  (A full-file incremental backup is one that backs up an entire file when even one byte has changed.)

This is such a problem that many people would agree that backups stress your network more than anything else does. It is one of the main reasons people run backups at night.

This problem has been around for a long time.  I remember one time I was testing backups over the weekend, and accidentally set things up for backups to kick off at 10 AM the next day — which happened to be Monday. The network came to a screeching halt that day until we figured out what was happening and shut the backups off.

Backup system admins spend a lot of time scheduling their backups to even out this load. Some perform full backups only on the weekend, but this really limits the overall capacity of the system. I prefer to perform 1/7th of the full backups each night if I'm doing weekly fulls, or 1/28th of the full backups each night if I'm doing monthly fulls.

While this increases your system's capacity, it also requires constant adjustment to keep the full backups evened out as the size of systems changes over time. And once you've divided the full backups by 28 and spread them out across the month, you've created a barrier that you will hit at some point. What do you do when you're already doing as many full backups each night as you can? Buy more bandwidth, of course.
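One simple way to spread the fulls can be sketched in Python: hash each client name into one of the 28 nightly slots (the client names below are made up). Note that this evens out the count of fulls per night, not their size, which is why the constant adjustment mentioned above is still needed.

```python
import hashlib

def full_backup_night(client: str, nights: int = 28) -> int:
    """Deterministically map a client to one of N nightly full-backup slots."""
    digest = hashlib.md5(client.encode()).digest()
    return int.from_bytes(digest[:4], "big") % nights

# Hypothetical client names; each lands on the same night every month,
# so a client's full backup schedule never drifts.
clients = ["apollo", "elvis", "db01", "web01", "mail01"]
schedule = {c: full_backup_night(c) for c in clients}
print(schedule)
```

A real scheduler would weight the assignment by client size rather than hashing names, but the principle of 1/28th of the fulls per night is the same.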

How has this been fixed?


Luckily, this problem has been fixed. Products and services that have switched to block-level incremental-forever backups need significantly less bandwidth than those that haven't. A typical block-level incremental uses less than a tenth of the bandwidth of a typical incremental backup, and less than a hundredth of the bandwidth of a typical full backup.

Another design element of modern backup products and services is that they use global deduplication, which only backs up blocks that have changed and haven’t been seen on any other system. If a given file is present on multiple systems, it only needs to be backed up from one of them. This significantly lowers the amount of bandwidth needed to perform a backup.
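In code, global dedupe amounts to one shared index of chunk hashes that every system consults before sending anything over the wire; a minimal sketch:

```python
import hashlib

seen = set()   # the global index, shared across every backed-up system

def blocks_to_send(chunks):
    """Return only the chunks that no system has ever backed up before."""
    new = []
    for chunk in chunks:
        h = hashlib.sha256(chunk).digest()
        if h not in seen:
            seen.add(h)
            new.append(chunk)
    return new

# The same attachment on two mail servers is only transferred once:
assert blocks_to_send([b"attachment", b"inbox-a"]) == [b"attachment", b"inbox-a"]
assert blocks_to_send([b"attachment", b"inbox-b"]) == [b"inbox-b"]
```

Because only hashes need to be checked against the index, a client can decide what to send without transferring any of the duplicate data itself, which is where the bandwidth savings come from.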

Making the impossible possible



Lowering the bandwidth requirement creates two previously unheard-of possibilities: Internet-based backups and round-the-clock backups. The network impact of globally deduplicated, block-level incremental backups is so small that, for many environments, the data can be transferred over the Internet. It is also small enough that backups can often run throughout the day. And all of this can be done without the hassle mentioned above.

The better a product identifies blocks that have changed, and the more granular and global its deduplication, the more these things become possible. One of the best ways to determine how bandwidth-efficient a backup system is, is to ask the vendor how much storage is needed to store 90-180 days of backups. There is a direct relationship between that number and the amount of bandwidth you're going to need.
