Why global dedupe matters

Note to Readers: I have updated thoughts on this topic and you can see them here.

My previous post on dedupe performance illustrated the impact that global dedupe has on the effective performance of different dedupe appliances.  I received a lot of comments from vendors that don’t have global dedupe, and they said one of two things.  The first was that the vendors claiming to have global dedupe didn’t really have it.  I know too much to believe that.  The second was that global dedupe isn’t as important as I was making it out to be.  Well, that’s definitely not the case, and that’s what this post is all about.  Read on to see why I think global dedupe is critical for larger environments.

First let me say that, despite what some vendors seem to think, global dedupe is real (FalconStor, IBM, NEC, and SEPATON all have it), and if a vendor really has it, then their 2, 4, or 8 systems really do behave like one system from a dedupe perspective.  If you don’t have it (DD, EMC, Exagrid, and Quantum don’t have it yet), then you cannot add your systems together and call them one system (from a dedupe perspective).   Unfortunately, looking at systems this way means cutting some vendors’ advertised numbers in half (EMC 4000 & NetApp VTL) or dividing them by 16 (Data Domain DDX).  Data Domain, NetApp, and I speak all the time, and I believe they understand what they have and don’t have, and they understand my position on their numbers.  EMC and I, on the other hand, have talked, but haven’t talked about the global dedupe “controversy,” so it’s no surprise that some of their employees are reacting strongly to a post that cuts their advertised numbers in half and multiplies other vendors’ numbers by eight.

 

And now the good news:  if you’re backing up less than 10-20 TB per day (including weekends) from your entire data center, global dedupe is not a problem you need to worry about.  10-20 TB a night requires a throughput of 230 to 460 MB/s if you have a 12-hour backup window.  If you consult the table in the performance post, you’ll find that almost all of the vendors can handle that level of throughput with a single node; therefore, global dedupe is not something you need to worry about.  However, if you’re backing up significantly more than 10-20 TB every night, global dedupe is a required feature, in my opinion.  Let’s talk about why.
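If you want to sanity-check that arithmetic for your own environment, here is a minimal back-of-the-envelope sketch (Python, assuming decimal units where 1 TB = 1,000,000 MB; the volumes and window are just the example numbers above, not a recommendation):

```python
# Back-of-the-envelope: ingest rate needed to land a nightly backup load
# inside a fixed backup window. Assumes 1 TB = 1,000,000 MB (decimal).

def required_throughput_mb_s(tb_per_night: float, window_hours: float) -> float:
    """MB/s needed to write tb_per_night of backups within window_hours."""
    return tb_per_night * 1_000_000 / (window_hours * 3600)

for tb in (10, 20):
    print(f"{tb} TB in a 12-hour window: {required_throughput_mb_s(tb, 12):.0f} MB/s")
# Prints roughly 231 and 463 MB/s -- the 230-460 MB/s range mentioned above.
```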

 

First let me say what doesn’t matter.  It’s not that vendors with global dedupe will get better dedupe ratios by comparing everything to everything.  In other words, I’m not saying that FalconStor, IBM, or SEPATON will get a higher dedupe ratio because they compare the blocks in Oracle to the blocks in Exchange or the filesystem.  While there may be a few common blocks between some of them, that is not where the bulk of duplicate data comes from. Most duplicate data is found between repeated backups of the same object.  For example, there will obviously be many duplicated blocks of data in two full backups of Exchange, Oracle, any database, or any filesystem.  There will also be many duplicate blocks between multiple incremental backups of an object, as incremental backups are rarely block-level incremental backups; they back up an entire database extent if one block has changed, or back up an entire file if you changed one block in it.  Therefore, as long as you’re always comparing the latest backup of a given Exchange database with the previous backup of that same Exchange database, you should eliminate most duplicate data.

 

I can hear the people in Santa Clara, Irvine, or Hopkinton yelling, “Well, that’s what we do!  If you’ve got two nodes in a cluster, you always send one set of backups to one node and another set of backups to the other node.  If you’ve got sixteen nodes in an array, you divvy up your backups into sixteen equal chunks and direct each of those chunks to one of the sixteen nodes in the array.  That way, backups of Exchange-server-A are always compared to previous backups of Exchange-server-A.  What’s the problem?”

 

The problem is that this asks the customer to do something they never had to do before.  With tape or non-deduped disk, they simply sent their backups to any device they could find; it didn’t matter where the backups went, only that the backup software product knew where they went.  You could load-balance across as many devices, and as many types of devices, as you needed with no ill effects.

 

Now they have to create equal-sized portions of their backups and send each portion to a single device.  The first challenge is that many customers don’t know how big their backups are.  They can usually tell you that they use eight tapes per night, and that this roughly translates into two terabytes of backups.  How big is each backup?  They might know how big some of the backups are, such as that one really large database they have to back up.  But the size of most of their backups will be a mystery, and solving that mystery takes a lot of research.  They either need to be good at querying their backup product and interpreting the results (which can actually be quite hard), or they need an advanced reporting tool, something many environments don’t have.

 

The next challenge is what I’ll call related backups.  While there aren’t a lot of duplicate blocks between Oracle and Exchange, there may be duplicate blocks between several different Exchange databases, or several different related filesystems. The best thing from a dedupe perspective (if you don’t have global dedupe) is to make sure all Exchange backups go to the same node.  What if they won’t fit on that node, either capacity-wise or throughput-wise?  Splitting those related backups across two nodes will decrease your dedupe ratio, which is another way of saying it will increase your cost.

 

Let’s suppose that you are able to properly group together related backups and create several equal-sized backup portions to divvy up amongst several dedupe appliances.  Wouldn’t local dedupe be OK then?  Sure, for a while.  But things never remain the same.  Systems grow at different rates, and at some point some systems will become too big to fit into a single dedupe appliance.  This means that the difficult process of creating equally sized backup portions is an ongoing one, adding to the operational cost of your systems.  If you had global dedupe, there would be no need for this process during deployment, and certainly no need to repeat it on a regular basis.

 

Availability is the next challenge because you’re pointing backups to a single device.  What happens if that device becomes unavailable?  Your backups stop, that’s what happens.  You could load-balance your backups across two devices to solve this problem, but then you would ruin your dedupe ratio. If one night a given server is backed up to one node, and the next night it’s backed up to another node, its data will get stored twice — exactly the thing you didn’t want to happen.  This decreases your dedupe ratio (AKA increases your cost).  Therefore, you’re stuck with pointing your backups to a single device and having no failover or load balancing.

 

The bigger your servers are, the more important load-balancing becomes.  In fact, it can actually save you money.  Suppose you had seven 20-TB servers to back up and a 12-hour backup window.  Assuming you were doing weekly full backups, the first thing you would do is schedule a full backup of one server each night, and you would do an incremental backup of the other six servers each night as well.  A 5-10% incremental would create an additional load of 1-2 TB per server per night, for a total of 6-12 TB of additional backup traffic, creating a total load of 26-32 TB per night and requiring a throughput of roughly 600-740 MB/s for a 12-hour backup window.

 

Now suppose your dedupe appliance can only handle 500 MB/s.  What must you do?  If it were a non-dedupe device, you would just buy a second one and load-balance across the two, right?  You’d have a total throughput capacity of 1000 MB/s and room to spare.  The same would be true if you had two nodes of a dedupe system with global dedupe.  However, if you don’t have global dedupe, you’d have to buy seven nodes instead of two. Why?  You can’t load balance, and you can’t send your full backups to one node and your incremental backups to another, because the dedupe ratio would be significantly reduced.  You would need a separate dedupe appliance for each server.  If you then scheduled things so one full backup ran per night, what you’d have is one box being used at close to capacity (almost 500 MB/s), and six boxes barely breaking a sweat, as they would only receive 1-2 TB of backups per night.  What a waste.
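To make the difference concrete, here is a rough sizing sketch of the seven-server example above (Python; the 20 TB fulls, 5-10% incrementals, 12-hour window, and 500 MB/s per node are the hypothetical numbers from the example, not any particular vendor’s specs):

```python
import math

def mb_s(tb: float, window_hours: float = 12) -> float:
    """MB/s required to move tb terabytes within the backup window."""
    return tb * 1_000_000 / (window_hours * 3600)

servers = 7
full_tb = 20                    # one staggered 20 TB full per night
incr_range_tb = (1, 2)          # 5-10% incrementals on the other six servers

nightly_tb = [full_tb + (servers - 1) * i for i in incr_range_tb]  # 26-32 TB/night
required = [mb_s(tb) for tb in nightly_tb]                         # ~600-740 MB/s

node_speed = 500  # MB/s per hypothetical dedupe appliance

# With global dedupe the nodes act as one target, so you can load-balance freely:
nodes_global = math.ceil(max(required) / node_speed)   # 2 nodes

# Without it, each server must always land on the same appliance, and on the night
# a server runs its 20 TB full it nearly saturates a node all by itself:
nodes_local = servers                                   # 7 nodes

print(nightly_tb, [round(r) for r in required], nodes_global, nodes_local)
```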

 

The next, albeit rarer, challenge is giant servers.  What happens when the backup of a single server needs more throughput or capacity than a single node can provide?  I know of several companies that have databases of 40, 50, 100, even 200 TB.  If you want to do a full backup of a 40 TB database in 12 hours, you need over 1100 MB/s!  If you had global dedupe, you could do that easily by spreading the load across multiple nodes; the pieces of that backup would find each other and get deduped together.  Try that with local dedupe.  None of the local dedupe systems listed in the table in the performance post can handle 1100 MB/s for 12 hours and still dedupe it within 24 hours.  Again, while this is much rarer than the other challenges discussed here, it is an insurmountable challenge for those who have it.

 

One counter-argument from some vendors that don’t have global dedupe is to point out that they (DD, EMC, Quantum) use a hash-based approach that compares everything to everything, whereas other vendors (IBM, SEPATON) use a delta-differential approach that only compares like backups to each other; for example, it compares the latest backup of Exchange to the previous backup of Exchange.  These vendors say that, because they’re comparing everything to everything, they will find duplicate data that the other vendors won’t find; therefore, while they don’t have global dedupe, they do have better dedupe.  Short answer: not so fast, Hoss.

 

First, not every global dedupe vendor uses a delta-differential approach; FalconStor is also a hashing vendor.  Second, I refer you to the earlier discussion in this blog post that most duplicate data is found when comparing like backups to each other; very little redundant data is found when you compare dissimilar backups to each other.  Third, delta differential vendors get better dedupe on the backups they do compare, as the delta differential approach looks at things at a deeper level of granularity.  Hashing vendors usually use a chunk size of 8 KB or larger.  What happens if just a few blocks in that 8 KB change?  A hashing vendor would see the entire 8 KB as a new chunk, as its hash would have changed.  A delta differential vendor, on the other hand, would be able to identify the unique blocks in the 8 KB and store only them.  Get it?
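To illustrate the chunk-size point, here is a toy sketch (Python; fixed 8 KB chunks fingerprinted with SHA-1 stand in for the general hash-based approach, not for any specific vendor’s implementation):

```python
import hashlib, os

CHUNK = 8 * 1024  # 8 KB fixed-size chunks

def chunk_fingerprints(data: bytes) -> list[str]:
    """Split data into fixed 8 KB chunks and fingerprint each one."""
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

backup_1 = os.urandom(10 * CHUNK)        # ten 8 KB chunks of data
backup_2 = bytearray(backup_1)
backup_2[4 * CHUNK + 100] ^= 0xFF        # change a single byte inside chunk #4

old = chunk_fingerprints(backup_1)
new = chunk_fingerprints(bytes(backup_2))
changed = sum(a != b for a, b in zip(old, new))
print(f"chunks the hashing approach must re-store: {changed} ({changed * CHUNK} bytes)")
# One changed byte makes the whole 8 KB chunk look new to the hashing approach;
# a delta-differential approach could store just the bytes that actually changed.
```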

 

One vendor suggested that their lack of global dedupe was offset by their system cost, which was much lower than the other vendors’ systems.  That may be a reason for you to consider them over a vendor that has global dedupe.  I would caution you to remember that you purchase the system once, but you use it forever.  Therefore, remember to consider the savings in operational costs that global dedupe brings.  (Take a second look at the paragraph that starts with “Let’s suppose…” for an example of the difference in operational cost.)

 

Another thing to consider when thinking about global dedupe is the install base of the product.  Some of the global dedupe products are very new and/or have very few customers.  At the opposite extreme you have Data Domain with local dedupe and around 3000 customers. IBM/Diligent has hundreds of customers, but I’m not sure how many of them are running the two-node cluster with global dedupe.  Quantum and EMC have also done quite well in a very short time selling systems that don’t have global dedupe.  Exagrid also has several hundred customers.  If you took all the customers of FalconStor, NEC, & SEPATON that are using a multi-node global dedupe product and added them together, I’m not sure you’d break 100.  (In defense of those vendors, though, the systems they have sold tend to be much larger than the ones sold by other vendors.  In addition, it’s quite difficult to sell against the juggernaut that is the Data Domain sales machine.  While their products may not have global dedupe, their salespeople are relentless, and they’re the acknowledged market leader.  I “pity the fool” who has to sell against those guys.)  Having said all that, if you need global dedupe, you need it.  The fact that you’re one of the first to use it shouldn’t stop you, assuming you test it and it works.

 

Summary: if you’re backing up less than 10-20 TB a night, don’t worry about global dedupe for now.  If you’re backing up significantly more than 10-20 TB a night, there are many reasons why you should only choose a vendor with global dedupe and few reasons for choosing one that doesn’t have it.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

16 comments
  • Curtis,

    Thanks for this whole series of blog posts.

    My only comment for this one is that I consider Exchange and the user file systems also to be related backups, just because so much of the data in the Exchange information store is attachments that are also on the file server(s) as users edit and email or receive and edit files.

    – howard

  • Howard,

    I couldn’t agree more. So if you split Exchange and filesystem data, you’ll also decrease your dedupe ratio.

    Of course, that helps solidify my point that global dedupe is important, right?

  • Of course, if you are not backing up the source filesystems (like individual laptops), then the point is entirely moot. I worked at a large corporation where the individual laptops/desktops were not backed up and were left entirely to the individuals to do some kind of daily/weekly copy of important data to another location. What that meant was that for the most part, that stuff didn’t get backed up at all.

  • With all due respect, the common practice to which you refer is wrong, wrong, wrong — especially with the technology available for the last 5-10 years or so. It’s totally possible to back up all laptops and desktops.

    In addition, I think Howard was referring to companies that have fileservers where users store their files, which is also a very common occurrence. In that scenario, you will find common data between those fileservers and Exchange/Notes.

  • So, not having global dedupe is much like not having a tape library, but instead having individual drives connected to different servers? Which works fine if you either don’t have a large amount of data or if your environment is fairly static.

  • OK, so let’s follow this logic through just a little bit, making an assumption of 20:1 deduplication. A DL3000 from EMC has 148 TB of usable storage. At 20:1, that would be roughly 3 PB of backup data. 3 PB of backup data is roughly 7500 LTO3 tapes (no compression) or 3750 at 2:1 compression.

    So, more accurately, not having global dedup is like not having a tape library that scales past 3750 cartridges. Which is pretty much everybody, in practice. Yes some very big SL8500s and IBM libraries can, but in practice they tend not to due to floor space restrictions. Instead, customers opt for multiple libraries as they don’t require contiguous rack space (we are talking about 30-40′ or more of contiguous rack space!).

    So, does global dedup matter? Probably more in the abstract than in reality. In reality, it only accounts for about 10% of the savings from dedup. In reality, it doesn’t matter at all unless you have more than 3 PB of backup data to retain on disk.

    So does it matter? At some point–but probably that point is pretty far away for a lot of folks. Even for those who have reached that point, and who retain more than 3 PB, it is still only one of many things that matter.

  • Mr. Waterhouse, the whole point of global dedupe and why it is important has nothing to do with 20:1 ratios, PB’s of retained data, LTO3’s, or tape libraries. The heart of the matter is how much data are you backing up tonight and how much will you back up tomorrow night, but also, and more importantly, how much easier does it make my life. As a storage admin, I do not want the added administrative burden of deciding which node/head/appliance my Exchange data will go to tonight and then having to assure it gets to that same node/head/appliance the next night for the sake of dedupe, just because the 2 nodes/heads/appliances can’t sit down for coffee and reconcile their differences. I would much rather be able to send my Exchange data to the backup target and know that dedupe will handle all of that data equally regardless of ingest point. It also makes load balancing easier. Not having to calculate how much data is going through one head and then trying to balance that number against how much is going through the other saves me time and headaches.

    So does it matter? Maybe not to you right now but certainly matters to me.

  • @Aaron

    That means a lot coming from you (a real actual customer). Real comments from real customers always have more weight in my courtroom.

  • You call this a "thought exercise," as if it’s not based in reality. It’s completely the opposite. I’m trying to use numbers and logic to help people understand what I see every day in the field: customers who back up more than 10-20 TB a night want and NEED global dedupe. Not having it costs them money in disk; it costs them money in opex; and it just makes things harder. You can ignore that FACT if you’d like, but it IS a fact. I see it EVERY day. (Just today, I talked to a customer that needs 4500 MB/s to back up ONE SERVER in a twelve-hour period. Yes, that’s an EXTREME case, but I’m just making the point that this all comes from real people and real datacenters, not "thought exercises.")

    Second, this is about throughput, NOT capacity, because it’s throughput that causes dedupe customers to buy more boxes — not capacity. In the real world, very few local-dedupe boxes (regardless of vendor) ever scale to the advertised capacities before their customers say they’re out of throughput. If customers buy the max config out of the box, they actually end up wasting a lot of disk as a result. (Hear that, customers? Don’t max out your box with storage before you max it out with throughput, OK?) As to who this post is aimed at, I said it right in the post: "if you’re backing up less than 10-20 TB per day (including weekends) from your entire data center, global dedupe is not a problem you need to worry about." If a customer has more than that, though, the lack of global dedupe can cost them real money.

    So since this has to do with throughput and not capacity, using slots and tapes to make your argument is a complete non sequitur. So let me make your argument with tape drives, then show you why it STILL doesn’t apply. If I took your argument and used tape drives instead of tapes, I’d say "OK, so let’s follow this logic through just a little bit. A DL3000 from EMC has 400 MB/s of throughput. That’s roughly 4-5 LTO-4 tape drives (no compression) or 2-3 at 2:1 compression. So, more accurately, not having global dedup is like having a tape library that doesn’t scale past 2-5 tape drives, which is pretty much NOBODY."

    So the first problem with that argument is that all tape library manufacturers make tape libraries with more than 2-5 tape drives. In fact, that’s where most tape libraries START. The largest tape libraries in the world hold HUNDREDS of tape drives. (The STK SL8500 holds 448; the IBM TS3500 holds 192; the QTM Scalar 10k holds 324.) So it’s OBVIOUS (to me, at least) that this kind of throughput is needed by SOME companies.

    The second problem with that argument is that having 100s of tape drives is FINE; but having several dedupe ISLANDS is NOT. Buying another tape drive for additional throughput costs you nothing but that tape drive. Buying an additional local-dedupe system for scaling purposes means buying an addition head AND enough disk for another full copy of your data. (Because you WILL store two full copies of your backups if you have two local-dedupe systems and load-balance across them.)

    And don’t tell me this extra disk only costs an extra 10%. That’s the "there’s only a 5% difference between 10:1 dedupe and 20:1 dedupe" math again; you know I think that’s bogus. Let me explain.

    Suppose a customer needs 2500 MB/s of throughput, and as a result buys five 500 MB/s local-dedupe boxes, and then follows your logic (of "it’s only 10%") and load balances across all five of them. (I realize this is a big stinking customer, but only big stinking customers need global dedupe; I’ve already said that.) At 2500 MB/s, this customer is backing up a little over 100 TB a night (2500 MB/s * 3600 * 12 ≈ 108,000,000 MB, or 108 TB). My experience shows me that in this configuration (distributed weekly full backup with daily incremental), about half of each night’s backup comes from fulls, and half comes from incrementals. If that’s the case, then this fictitious datacenter is a 350 TB datacenter (100 TB * 50% * 7 days = 350 TB). While a datacenter this size was unheard of 10 years ago, it is way too commonplace today.

    Let’s back up this 350 TB environment with a 90 day retention, shall we? 90 days at 100 TB a night means 9000 TB, or 9 PB of backups. (That fits right into what I’ve seen: that for every 1 TB of live disk, I generally see at least 20 TB on tape.) At 20:1 dedupe, that means I need 450 TB of raw disk to hold 90 days of backups for my 350 TB environment. (Hopefully that’s 450 TB of affordable SATA disk, which should be a lot cheaper than my 350 TB of DMX or USP disk.)

    But wait. In order to meet my throughput requirements, I needed to buy five heads and I load balanced across them (because you told me that it would only make a 10% difference). That means that I need to add four additional full copies in there. (Because the first time I back up my 350 TB datacenter to a new member of this local-dedupe array, I’m going to write it in full.) That’s 1400 TB of extra stuff I have to store because I don’t have global dedupe. You’re saying that the difference between 9000 TB and 10400 TB is about 14%. While that is correct, it doesn’t speak to the issue at hand.

    The reality is that I will need to add 350 TB of ADDITIONAL RAW DISK to EACH of the other four nodes in my "array" because I have to start each node with a base copy. I therefore started out needing 450 TB of disk (9000 TB/20); now I need 1850 TB of disk. THAT’S the real difference, and it’s a 400% difference, not 10% or 14%!

    Saying that global dedupe doesn’t make a significant difference to those that need it is just silly.
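
    Putting that arithmetic in one place, here is a rough sketch (the 100 TB/night, 7-day full cycle, 90-day retention, 20:1 ratio, and five heads are the assumptions above, nothing more):

    ```python
    # Recap of the arithmetic above, using the rounded figures from this comment.
    nightly_tb = 100                    # ~2500 MB/s sustained over a 12-hour window
    primary_tb = 350                    # half of each night is fulls, 7-day full cycle
    retained_tb = nightly_tb * 90       # 90-day retention -> 9,000 TB (9 PB) of backups
    dedupe_ratio = 20

    disk_global = retained_tb / dedupe_ratio        # 450 TB with global dedupe
    disk_local = disk_global + 4 * primary_tb       # plus a base copy on the other 4 heads
    print(disk_global, disk_local, round(disk_local / disk_global, 1))  # 450.0 1850.0 4.1
    ```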

  • I know of several companies where they have databases that are 40, 50, 100, even 200 TB. If you want to do a full backup of a 40 TB database in 12 hours, you need over 1100 MB/s!

    Why aren’t those customers configuring their DBMS/backup plug-in to do incremental backups? How much of that 40T really changed in 24 hours? 1GB, 5GB, 10GB? With physical tape, 6 weeks of daily incrementals might be impossible to restore in a small window due to mount/dismount/seek times. With a VTL, the incremental restores should be quite fast. If that full backup is only run once a month, then the machine has a significantly larger window to perform the dedupe operation in an off-line manner.

    It seems to me that many of the dedupe vendors recommend full backups simply to make their numbers look better. If you back up ten 1T incrementals and they get a 2:1 compression ratio, you have stored 5T. If you back up ten 10T fulls and they get 20:1, you have still stored 5T of data. I see some vendors requiring certain backup strategies to assure large dedupe ratios, and it screams of snake oil.

    Are those full backups being done to gain extra redundancy? Do full backups to a dedupe system really give the same level of security that full backups to multiple physical tapes provide? The dedupe system removes all that explicit redundancy down to a single point of failure. If the disk array containing a piece of data common to half the backups fails, then you have lost half of your backups (ignoring heroics).

  • I never said they didn’t do incrementals. But when it’s time for a full, you need the kind of throughput I talked about there.

    I specifically tell users NOT to do things on purpose (like more fulls) just to get better dedupe ratios. It wastes time and resources to gain nothing.

    But… You’d be surprised at how many DBAs “don’t want no stinking incrementals.” They’re still carrying resentment from being burnt on them from 10-15 years ago or something.

  • I am cracking up over the comment about the DBAs “don’t want no stinking incrementals.” Have you been speaking to my DBAs? Or my Unix admins? Or my Exchange admins? Or my SysDev people? It has been such a struggle to change the mindset that incremental backups are evil, especially now that all the data is on disk (VTL) and can be recovered at the drop of a hat. No more waiting for offsite tapes. No more mount/seek/unmount times. The data is on disk. The disk does not care whether it is restoring a single full or the last full and 6 incrementals.

    And now with DeDupe added, the pessimism from these folks has multiplied. Now, not only does everything have to be a full backup to be able to successfully recover from, but they do not want their backup data being sent to this black magic voodoo disk device that warps the data (dedupes it) into some X-Files-like hybrid creature that must be contained underground.

    I know the DBAs still think like this. That is why I have a trillion .bak files in my SQL file system backups even though we are using an SQL agent in our backup software. I could filter them out but I can just see it now when they ask for some .bak’s to be restored. I pick my battles with em.

  • I never said they didn’t do incrementals. But when it’s time for a full, you need the kind of throughput I talked about there.
    None of the local dedupe systems listed in the table in the performance post can handle 1100 MB/s for 12 hours and still dedupe it in 24 hours.

    I guess I wasn’t clear, when I said: “If that full backup is only run once a month, then the machine has a significantly larger window to perform the dedupe operation in an off-line manner. ”

    I tried to point out that it may not be necessary to do the dedupe operation in 24 hours. The full backups could be staggered to give the machine sufficient time to finish the dedupe operation. That 200T might be a once a week or once a month operation. Or in the case of multiple 200T backups they could be scheduled a day or two apart on a larger schedule to give the dedupe operation sufficient time to complete.

  • Aaron,

    And now with DeDupe added, the pessimism from these folks has multiplied. Now, not only does everything have to be a full backup to be able to successfully recover from, but they do not want their backup data being sent to this black magic voodoo disk device

    In some ways, that pessimism is probably warranted. Conceptually, incrementals and dedupe don’t give you a complete snapshot of the database at any given time. Transactional consistency can be very important for some applications, and having a month of incrementals is probably enough to make a lot of folks squeamish. It’s not like there aren’t stories of recovery operations failing, leaving the DBA manually editing rows to reconstruct a database. Having a history of completely standalone full backups scattered all over the planet provides a safer feeling than having a month-old full backup in the vault and the rest of the data sitting on the “black magic voodoo disk device”.

    The other side of the coin is, once your database gets to 100T, hopefully the DBA has a better plan than 100s of full backups scattered everywhere. At that point they probably should be testing recovery scenarios too. It’s not like a configuration error in the backup software has ever caused a full backup to be unrecoverable.

  • The DBAs’ pessimism may at times be warranted, but it should not be driven by paranoia, and too many times that is exactly the case. That is why .bak’s are sitting all over file systems.

    Strings of incrementals and dedupe conceptually do not give you a complete snapshot. That is why dedupe is only a part of an overall data protection strategy. I believe that dedupe increases the value of tape. In this market, many vendors will push that dedupe is a tape replacement. I stand on the opposite side of that. After all, I feel much better about deduping and storing my backup data on disk knowing that it is all reduped and stored as normal on tape.

  • While I do agree with you that global dedup is important, its importance really depends on your priorities. I know for us, replication and the expense of a larger WAN pipe were huge factors in the decision. At the time, several of the companies which globally deduped had no good way to replicate that data offsite. So I don’t care how good my dedup is; I need to get that data offsite, which is where other vendors excel. Today I am very happy with our decision to go with Data Domain, but you are right, we are a shop that moves 7-10 TB a night. Great article; just wanted to get my 2 cents in.