Note to Readers: I have updated thoughts on this topic and you can see them here.
My previous post on dedupe performance illustrated the impact that global dedupe has on the effective performance of different dedupe appliances. I received a lot of comments from vendors that didn’t have global dedupe saying one of two things. One thing they would say is that the vendors that claimed to have global dedupe didn’t really have it. I know too much to believe that. The other thing they’d say is that global dedupe wasn’t as important as I was making it out to be. Well, that’s definitely not the case, and that’s what this post is all about.
First let me say that, despite what some vendors seem to think, global dedupe is real (FalconStor, IBM, NEC, SEPATON all have it) and if a vendor really has it, then their 2, 4, or 8 systems really do behave like one from a dedupe perspective. If you don’t have it (Data Domain, EMC, Exagrid, Quantum don’t have it yet), then you cannot add your systems together and call them one system (from a dedupe perspective). Unfortunately, looking at systems this way means cutting some vendors’ advertised numbers in half (EMC 4000 & NetApp VTL) or to 1/16th (Data Domain DDX). Data Domain, NetApp, and I speak all the time and I believe they understand what they have and don’t have, and they understand my position on their numbers. EMC and I, on the other hand, have talked, but haven’t talked about the global dedupe “controversy,” so it’s no surprise that some of their employees are reacting strongly to a post that cuts their advertised numbers in half, and multiplies other vendors’ numbers by eight.
And now the good news: if you’re backing up less than 10-20 TB per day (including weekends) from your entire data center, global dedupe is not a problem you need to worry about. 10-20 TB a night requires a throughput of 230 to 460 MB/s if you have a 12-hour backup window. If you consult the table in the performance post, you’ll find that almost all of the vendors can handle that level of throughput with a single node, so a single appliance will do the job. However, if you’re backing up significantly more than 10-20 TB every night, global dedupe is a required feature, in my opinion. Let’s talk about why.
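If you want to check that arithmetic against your own environment, here is a minimal Python sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a perfectly even load across the whole window, which real backups rarely manage; the function name is just something I made up for illustration.

```python
def required_throughput_mb_s(nightly_tb: float, window_hours: float) -> float:
    """Sustained MB/s needed to move nightly_tb (decimal TB) within window_hours."""
    megabytes = nightly_tb * 1_000_000  # assumption: 1 TB = 1,000,000 MB
    seconds = window_hours * 3600
    return megabytes / seconds


for tb in (10, 20):
    rate = required_throughput_mb_s(tb, 12)
    print(f"{tb} TB in 12 hours needs about {rate:.0f} MB/s")
# 10 TB in 12 hours needs about 231 MB/s
# 20 TB in 12 hours needs about 463 MB/s
```

Plug in your own nightly volume and window; if the answer is comfortably below what a single node can ingest and dedupe, the rest of this post is less urgent for you.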
First let me say what the reason is not. It’s not that vendors with global dedupe will get better dedupe ratios by comparing everything to everything. In other words, I’m not saying that FalconStor, IBM, or SEPATON will get a higher dedupe ratio because they compare the blocks in Oracle to the blocks in Exchange or the filesystem. While there may be a few common blocks between some of them, that is not where the bulk of duplicate data comes from. Most duplicate data is found between repeated backups of the same object. For example, there will obviously be many duplicated blocks of data in two full backups of Exchange, Oracle, any database, or any filesystem. There will also be many duplicate blocks between multiple incremental backups of an object, as incremental backups are rarely block-level incremental backups; they back up an entire database extent if one block has changed, or back up an entire file if you changed one block in it. Therefore, as long as you’re always comparing the latest backup of a given Exchange database with the previous backup of that same Exchange database, you should eliminate most duplicate data.
I can hear the people in
The problem is that this asks the customer to do something they never had to do before. With tape or non-deduped disk, they simply sent the backups to any device they could find; they didn’t need to worry about where the backups went. It only mattered that the backup software product knew where they went. You could load-balance across as many devices and as many types of devices as you needed to use with no ill effects.
Now they have to divide their backups into equal-sized portions and send each portion to a single device. The first challenge with that is that many customers don’t know how big their backups are. They usually can tell you that they use eight tapes per night, and that this roughly translates into two terabytes of backups. How big is each backup? They might know how big some of the backups are, such as that one really large database they have to back up. But the size of most of their backups will be a mystery, and solving that mystery takes a lot of research. They either need to be good at querying their backup product and interpreting the results (this can actually be quite hard), or they need an advanced reporting tool, something many environments don’t have.
The next challenge is what I’ll call related backups. While there aren’t a lot of duplicate blocks between Oracle and Exchange, there may be duplicate blocks between several different Exchange databases, or several different related filesystems. The best thing from a dedupe perspective (if you don’t have global dedupe) is to make sure all Exchange backups go to the same node. What if they won’t fit on the node, either capacity-wise or throughput-wise? Splitting those related backups across two nodes will decrease your dedupe ratio, which is another way of saying it will increase your cost.
Let’s suppose that you are able to properly group together related backups, and to create several equal-sized backup portions to divvy up amongst several dedupe appliances. Wouldn’t local dedupe be OK then? Sure – for a while. But things never remain the same. Systems grow at different rates, and at some point some systems will become too big to fit into a single dedupe appliance. This means that the difficult process of creating equally sized backup portions is an ongoing one, adding to the operational cost of your systems. If you had global dedupe, there would be no need for this process during deployment – and certainly no need to repeat it on a regular basis.
Availability is the next challenge because you’re pointing backups to a single device. What happens if that device becomes unavailable? Your backups stop, that’s what happens. You could load-balance your backups across two devices to solve this problem, but then you would ruin your dedupe ratio. If one night a given server is backed up to one node, and the next night it’s backed up to another node, its data will get stored twice — exactly the thing you didn’t want to happen. This decreases your dedupe ratio (AKA increases your cost). Therefore, you’re stuck with pointing your backups to a single device and having no failover or load balancing.
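To make the "stored twice" problem concrete, here is a toy Python model. Everything in it is an assumption for illustration: each node is reduced to a set of SHA-256 chunk hashes, chunks are a fixed 8 KB, and the server's data doesn't change from night to night. Real products are far more sophisticated, but the bookkeeping works the same way.

```python
import hashlib

CHUNK = 8192  # assumed fixed chunk size, in bytes


def chunk_hashes(data: bytes) -> set:
    """Hash every fixed-size chunk of a backup stream."""
    return {
        hashlib.sha256(data[i:i + CHUNK]).hexdigest()
        for i in range(0, len(data), CHUNK)
    }


def stored_chunks(backups, route, shared_index: bool) -> int:
    """Total unique chunks kept on disk across all nodes.

    route(night) names the node that receives that night's backup.
    shared_index=True models global dedupe (one index for every node);
    False models local dedupe (each node only knows its own chunks).
    """
    indexes = {}
    for night, data in enumerate(backups):
        key = "global" if shared_index else route(night)
        indexes.setdefault(key, set()).update(chunk_hashes(data))
    return sum(len(index) for index in indexes.values())


# One server with 100 distinct chunks, backed up unchanged for four nights.
server = b"".join(i.to_bytes(4, "big") * (CHUNK // 4) for i in range(100))
backups = [server] * 4


def pinned(night):
    return "node-A"


def alternating(night):
    return "node-A" if night % 2 == 0 else "node-B"


pinned_local = stored_chunks(backups, pinned, shared_index=False)
alternating_local = stored_chunks(backups, alternating, shared_index=False)
alternating_global = stored_chunks(backups, alternating, shared_index=True)

print(pinned_local, alternating_local, alternating_global)  # 100 200 100
```

With local dedupe, alternating between two nodes doubles what gets stored; with a shared index, it doesn't matter which node takes the backup.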
The bigger your servers are, the more important load-balancing becomes. In fact, it can actually save you money. Suppose you had seven 20-TB servers to back up and a 12-hour backup window. Assuming you were doing a weekly full backup, the first thing you would do is schedule a full backup of one server every night, and you would want to do an incremental backup of the other six servers each night as well. A 5-10% incremental would create an additional load of 1-2 TB per server per night, for a total of 6-12 TB of additional backup traffic. Added to the 20-TB full, that creates a total load of 26-32 TB per night, requiring a throughput of roughly 600-740 MB/s for a 12-hour backup window.
Now suppose your dedupe appliance can only handle 500 MB/s. What must you do? If it were a non-dedupe device, you would just buy a second one and load-balance across the two, right? You’d have a total throughput capacity of 1000 MB/s and would have room to spare. The same would be true if you had two nodes of a dedupe system with global dedupe. However, if you don’t have global dedupe, you’d have to buy seven nodes instead of two. What? You can’t load balance, and you can’t send your full backups to one and your incremental backups to another because the dedupe ratio would be significantly reduced. What you need is to use a separate dedupe appliance for each server. If you then scheduled things so one full backup ran per night, what you’d have is one box being used to capacity (500 MB/s), and six boxes that were barely breaking a sweat, as they would only receive 1-2 TB of backups per night. What a waste.
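The two-versus-seven comparison can be sketched in a few lines of Python. This is only a model of the argument above, under stated assumptions: decimal units, a 500 MB/s node, and (for local dedupe) the simplification that each server is pinned to its own node and that node must be sized for the server's biggest night, its full backup. The function names are hypothetical.

```python
import math


def throughput_mb_s(tb: float, window_hours: float) -> float:
    """Sustained MB/s needed to move tb (decimal TB) within window_hours."""
    return tb * 1_000_000 / (window_hours * 3600)


def nodes_with_global_dedupe(nightly_tb, window_hours, node_mb_s):
    """Any node can take any backup, so size for the combined nightly load."""
    return math.ceil(throughput_mb_s(nightly_tb, window_hours) / node_mb_s)


def nodes_with_local_dedupe(peak_tb_per_server, window_hours, node_mb_s):
    """Each server is pinned to its own node(s), sized for its biggest night."""
    return sum(
        math.ceil(throughput_mb_s(tb, window_hours) / node_mb_s)
        for tb in peak_tb_per_server
    )


WINDOW_HOURS = 12
NODE_MB_S = 500

# Worst-case night: one 20 TB full plus six 2 TB incrementals.
worst_night_tb = 20 + 6 * 2

global_nodes = nodes_with_global_dedupe(worst_night_tb, WINDOW_HOURS, NODE_MB_S)
local_nodes = nodes_with_local_dedupe([20] * 7, WINDOW_HOURS, NODE_MB_S)

print(global_nodes, local_nodes)  # 2 7
```

Try changing the server count or the node's rated throughput; the gap between the two numbers is the hardware you are buying just to keep related backups together.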
The next, albeit rare, challenge is giant servers. What happens when the backup of a single server needs more throughput or capacity than a single node can provide? I know of several companies where they have databases that are 40, 50, 100, even 200 TB. If you want to do a full backup of a 40 TB database in 12 hours, you need more than 900 MB/s, and over 1100 MB/s if the usable window is closer to 10 hours! If you had global dedupe, you could do that easily by spreading the load across multiple nodes. Its backups would find each other and they would get deduped together. Try that with local dedupe. None of the local dedupe systems listed in the table in the performance post can handle 1100 MB/s for 12 hours and still dedupe it in 24 hours. Again, while this is much rarer than the other challenges discussed here, it is an insurmountable challenge for those who have it.
One counter-argument from some vendors that don’t have global dedupe is to mention that they (DD, EMC, Quantum) use the hash-based approach that compares everything to everything, where other vendors (IBM, SEPATON) use a delta-differential approach that only compares like backups to each other. For example, it compares the latest backup of Exchange to the previous backup of Exchange. Therefore, these vendors say, they will find duplicate data that the other vendors won’t find (because they’re comparing everything to everything). Therefore, while they don’t have global dedupe, they do have better dedupe. Short answer: not so fast, Hoss.
First, not every global dedupe vendor uses a delta-differential approach; FalconStor is also a hashing vendor. Second, I refer you to the earlier discussion in this blog post that most duplicate data is found when comparing like backups to each other; very little redundant data is found when you compare dissimilar backups to each other. Third, delta differential vendors get better dedupe on the backups they do compare, as the delta differential approach looks at things at a deeper level of granularity. Hashing vendors usually use a chunk size of 8 KB or larger. What happens if just a few blocks in that 8 KB change? A hashing vendor would see the entire 8 KB as a new chunk, as its hash would have changed. A delta differential vendor, on the other hand, would be able to identify the unique blocks in the 8 KB and store only them. Get it?
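Here is a small Python sketch of that granularity point. It is a deliberately crude model, not any vendor's actual algorithm: the hashing side uses SHA-256 over one 8 KB chunk, and the delta side is approximated by comparing the old and new chunk in 512-byte blocks (the 512-byte figure is my assumption, picked only to show the effect).

```python
import hashlib

CHUNK = 8192  # hashing chunk size discussed above
BLOCK = 512   # assumed finer granularity for the delta comparison

old = bytes(CHUNK)               # yesterday's 8 KB chunk (all zeros)
new = bytearray(old)
new[1000:1010] = b"x" * 10       # today: ten bytes changed inside one block
new = bytes(new)

# Hash-based dedupe: any change produces a different hash, so the whole
# 8 KB chunk looks new and is stored again.
hash_changed = hashlib.sha256(old).digest() != hashlib.sha256(new).digest()
hash_bytes_stored = CHUNK if hash_changed else 0

# Delta-style comparison: keep only the blocks that differ from the
# previous version of the same chunk.
changed_blocks = [
    offset
    for offset in range(0, CHUNK, BLOCK)
    if old[offset:offset + BLOCK] != new[offset:offset + BLOCK]
]
delta_bytes_stored = len(changed_blocks) * BLOCK

print(hash_bytes_stored, delta_bytes_stored)  # 8192 512
```

In this toy case the hashing approach stores sixteen times as much for the same ten-byte change. Real-world results depend on chunking strategy and how changes are distributed, so treat the ratio as an illustration only.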
One vendor suggested that their lack of global dedupe was offset by their system cost, which was much lower than the other vendors’ systems. That may be a reason for you to consider them over a vendor that has global dedupe. I would caution you to remember that you purchase the system once, but you use it forever. Therefore, remember to consider the savings in operational costs that global dedupe brings. (Take a second look at the paragraph that starts with “Let’s suppose…” for an example of the difference in operational cost.)
Another thing to consider when thinking about global dedupe is the install base of the product. Some of the global dedupe products are very new and/or have very few customers. On the opposite extreme you have Data Domain with local dedupe and around 3000 customers. IBM/Diligent has hundreds of customers, but I’m not sure how many of them are running the two-node cluster with global dedupe. Quantum and EMC have also done quite well in a very short time selling their system that doesn’t have global dedupe. Exagrid also has several hundred customers. If you took all the customers of FalconStor, NEC, & SEPATON that are using a multi-node global dedupe product and added them together, I’m not sure if you’d break 100. (In defense of those vendors, though, the systems they have sold tend to be much larger than the ones sold by other vendors. In addition, it’s quite difficult to sell against the juggernaut that is the Data Domain sales machine. While their products may not have global dedupe, their sales people are relentless, and they’re the acknowledged market leader. I “pity the fool” that has to sell against those guys.) Having said all that, if you need global dedupe, you need it. The fact that you’re one of the first to use it shouldn’t stop you – assuming you test it and it works.
Summary: if you’re backing up less than 10-20 TB a night, don’t worry about global dedupe for now. If you’re backing up significantly more than 10-20 TB a night, there are many reasons why you should only choose a vendor with global dedupe and few reasons for choosing one that doesn’t have it.
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technologist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.