I read with amusement Lauren Whitehouse’s latest blog entry about global dedupe. I’m not saying that I was laughing at anything she was saying, per se; I totally agree with the whole article. I was amused by what customers said about global dedupe.
Lauren said that ESG has talked to customers and that global dedupe ranks toward the bottom of their wish lists; they're concerned about other things. What they want, she said, was "cost, ease of implementation/use, impact on backup/recovery performance, integration with existing backup processes, and scalability."
Guess what? Global dedupe impacts every single one of those things! This is why the title of this blog post is “Customers don’t want global dedupe, but they’re asking for it without knowing it.” Get it now? Let’s take a look at why the things they’re asking for are impacted by global dedupe.
- Cost
- When multiple nodes compare backup data, each backup gets deduped against a larger pool of data; therefore, you get an increase in your overall dedupe ratio. It's not the biggest benefit of global dedupe, but global dedupe does reduce cost by increasing your dedupe ratio.
- Companies that don't have global dedupe are forced to build the biggest/fastest appliances, requiring them to ride the crest of the CPU/RAM cost wave. Companies that offer global dedupe can buy much less expensive CPUs and motherboards. Their per-node throughput rates may be lower than their competitors' numbers, but it doesn't matter, because their nodes all act as one. Moving data between nodes for scalability and performance reasons doesn't impact your dedupe ratio or cost.
- Ease of implementation/use
- Which is easier to implement and use: a bunch of individual dedupe silos that know nothing of each other, or a single system of multiple nodes that all act as one? I can tell you from talking to way too many customers that the answer is the latter. In fact, I know of a very large company with a very large deployment of target dedupe systems that don't have global dedupe, and the increase in management cost this has created has caused them to reconsider whether the deployment was a good idea.
- Impact on backup/recovery performance
- Some backup sources are large enough that they must be sent to multiple target systems in order to meet throughput requirements. How do you do that if the multiple nodes don't talk to each other?
- Integration with existing backup processes
- If you're backing up to non-deduped disk or tape, you can essentially treat a bunch of individual tape drives or disk arrays as a single system. It doesn't matter what gets sent where; you'll get it back regardless. If you have a multi-node system with global dedupe, you can treat it the same way you would disk or tape. However, if you are using a multi-node setup of a dedupe system without global deduplication, you have to introduce a new process of divvying up your backups into multiple subsets that can fit into each node. Then, when the sizes of those subsets change, you need yet another process for moving things around. You can't just point backups at a new system you buy (the way you would with tape, disk, or a global dedupe system); you have to move the old data (perhaps many TBs of it) from the old system to the new one, or you will start your dedupe from scratch.
- Scalability
- Of course this is about scalability, but not just in terms of how big you can get. It is true that the high-end appliances from companies that don't offer global dedupe are big enough to meet the needs of many (if not most) customers. But what if you don't or can't buy their high-end product? Can you scale that system to meet larger needs? The answer is no, you can't. You can buy another one, or you can throw out the one you have and buy a bigger one, but that is not scaling; that is a forklift upgrade. Global dedupe is about scalability, too: buy what you want/need now, and then scale it as your needs grow. If you have global dedupe, that's easy. If you don't, it's not.
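To make the cost point above concrete, here's a toy sketch of why deduping against one shared pool beats deduping in silos. Everything here is invented for illustration (random "fingerprints" standing in for chunk hashes); real dedupe ratios depend on your actual data, but the direction of the effect is the same.

```python
import random

random.seed(42)

# Pretend each backup is a list of chunk fingerprints. Chunks repeat
# heavily across backup sets (OS files, shared databases, etc.).
chunk_pool = [f"chunk{i}" for i in range(1000)]

def make_backup(n_chunks):
    # Draw chunks with repetition from a shared pool to mimic redundancy.
    return [random.choice(chunk_pool) for _ in range(n_chunks)]

backups_node_a = [make_backup(2000) for _ in range(3)]
backups_node_b = [make_backup(2000) for _ in range(3)]

def stored_chunks(backup_sets):
    # A dedupe pool stores each unique chunk exactly once.
    unique = set()
    for b in backup_sets:
        unique.update(b)
    return len(unique)

total_logical = sum(len(b) for bs in (backups_node_a, backups_node_b) for b in bs)

# Siloed: each node dedupes only against its own data.
siloed = stored_chunks(backups_node_a) + stored_chunks(backups_node_b)

# Global: both nodes dedupe against one shared pool.
global_pool = stored_chunks(backups_node_a + backups_node_b)

print(f"siloed dedupe ratio: {total_logical / siloed:.2f}:1")
print(f"global dedupe ratio: {total_logical / global_pool:.2f}:1")
```

Any chunk that lands on both silos gets stored twice in the siloed design and once in the global one, so the global ratio can only be equal or better.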
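And here's a toy sketch of the integration point: the divvying-up process that a silo design forces on you. The backup-set names, sizes, and the greedy first-fit assignment below are all hypothetical; the point is that this mapping exists at all, has to be maintained, and has to be redone as backups grow.

```python
# Hypothetical sketch: without global dedupe, backup sets must be manually
# partitioned across nodes, and re-partitioned when they outgrow a node.
# All names and sizes are made up for illustration.

node_capacity_tb = 100

backup_sets = {        # backup set -> approximate size in TB
    "exchange": 40,
    "fileservers": 55,
    "databases": 70,
    "vmware": 30,
}

def assign_to_silos(sets, capacity):
    """Greedy first-fit-decreasing: the kind of divvying up a silo design requires."""
    nodes = []  # each node: {"used": TB consumed, "sets": assigned backup sets}
    for name, size in sorted(sets.items(), key=lambda kv: -kv[1]):
        for node in nodes:
            if node["used"] + size <= capacity:
                node["used"] += size
                node["sets"].append(name)
                break
        else:
            # Nothing fits: buy/deploy another silo.
            nodes.append({"used": size, "sets": [name]})
    return nodes

silos = assign_to_silos(backup_sets, node_capacity_tb)
for i, node in enumerate(silos):
    print(f"node {i}: {node['sets']} ({node['used']} TB)")

# With global dedupe there is no mapping to maintain: every backup can be
# pointed at the single logical system, just as with tape or plain disk.
```

Every time a backup set grows past its node's capacity, this assignment has to be recomputed and data has to move, which is exactly the new operational process described above.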
Now you see why I say that customers don't want global dedupe, but they're asking for it anyway. I hope this helps.
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.