Customers don't want global dedupe, but they're asking for it without even knowing it

I read with amusement Lauren Whitehouse’s latest blog entry about global dedupe.  I’m not saying that I was laughing at anything she was saying, per se; I totally agree with the whole article.  I was amused by what customers said about global dedupe.

Lauren said that ESG has talked to customers and that global dedupe ranks towards the bottom of their wishlists; they’re concerned about other things.  What they want, she said, was “cost, ease of implementation/use, impact on backup/recovery performance, integration with existing backup processes, and scalability.”

Guess what?  Global dedupe impacts every single one of those things!  This is why the title of this blog post is “Customers don’t want global dedupe, but they’re asking for it without even knowing it.”  Get it now?  Let’s take a look at why the things they’re asking for are impacted by global dedupe.

  • Cost
    • When multiple nodes compare backup data, it gets deduped against a larger pool of data; therefore, you get an increase in your overall dedupe ratio (see the sketch after this list).  It’s not the biggest benefit of global dedupe, but global dedupe does reduce cost by increasing the dedupe ratio.
    • Companies that don’t have global dedupe are forced to build the biggest/fastest appliances, requiring them to ride the crest of the CPU/RAM cost wave.  Companies that offer global dedupe can buy much less expensive CPUs and motherboards.  Their per-node throughput rates may be lower than their competitors’ numbers, but it doesn’t matter, because their nodes all act as one.  Moving data between nodes for scalability and performance reasons doesn’t impact your dedupe ratio or cost.
  • Ease of implementation/use
    • Which is easier to implement/use: a bunch of individual dedupe silos that know nothing of each other, or a single system of multiple nodes that all act as one?  I can tell you from talking to way too many customers that the answer is the latter.  In fact, I know of a very large company whose deployment of target dedupe systems has grown very large because those systems lack global dedupe, and the resulting increase in management cost has caused them to reconsider whether this was a good idea.
  • Impact on backup/recovery performance
    • Some systems are so large that they need to be backed up to multiple target systems in order to meet throughput requirements.  How do you do that if those nodes don’t talk to each other?
  • Integration with existing backup processes
    • If you’re backing up to non-deduped disk or tape, you can essentially treat a bunch of individual tape drives or disk arrays as a single system.  It doesn’t matter what gets sent to what; you’ll get it back regardless.  If you have a multi-node dedupe system with global dedupe, you can treat it the same as you would disk or tape.  However, if you are using a multi-node setup of a dedupe system without global deduplication, you have to introduce a new process of divvying up your backups into multiple subsets that can fit into each node.  Then, when the size of those systems changes, you need another new process for moving things around.  You can’t just point backups to a new system you buy (the way you would with tape, disk, or a global dedupe system); you have to move the old data (perhaps TBs and TBs of it) from the old system to the new one, or you will start from scratch with your dedupe process.
  • Scalability
    • Of course this is about scalability, but not just in terms of how big you can get.  It is true that the high-end servers from companies that don’t offer global dedupe are big enough to meet the needs of many (if not most) customers.  But what if you don’t or can’t buy their high-end product?  Can you scale that system to meet larger needs?  The answer is no, you can’t.  You can buy another one, or you can throw out the one you have and buy another one, but that is not scaling.  That is a forklift upgrade.  Global dedupe is about scalability, too.  Buy what you want/need now, and then scale it to meet your needs as they grow.  If you have global dedupe, that’s easy.  If you don’t, it’s not.
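
To make the cost point concrete, here’s a toy sketch of the arithmetic (made-up chunk labels, not any vendor’s actual dedupe engine): when two mostly-identical backups land on separate silos, each silo stores its own copy of the shared chunks, while a global pool stores them once, so the overall dedupe ratio goes up.

```python
# Toy sketch (not any vendor's real algorithm): cross-node dedupe in one
# shared pool vs. isolated silos. Chunk fingerprints are made-up labels.

backup_a = ["os1", "os2", "app1", "data1"]   # e.g., server 1, sent to node 1
backup_b = ["os1", "os2", "app1", "data2"]   # e.g., server 2, sent to node 2

# Siloed appliances: each node dedupes only against what it has received.
silo_stored = len(set(backup_a)) + len(set(backup_b))   # 4 + 4 = 8 chunks kept

# Global dedupe: every node checks one shared chunk index.
global_stored = len(set(backup_a) | set(backup_b))       # 5 chunks kept

logical = len(backup_a) + len(backup_b)                   # 8 chunks written
print(f"siloed dedupe ratio: {logical / silo_stored:.2f}:1")    # 1.00:1
print(f"global dedupe ratio: {logical / global_stored:.2f}:1")  # 1.60:1
```

Real-world ratios are far higher because of intra-backup and day-over-day duplication; the sketch only isolates the cross-node effect.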

Now you see why I say that customers don’t want it, but they’re asking for it.  I hope this helps.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

4 comments
  • Speaking of which… did you notice that Data Domain announced their global dedupe array today? About darn time for the “de facto dedupe industry leader” to get on board. Makes me wonder how hard the EMC folks have been pushing the DD engineers since the acquisition…

    cheers!

  • I’ll be writing a blog entry ASAP about how I think it’s a good step in the right direction. Short summary: The product they released is good, and I’m very happy that they have released it. My concern is that it meets the needs of a small subset of their customer base — and potential customer base. Only those who can afford two 880s, are using NBU, and are using OST get to take advantage of it today. Hopefully they will open it up soon to a wider audience.

  • Here’s my spin on global dedup. Without a doubt it’s where the future is headed, because it only makes sense from a scalability standpoint. I think it is growing based on demand. Many small companies don’t really need it right now because they aren’t filling more than one box. The larger companies are going to fill up multiple nodes anyway, so why not just buy multiple nodes and segregate the data accordingly? I really have no desire to see how my Oracle data dedups against my 1500 Windows servers. Just keep them separate. Additionally, while dedup is no longer bleeding edge technology, it’s still relatively new without a solid track record in terms of stability. I’m sure there are more than a few CIOs that aren’t ready to put ALL of their data on dedup appliances, so they’re using it in a strategic space. As it proves itself it will continue to grow and global dedup will become a necessity.

  • I appreciate your input. I’m not sure what your affiliations are, but your points are basically Data Domain’s usual talking points, and I believe I explained in the post above why I don’t think they’re valid.

    Having global dedupe allows you to buy what you want (or can afford) now and grow to what you need (or can afford) tomorrow. It also allows vendors that offer it to buy less expensive nodes (rather than the fastest thing Intel has to offer) and pass the savings on. NOT having it requires you to do forklift upgrades when you outgrow what you bought (e.g. replace your DD510 head with a 690, then again with an 880, etc.), or requires you to divvy up your backups into multiple chunks (which you’re recommending).

    Only a person who’s never managed backups would think that this is a good idea (or no big deal). It has nothing to do with comparing Oracle to Exchange. It has to do with managing the “chunks,” what goes where, and what you have to do when one of them outgrows where you put it. I’ve talked to way too many customers that have had to manage this and they HATE IT.