How to REALLY analyze dedupe ratios and their impact on cost savings

I’m surprised that Dipesh Patel of CommVault joined the “dedupe ratio doesn’t matter” chorus with his blog post “How to analyze dedupe ratios and its impact on cost savings.”  I’ve met him and know him to be an intelligent person, so I’ll post this and hope for the best.

The basic argument in Dipesh’s blog (and other blogs before him) is that since a 10:1 dedupe ratio reduces data by 90% and a 20:1 dedupe ratio reduces it by 95%, the incremental savings between the two is only 5%.  Dipesh says that vendors that argue “that a doubling of dedupe ratios is a doubling of savings” are using “sleight of hand.”

I argue exactly the opposite, and I did so in a blog post over a year ago: it is the vendors using the incremental-savings argument who are employing sleight of hand.  The question in the end is how much disk you will have to buy, manage, power, and cool — and I believe that the manage, power, and cool parts of that equation are extremely important and are completely absent from Dipesh’s calculations.

A customer that is able to dedupe their data at 10:1 will buy twice as much disk as one that is able to dedupe that same data at 20:1.  (If you have 100 TB of backups and you dedupe it at 10:1, you need 10 TB of disk.  If you dedupe it at 20:1, you need 5 TB of disk.)  That’s twice as much disk to manage (monitor, replace on failure, etc.), power, and cool.  The IT department is the largest part of most companies’ power bill, and the storage department is often the largest part of the IT department’s power bill.  Since a backup system typically holds 10-20 GB for every 1 GB on primary storage, I argue that the backup system’s disk power bill could well be the biggest percentage (backups) of the biggest percentage (storage) of the biggest percentage (IT) of the power bill.  Cutting that in half (or not) is kind of a big deal.
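
To put some numbers on this, here is a minimal back-of-the-envelope sketch. The 100 TB figure comes from the example above; the per-TB cost and power figures are made-up placeholders, not quotes or measurements, and you would substitute your own.

```python
# Back-of-the-envelope comparison of two dedupe ratios.
# BACKUP_DATA_TB comes from the example above; COST_PER_TB and
# WATTS_PER_TB are hypothetical placeholders, not vendor numbers.

BACKUP_DATA_TB = 100   # logical backup data to store
COST_PER_TB = 2000     # hypothetical $ per usable TB of disk
WATTS_PER_TB = 10      # hypothetical power draw per TB of disk

def footprint(ratio):
    """Physical disk needed to hold the backups at a given dedupe ratio."""
    return BACKUP_DATA_TB / ratio

for ratio in (10, 20):
    disk = footprint(ratio)
    saved_pct = 100 * (1 - disk / BACKUP_DATA_TB)
    print(f"{ratio}:1 -> {disk:.0f} TB of disk "
          f"({saved_pct:.0f}% saved vs. raw), "
          f"~${disk * COST_PER_TB:,.0f} of disk, "
          f"~{disk * WATTS_PER_TB:.0f} W to power")

# Moving from 10:1 to 20:1 halves everything you still have to buy,
# power, and cool -- a 50% reduction in footprint, not 5%.
```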

Update: Jay Livens of SEPATON posted his thoughts on this subject on his aboutrestore.com blog.  In addition to the power and cooling costs I discussed, he pointed out that the same thing can be said about replication costs and bandwidth.

So much for the incremental savings argument.

Dipesh’s blog post also makes the argument that most vendors save about the same amount of disk and that the characteristics of your data are what really determine your dedupe ratio.  While I completely agree with the latter, I do not agree with the former unless we put heavy emphasis on the word most.  I do argue that most of the time it is not the vendor you buy that determines your dedupe ratio; it is the characteristics of your data and how you back it up.  Having said that, I have seen scenarios where one vendor got 100 times more dedupe than another vendor — with the same data!  This is why I think you should always test more than one vendor when evaluating dedupe solutions.

Since this blog post is talking about calculating costs, I think it’s important to point out that most customers are using something other than CommVault Simpana.  Why is that important?  CommVault argues that its dedupe is superior to the target dedupe solutions from Data Domain, SEPATON, Quantum, IBM & Exagrid.  They are able to do things with their dedupe solution that the target vendors cannot (such as encrypt & compress data before sending it over the network).  But unlike with the target dedupe vendors, you have to switch your backup software from whatever you’re using to CommVault in order to get the benefits they’re offering.  That conversion comes at a huge cost and risk: the initial purchase, education classes, possible professional services for installation, and hours spent poring over new manuals and on support calls to understand your new backup solution — all while your backup system goes through a possibly very long period of instability.  I don’t care how good a backup software product is; the above things are going to happen.  You may still feel that the change is worth the cost and the risk — just make sure that you include all of these costs in your TCO analysis.
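
As a rough illustration of what folding conversion costs into a TCO comparison might look like, here is a minimal sketch. Every line item and dollar figure below is a hypothetical placeholder for illustration only; plug in your own quotes and estimates.

```python
# Hypothetical TCO comparison: keep your current backup software and add
# target dedupe, vs. switch backup products to get integrated dedupe.
# All figures are made-up placeholders, not real pricing.

switch_costs = {
    "new backup software licenses": 150_000,
    "education classes": 20_000,
    "professional services for installation": 30_000,
    "staff hours on manuals and support calls (loaded cost)": 40_000,
    "instability/risk buffer during conversion": 25_000,
}

stay_costs = {
    "target dedupe appliance": 180_000,
    "existing backup software maintenance": 30_000,
}

def total(costs):
    return sum(costs.values())

print(f"Switch backup software: ${total(switch_costs):,}")
print(f"Keep it and add target dedupe: ${total(stay_costs):,}")
# Whichever way the numbers fall for you, the point is that the
# conversion line items above belong in the comparison.
```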

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

9 comments
  • Thanks for the post! Timely for me.

    Shailen Patel from Commvault (any relation to Dipesh Patel?) has been out to see me a couple of times lately because I am thinking about testing Simpana alongside Networker 7.5. Networker can’t dedupe without Avamar, and we’d rather use our own disk than an appliance right now. Networker VCB is a pain. From what I’ve seen of its non-deduped backup to disk, you can clone a backup, which doesn’t purge the disk afterwards, or you can stage it to tape and purge it, but the tapes it stages to are not browseable. Only what is on disk is browseable for the retention period set for the client. It seems unable to handle knowing that the save sets were moved off to tape, and that you would want your retention policy to transfer to the tapes it staged to. Because of this, staged tapes have to be cataloged for any and all recovers. Since we can’t dedupe without Avamar, it seems worthwhile to take Simpana for a test drive.

    You are so right about changing backup solutions. It took forever to get Networker to be running stable and smooth. It better be fantastic to make me change after all that trouble.

  • VCB is a pain everywhere, IMHO. It’s already been EOL’d by VMware, so I wouldn’t change backup software products just to fix that problem. Also, NetWorker can dedupe just fine to a Data Domain box (or any target dedupe box). I don’t want to sound like I’m anti-CommVault here, cause I’m not. (I am anti-changing-backup-solutions unless there’s no alternative.) But CommVault is basically target dedupe. I wrote a post about that, too. (I’m sure it’ll be source dedupe soon, but right now the data is deduped at the media agent level.)

    There’s nothing wrong with taking it for a test drive, though. I’m never against that.

  • Curtis, while your math is accurate, I think you take a glass-half-empty approach to the subject. Take your example:

    If you have 100 TB of backups and you dedupe it at 10:1, you need 10 TB of disk. If you dedupe it at 20:1, you need 5 TB of disk.

    True enough. But the point is that even at 10:1, I’m saving 90 TB of disk. Sure, 10 TB is twice as much as 5 TB, but it’s still 90 TB saved vs 95 TB saved. The difference in the savings is much smaller than the difference in the resulting disk footprint. A 90% savings is still huge any way you slice it.

    That being the case, I don’t think a product decision should be made on a 10:1 vs 20:1 basis only. You can easily lose that extra “value” if the product doesn’t work for you operationally.

  • What matters in the end is how much disk you have to buy, power, and cool. And the difference between 10:1 and 20:1 is a 2X decrease (or increase) in that amount. Who cares how much you have saved so far; what matters is that if you buy the 10:1 system you will need to buy twice as much disk — and replicate twice as many backup bytes — as if you bought the 20:1 system. It doesn’t matter that you already saved 90%; what matters is what happens NEXT. Your incremental savings are 50%, not 5%.

    And anyone who says to buy a product simply due to its dedupe ratio is an idiot, so you won’t hear me saying that. But what I am saying is that the main point of the second half of Dipesh’s argument is wrong. Dedupe ratio totally matters.

  • When I first started looking into dedup solutions three years ago, the first feature the vendors hit me with was dedup ratio: “Ours is better than theirs.” “No, ours is better.” Once I started learning more and more, IMO the single best feature of some dedup solutions is replication. A good replication scheme will trump or enhance the best dedup ratio. I agree that a better dedup ratio results in less disk overhead, but if you’re a large site (or any size site, for that matter) with DR considerations and you can eliminate making tapes and start sending dedup data to remote sites, well, you’ve saved more than the difference between 90 and 95% dedup.
    The bottom line is that the calculations of dedup’s benefits are endless.

  • A good dedupe ratio will also reduce the amount of data that has to be replicated. Remember, I’m not arguing that dedupe ratio is everything; I’m simply arguing that it’s something, and an important something. Since I’ve seen that some products are also better at replicating than others, I’ll say that replication is also an important something. But nobody’s arguing it’s not. 😉

  • We have all kinds of disk benchmarks. What about a dedup benchmark? Not ones provided by the dedup vendors themselves, but a real, objective one like Storage Performance. Even then, I don’t like the Storage Performance benchmark, because they test SAN storage with one server; shouldn’t it be at least 20 to 50 servers to replicate real-life IT? And with various application workloads like Exchange, file servers, and databases, not a canned benchmark with one test pattern at a time.

    I find good, objective benchmarks disappearing from the IT industry. The ones I see are funded by vendors in 99% of cases. Reading the fine print reveals that.

    It is maybe time to start a new Gartner, IDC, or Forrester type firm that is NOT vendor-biased and vendor-funded…possible…maybe not.

  • We recently (past 6 months) bought and implemented Avamar, with a clean-sweep removal of NetBackup and TSM, replaced by Networker. Implementing Networker was a BEAR, and even though EMC has had Avamar in its portfolio for 3 years, the integration was kind of shaky.

    We also are just now purchasing Data Domain boxes to add/complete our backup model. I expect that implementation to be much easier now that Networker is mostly stabilized.

    In hindsight, I don’t necessarily believe that any of the components we bought are the “best in class” for their particular niche. However, our number one objective was to have the flexibility of source and target dedupe under a single pane of glass.

    As Mr. Preston mentioned, the type of data you’re trying to dedupe is arguably the biggest success factor. We added the location of the data as another success factor, and we wanted the flexibility to dedupe enterprise-wide with the ability to choose where that deduplication occurred.

    The biggest tip I could share for anyone evaluating or looking to purchase is to not let the sales team set your pain points. If you let them, they will throw out a self-proclaimed “flaw” in a competitor’s product and launch an entire campaign to educate you on the evils and afflictions that flaw will cause. I think all the laid-off Y2K marketing staff found homes at the dedupe vendors.

    For us, it came down to the simple fact that we were replacing a traditional backup solution to tape, backing up a secondary copy of production data, and we have very minimal retention requirements. Based on that, the differences between “spill and fill”, “global dedupe”, “Moore’s law” and all the other marketing buzzwords were of minimal importance because every product we looked at was faster than tape and used a smaller footprint than D2D.

    Finally, just like buying a car, the longer you’re willing to keep driving your trade-in, the better the pricing will be, especially at the end of a month or quarter.

  • I have read many items about deduplication ratios and the arguments back and forth. What people really need to look at is how a product fits the way your backup software works and your downtime windows. Some technologies need downtime for the cleanup and deduplication process, but if you are in a TSM environment you don’t have those windows. Also, with more companies going global there are no downtimes, so you need to take that into account if you use Netbackup or Commvault. Most people I talk to are using a combination of two or three products for their needs. I know this gives you some extra management overhead, but compared with the issues a single solution would give you, it is actually a time saver. Before purchasing a product, make sure you understand how it dedups and how you are managing your data, or you will be burned.