Dedupe ratio DOES matter [Updated 3/2010]

I’ve now seen some vendors saying things to the effect that “once you get 10:1, dedupe ratio doesn’t matter.”  They state that 10:1 saves 90% of disk, and 20:1 saves 95% of disk, so the difference is only 5%; why, then, is everyone so concerned about dedupe ratio?  To this 90%/95% comment I say, “balderdash!”
This reminds me of what my father always said: “figures never lie, but liars always figure.”

Here’s how they come up with their numbers.  If you back up 100 TB using a dedupe ratio of 10:1, you need 10 TB of disk to store it.  If you back up 100 TB with a dedupe ratio of 20:1, you need 5 TB of disk to store it.  The difference between 10 TB and 5 TB is 5% of the original 100 TB.  By that math, the difference between 20:1 and 40:1 is only 2.5%, and so on.  Therefore, they say, why do some vendors use dedupe numbers like 50:1?  There’s only an 8% difference between 10:1 and 50:1!
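
As a minimal sketch of that vendor arithmetic (the function name `disk_saved_pct` is only an illustrative choice, not something from any vendor's tooling), the "savings" figure is just one minus the inverse of the ratio, and the quoted "difference" is a subtraction of two such figures:

```python
def disk_saved_pct(ratio: float) -> float:
    """Percent of the original backup size saved at a ratio:1 dedupe ratio."""
    return (1 - 1 / ratio) * 100


# Savings relative to the ORIGINAL data, which is what the vendors quote.
for ratio in (10, 20, 50):
    print(f"{ratio}:1 saves {disk_saved_pct(ratio):.1f}% of disk")

# The vendors' "difference" is a gap between two savings percentages.
gap = disk_saved_pct(20) - disk_saved_pct(10)
print(f"10:1 vs 20:1: {gap:.1f} percentage points")
```

Because every figure is measured against the original backup size, the gap shrinks as ratios climb, no matter how different the resulting disk purchases are.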

Again I say, “balderdash!”

The reason their math “works” is that they’re comparing the deduped data to the original size of the data.  But that’s not what matters in a competitive situation, which is the scenario in which they are using it.  What matters (when talking dedupe ratio) is how much disk one vendor will need versus how much disk the other vendor will need to hold the same amount of backups.  And if vendor A is getting 10:1 and another vendor B is getting 30:1, then customers using vendor A to store their backups will need to buy three times as much disk as customers using vendor B.  So saying that difference is insignificant is, how should I say… balderdash!
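
The vendor-to-vendor comparison can be sketched the same way. This is a rough illustration under the assumption that both vendors really achieve their stated ratios on the same backups; `disk_needed_tb` and `relative_disk` are hypothetical helper names:

```python
def disk_needed_tb(backup_tb: float, ratio: float) -> float:
    """Disk (TB) needed to hold backup_tb of backups at a ratio:1 dedupe ratio."""
    return backup_tb / ratio


def relative_disk(ratio_a: float, ratio_b: float) -> float:
    """How many times as much disk vendor A needs compared with vendor B."""
    return ratio_b / ratio_a


backup_tb = 100  # the same backups sent to both vendors
vendor_a = disk_needed_tb(backup_tb, 10)
vendor_b = disk_needed_tb(backup_tb, 30)

print(f"Vendor A (10:1): {vendor_a:.1f} TB")
print(f"Vendor B (30:1): {vendor_b:.1f} TB")
print(f"Vendor A needs {relative_disk(10, 30):.1f}x the disk of vendor B")
```

Note that the backup size cancels out of `relative_disk`: the multiple depends only on the two ratios, which is why comparing against the original data hides it.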

How much disk you have to buy is really important.  Disk isn’t free even if you didn’t pay for it.  You’re going to need to provide it power and cooling for its whole life.  Suppose two competing vendors made up for their bad (or good) dedupe ratio by changing the software side of their pricing, so both 400 TB dedupe systems cost $1M.  If one dedupes the data down to 20 TB (20:1), and the other dedupes it down to 40 TB (10:1), that’s 20 more TB of disk you’re going to have to provide power and cooling for.  So again, don’t tell me that doesn’t matter.

I know that other things affect the amount of disk you must buy as well, like whether or not you need a landing zone or cache area, but this post is about the claims from some vendors that dedupe ratio doesn’t matter.  The way I see it, saying dedupe ratio doesn’t matter is another way of saying that you have a bad dedupe ratio.

That’s all I have to say about that.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

9 comments
  • Have you ever noticed that whenever someone declares that “X doesn’t matter”, it’s just before whatever X is reaches up and slaps them upside the head?

    As soon as you believe dedupe ratio doesn’t matter, you’ll get a call from the boss asking why you need MORE disk for backups.

  • Hi Curtis,

    While I generally agree that suggesting that “it doesn’t matter” may be an overstatement, I don’t think your statement about needing 3x the disk is correct.

    Achieving any dedup target depends on redundancy in the data, which usually means repeated backups over time. If I have 100TB of data, the first night’s backup will need something approaching 100TB (likely less with compression and some inherent dedup). Only with subsequent backups will I begin to see the dedup effect, which I can’t REALLY be sure of until after the fact.

    So, in terms of quantity of disk, I’m likely to need some base amount plus an additional amount that will vary depending on my confidence in the dedup-ability. This variance across different vendors is not going to be 3X!

  • Hey, Jim.

    You are correct that it is necessary to do repeated backups to achieve typically advertised rates of dedupe, yes. You actually can get SOME dedupe on the first full backup, but not a whole lot. (Hopefully at least 2:1, as most dedupe vendors do compress.)

    That first full backup of your 100 TB will probably take up about 50 TB, allowing for typical compression and moderate dedupe. And, like you said, after that first full backup is when things really take off.

    What you’re speaking of (in the 3X comment) is the quantity of disk needed to do that first full backup. That’s not what I’m talking about. I’m talking about the amount of disk necessary to store that full backup and all subsequent backups. Consider a 100 TB shop that is backed up to a dedupe system for 90 days. Let’s say that’s 12 full backups and 78 incremental backups, each of which is about 10 TB. That’s 1200 TB (12 * 100) plus 780 TB (78 * 10), for a total of (I’ll round up) 2000 TB, or 2 PB.

    If I store those backups on a system that gets 10:1, I will need to put 200 TB of disk in my dedupe system. If I store them on a system that gets 20:1, I will need to put 100 TB of disk in that system. The difference is 100 TB of RAID 6-protected disk, so something like 125 TB of raw disk.

    That’s a lot of stinking power. Even if the vendor gave you the extra disk "free," you’d be paying for power and cooling on 125 1-TB disks with one, and wouldn’t pay for them with the other.

    So all I’m saying is that dedupe ratio matters, and by a whole lot more than 5%.

  • So, I’m with you on how the math works out, Curtis. Your last post suggests, though, that there is a technology difference between vendors that could mean 20:1 vs. 40:1. Are they really so dramatically different that you could make sizing comparisons regardless of the backup data set? I’ve heard you say several times that dedupe ratios need to be taken with a grain of salt because they do not work off the same data set. If so, this 90%/95% argument seems to be nothing more than a vendor positioning point that’s based on yet more "balderdash".

    On another note, wouldn’t the number of boxes (how big and how small they can be, and global vs. local dedupe) be a more significant factor in upfront sizing as well as capacity growth costs?

  • NO calculations can be done regardless of the backup set. BUT the scenario I’m talking about is where one vendor is claiming 40:1 and another is claiming 20:1 — both to the same customer. IF the customer can achieve both numbers, my point is that it absolutely will change how much disk he/she needs to buy by 100%, NOT 2.5%, which is what these vendors are saying (vendors who are trying to defend against other vendors that are claiming better dedupe numbers).

    The number of boxes is typically determined by throughput, not by capacity. But it should be figured into the cost calculations as well.

  • First, it’s not customers who are saying this; it’s vendors. Second, look at the “difference in disk savings,” as shown on page 6, and compare 20:1 to 100:1. That chart shows that the difference is only 4%. I’m saying that is only true when comparing the deduped data to the original data, not when comparing one deduped product to another. If one product gets 20:1 and another gets 100:1 with the same data, one product will need five times more (or less) disk than the other.

  • Curtis,

    Vendors often twist these numbers.

    Ask them, "What does the ratio actually compare?"

    Most probably they would say "complete/full data backup" against "de-duplicated". But then, only a very small percentage of customers are actually doing daily full backups.

    The algorithm used in de-duplication is not becoming a commodity. It’s just that some vendors present the information in a more marketing-friendly manner.

    How many customers in a forum or blog have actually achieved 300:1 ratios?

    Think about it. I have tried to capture the same discussion here: http://blog.druvaa.com/2009/01/09/understanding-data-deduplication/

  • Yes, I know that different vendors compute the numbers differently, but that is a separate discussion. The only point I was trying to make is that (even when computed the same way) dedupe ratio DOES matter.
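
The 90-day retention example from the comment thread can be worked through in a few lines. This is only a sketch under the thread's stated assumptions (12 fulls of 100 TB, 78 incrementals of 10 TB); the 1.25 raw-to-usable factor is inferred from the "100 TB usable is something like 125 TB raw" RAID 6 estimate, and all names are hypothetical. It uses the unrounded 1,980 TB total, where the thread rounds up to 2,000 TB:

```python
FULL_TB = 100          # size of each full backup
INCR_TB = 10           # size of each incremental backup
FULLS = 12             # full backups retained over 90 days
INCRS = 78             # incremental backups retained over 90 days
RAW_PER_USABLE = 1.25  # assumed RAID 6 overhead (125 TB raw per 100 TB usable)

# Total backup data sent to the dedupe system before any reduction.
total_tb = FULLS * FULL_TB + INCRS * INCR_TB


def usable_disk_tb(total: float, ratio: float) -> float:
    """Usable disk (TB) needed to retain `total` TB at a ratio:1 dedupe ratio."""
    return total / ratio


print(f"Backups retained: {total_tb} TB")
for ratio in (10, 20, 40):
    usable = usable_disk_tb(total_tb, ratio)
    raw = usable * RAW_PER_USABLE
    print(f"{ratio}:1 -> {usable:.1f} TB usable, about {raw:.1f} TB raw")
```

Each halving of the ratio doubles the disk that has to be bought, powered, and cooled, whatever the savings percentage says.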