


Written by W. Curtis Preston
Friday, 09 January 2009 18:19
I've now seen some vendors saying things to the effect that "once you get 10:1, dedupe ratio doesn't matter." They state that 10:1 saves 90% of disk, and 20:1 saves 95% of disk, so the difference is only 5% -- so why is everyone so concerned about dedupe ratio? To this 90%/95% comment I say, "balderdash!"
This reminds me of what my father always said: "figures never lie, but liars always figure."
Here's how they come up with their numbers. If you back up 100 TB using a dedupe ratio of 10:1, you need 10 TB of disk to store it. If you back up 100 TB with a dedupe ratio of 20:1, you need 5 TB of disk to store it. The difference between 10 TB and 5 TB when backing up 100 TB is 5%. By that math, the difference between 20:1 and 30:1 is only about 1.7%, and so on. Therefore, they say, why do some vendors use dedupe numbers like 50:1? There's only an 8% difference between 10:1 and 50:1!
Again I say, "balderdash!"
The reason their math "works" is that they're comparing the deduped data to the original size of the data. But that's not what matters in a competitive situation, which is the scenario in which they are using it. What matters (when talking dedupe ratio) is how much disk one vendor will need versus how much disk the other vendor will need to hold the same amount of backups. And if vendor A is getting 10:1 and another vendor B is getting 30:1, then customers using vendor A to store their backups will need to buy three times as much disk as customers using vendor B. So saying that difference is insignificant is, how should I say... balderdash!
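To make the two ways of counting concrete, here is a minimal Python sketch. The function names (`percent_saved`, `disk_needed_tb`, `disk_multiple`) are made up for illustration; the only inputs are the 100 TB backup size and the ratios used above.

```python
def percent_saved(ratio):
    """Percentage of the original data size that dedupe eliminates."""
    return (1 - 1 / ratio) * 100


def disk_needed_tb(backup_tb, ratio):
    """Disk (in TB) needed to hold backup_tb of backups at a given ratio."""
    return backup_tb / ratio


def disk_multiple(ratio_a, ratio_b):
    """How many times as much disk vendor A needs as vendor B."""
    return ratio_b / ratio_a


if __name__ == "__main__":
    backup_tb = 100  # the 100 TB example from the text
    for ratio in (10, 20, 30, 50):
        print(f"{ratio}:1 saves {percent_saved(ratio):.1f}% "
              f"and needs {disk_needed_tb(backup_tb, ratio):.1f} TB")
    # The competitive comparison: 10:1 (vendor A) versus 30:1 (vendor B)
    print(f"Vendor A needs {disk_multiple(10, 30):.0f}x the disk of vendor B")
```

The "percent saved" column barely moves once you pass 10:1, while the disk you actually have to buy keeps shrinking in proportion to the ratio.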
How much disk you have to buy is really important. Disk isn't free even if you didn't pay for it. You're going to need to provide it power and cooling for its whole life. Suppose two competing vendors made up for their bad (or good) dedupe ratio by changing the software side of their pricing, so both 400 TB dedupe systems cost $1M. If one dedupes the data down to 20 TB (20:1), and the other dedupes it down to 40 TB (10:1), that's 20 more TB of disk you're going to have to provide power and cooling for. So again, don't tell me that doesn't matter.
I know that other things affect the amount of disk you must buy as well, like whether or not you need a landing zone or cache area, but this post is about the claims from some vendors that dedupe ratio doesn't matter. The way I see it, saying dedupe ratio doesn't matter is another way of saying that you have a bad dedupe ratio.
That's all I have to say about that.
Comments
Vendors often twist these numbers.
Ask them, "What does the ratio actually compare?"
Most probably they would say a "complete/full data backup" against the "de-duplicated" data. But then, only a very small percentage of customers actually do daily full backups.
The algorithm used in de-duplication is not becoming a commodity. It's just that some vendors present the information in a more marketing-friendly manner.
How many customers on any forum or blog have actually achieved 300:1 ratios?
Think about it. I have tried to capture the same discussion here: http://blog.druvaa.com/2009/01/09/understanding-data-deduplication/
www.snia.org/forums/dmf/knowledge/white_papers_and_reports/Understanding_Data_Deduplication_Ratios-20080718.pdf
I suspect that people are saying that relatively low space reduction ratios indicate significant space savings -- not that the dedupe ratio doesn't matter. The arithmetic is pretty straightforward: Space Reduction Percentage = 1 - (1 / Space Reduction Ratio)
The number of boxes is typically determined by throughput, not by capacity. But it should be figured into the cost calculations as well.
On another note, wouldn't the number of boxes (how big and how small they can be, and global vs. local dedupe) be a more significant factor in upfront sizing as well as capacity growth costs?
You are correct that it is necessary to do repeated backups to achieve typically advertised rates of dedupe, yes. You actually can get SOME dedupe on the first full backup, but not a whole lot. (Hopefully at least 2:1, as most dedupe vendors do compress.)
That first full backup of your 100 TB will probably take up about 50 TB, allowing for typical compression and moderate dedupe. And, like you said, after that first full backup is when things really take off.
What you're speaking of (in the 3X comment) is the quantity of disk needed to do that first full backup. That's not what I'm talking about. I'm talking about the amount of disk necessary to store that full backup and all subsequent backups. Consider a 100 TB shop that is backed up to a dedupe system for 90 days. Let's say that's 12 full backups, and 78 incremental backups, each of which is about 10 TB. That's 1200 TB (12 * 100) plus 780 TB (78 * 10), for a total of (I'll round up) 2000 TB, or 2 PB.
If I store those backups on a system with 10:1, I will need to put 200 TB of disk in my dedupe system. If I store them on a system that gets 20:1, I will need to put 100 TB of disk in that system. That's a difference of 100 TB of RAID 6-protected disk, so something like 125 TB of raw disk.
That's a lot of stinking power. Even if the vendor gave you the extra disk "free," you'd be paying for power and cooling on 125 1-TB disks with one, and wouldn't pay for them with the other.
So all I'm saying is that dedupe ratio matters, and a whole lot more than 5%.
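For anyone who wants to plug in their own numbers, here is a rough Python sketch of that 90-day arithmetic. The function names are made up for illustration, and the 8 data + 2 parity RAID 6 layout is an assumption chosen because it reproduces the "100 TB usable is about 125 TB raw" figure; real layouts vary.

```python
def logical_backup_tb(full_tb, fulls, incr_tb, incrs):
    """Total TB sent to the dedupe system over the retention period."""
    return full_tb * fulls + incr_tb * incrs


def stored_tb(logical_tb, ratio):
    """Usable TB needed after dedupe at the given ratio."""
    return logical_tb / ratio


def raw_disk_tb(usable_tb, data_disks=8, parity_disks=2):
    """Raw TB for a RAID 6 group (ASSUMED 8 data + 2 parity disks)."""
    return usable_tb * (data_disks + parity_disks) / data_disks


if __name__ == "__main__":
    exact = logical_backup_tb(full_tb=100, fulls=12, incr_tb=10, incrs=78)
    rounded = 2000  # the rounded-up figure used in the example
    print(f"Logical backups: {exact} TB (rounded up to {rounded} TB)")
    for ratio in (10, 20, 40):
        usable = stored_tb(rounded, ratio)
        print(f"{ratio}:1 needs {usable:.0f} TB usable, "
              f"roughly {raw_disk_tb(usable):.0f} TB raw")
```

Every doubling of the ratio halves the disk you have to power and cool, which is the whole point.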
While I generally agree that suggesting that "it doesn't matter" may be an overstatement, I don't think your statement about needing 3x the disk is correct.
Achieving any dedup target depends on redundancy of data, which usually means repeated backups over time. If I have 100TB of data, the first night's backup will need something approaching 100TB (likely less with compression and some inherent dedup). Only after that, with subsequent backups, will I begin to see the dedup effect -- which I can't REALLY be sure of until after the fact.
So, in terms of quantity of disk, I'm likely to need some base amount plus an additional amount that will vary depending on my confidence in the dedup-ability. This variance across different vendors is not going to be 3X!
As soon as you believe dedupe ratio doesn't matter, you'll get a call from the boss asking why you need MORE disk for backups.