A lot of FUD has been passed around lately about the probability of a hash collision when using a hash-only de-duplication system. What is a hash-only de-dupe system? What is the real probability of a hash collision in such a system? Read on to find out.
In a recent very long thread (with 63 posts) on the NetBackup Mailing List, someone took issue with a quote of mine from my January article, "The Skinny on Deduplication." In that article, I pointed out that the odds of having two different blocks of data have the same hash (known as a hash collision) are 1:2^160, which is an astronomical number. They said that what's important is the probability of a hash collision in a given environment, and those odds increase with the size of the environment. They told me to read up on the Birthday Paradox to see what they're talking about. I did, and now I'm back. So here goes.
Hash-only De-dupe Systems
First, whatever your thoughts on hash collisions, the concerns about them only apply to de-duplication systems that use only a hash to identify redundant data. Other vendors, who use either a secondary check to verify a match, or who don't use hashes at all, don't have to worry about hash collisions. We maintain a list of de-dupe vendors and what methods they use to identify redundant data in our Wiki.
If a vendor uses only a hash to identify redundant data, I think it's perfectly reasonable to discuss the probability of hash collisions. Trust me, they've had the question before. But just what are the odds?
First, let's discuss the Birthday Paradox, which isn't really a paradox. It's just a way to illustrate how probabilities aren't quite what they may seem. With a "key space" of 365 days in a year, the probability of any two people having the same birthday is 1:365. (This is analogous to me saying that the probability of two different blocks having the same hash is 1:2^160 if you have a key space of 2^160.) But what are the odds that two people in a room have the same birthday? The more people there are in the room, the higher the probability. It actually only takes 23 people in the room to push the odds past 50%! That doesn't seem right, but read this Wikipedia entry to understand how that works. Applying that logic to the hash collision question, the odds that a given company will have a hash collision somewhere in their environment increase with the size of their environment (analogous to the number of people in a room increasing the chances that two of them will have the same birthday).
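The 23-person figure is easy to verify directly. Here's a short sketch in Python (the 365-day year and the 23-person threshold come straight from the paradox; leap years are ignored):

```python
from math import prod

# Probability that at least two of n people share a birthday,
# assuming 365 equally likely birthdays.
def birthday_collision(n: int) -> float:
    # P(no shared birthday) = (365/365) * (364/365) * ... * ((365-n+1)/365)
    p_unique = prod((365 - i) / 365 for i in range(n))
    return 1 - p_unique

print(f"{birthday_collision(23):.1%}")  # just over 50% with only 23 people
```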
So… if the odds of a hash collision increase with each new block in an environment, how big an environment must we have before the odds of having a hash collision are unacceptable? The first thing we did was to convert the simple formula in the Wikipedia entry into a formula in an Excel spreadsheet. (You can download the spreadsheet I used here.) Where F8 is the number of blocks in an environment, and F5 is the number of keys in the key space, the formula for the probability of a hash collision in a given environment is 1-EXP((-(F8^2))/(2*F5)).
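The same spreadsheet formula can be sketched in Python, with named parameters standing in for cells F8 and F5:

```python
from math import exp

# Birthday-bound approximation: P(collision) ≈ 1 - e^(-n^2 / 2k),
# where n is the number of blocks and k the number of keys in the key space.
def collision_probability(blocks: float, keyspace: float) -> float:
    return 1 - exp(-(blocks ** 2) / (2 * keyspace))

# Sanity check against the birthday numbers: 23 "blocks", 365 "keys".
print(collision_probability(23, 365))  # roughly 0.5, as the paradox predicts
```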
The number of blocks in a given environment is a function of how many TB of disk they have, divided by the average chunk size of the de-duplication system. (Such systems use chunk sizes ranging from 8KB to 128KB.) The number of keys in the key space is determined by the hashing algorithm. SHA-1 (the most common hash) is a 160-bit hash, so this number is 2^160.
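Putting those two inputs together, here is a quick sketch of the block-count arithmetic (binary TB and KB assumed; decimal units would shift the totals slightly):

```python
# Number of chunks ("blocks") a store of a given size produces,
# given capacity in (binary) TB and chunk size in KB.
def block_count(capacity_tb: float, chunk_kb: float) -> int:
    return int(capacity_tb * 2**40 // (chunk_kb * 2**10))

# e.g. 100 TB deduplicated with 8 KB chunks:
print(f"{block_count(100, 8):,}")  # 13,421,772,800 blocks
```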
Unfortunately, Excel only displays results to 30 decimal places. In order to have anything other than 0.000000000000000000000000000000% register in the calculated odds field, you need to have 71 quadrillion (70,921,000,000,000,000) blocks. At that point, the odds change to 0.000000000000177635683940025000%. To have that many blocks, you need 530 EB with 8 KB chunks, 1.1 ZB (Zettabytes) with 16 KB chunks, 1.6 ZB with 24 KB chunks, or 8.4 ZB with 128 KB chunks. Anything less than that, and Excel displays the odds as zero.
I wanted a little more precision, so I converted the Excel formula into a PHP formula, which I could have display up to 50 decimal places:
$value = 1 - exp((-(pow($blocks, 2))) / (2 * $keyspace)); // probability of a collision
$value = $value * 100; // expressed as a percentage
Using this formula, I discovered that I had to have at least 12.8 quadrillion blocks to get the odds to appear within 50 decimal places. Anything less than 12.8 quadrillion blocks, and your odds display as 0.00000000000000000000000000000000000000000000000000%. With 12.8 quadrillion blocks, your odds display as 0.00000000000001110223024625156540423631668090820313%. To have 12.8 quadrillion blocks, you need 95 EB with 8 KB chunks, 190 EB with 16 KB chunks, or 290 EB with 24 KB chunks.
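One caveat worth noting: PHP (like Excel) computes in 64-bit floating point, which carries only about 16 significant digits no matter how many you display, so at these scales an arbitrary-precision cross-check is a good idea. Here's a sketch using Python's decimal module, assuming SHA-1's 2^160 key space:

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # 60 significant digits, far beyond a 64-bit double

# Same birthday-bound formula, evaluated in arbitrary precision.
def collision_percent(blocks: int, bits: int = 160) -> Decimal:
    keyspace = Decimal(2) ** bits
    exponent = -(Decimal(blocks) ** 2) / (2 * keyspace)
    return (1 - exponent.exp()) * 100  # as a percentage

# 12.8 quadrillion blocks against SHA-1's 2^160 possible hashes:
print(collision_percent(12_800_000_000_000_000))
```

At full precision the answer lands in the same vanishingly small territory, so the conclusion doesn't change.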
In comparison, the odds of having an undetected bit error every time you write to an LTO tape drive are 0.000000000000000000000000100000%, or 1 in 10^27.
So, there you have it. If you use the product with the smallest chunk size (making your odds worse), and you've got 95 Exabytes of data, your odds of having a single hash collision are higher than the undetected-bit-error odds you face every time you use an LTO tape, but they are still vanishingly small. (And we haven't applied the Birthday Paradox to LTO, either.)
Put another way, if you have 95 EB of data, you have a 0.00000000000001110223024625156540423631668090820313% chance that your system will discard a block of data that it should keep. The only thing that would affect is a restore that requires that particular block of data. Anyone want to calculate the odds that the block of data that got corrupted actually ends up being needed in a restore?
And if you have something less than 95 EB of data (and you know you do), then your odds don't appear in 50 decimal places.
I think I'm still OK with these odds.