Written by W. Curtis Preston
Sunday, 14 October 2007 00:51

A lot of FUD has been passed around lately about the probability of a hash collision when using a hash-only de-duplication system. What is a hash-only de-dupe system? What is the real probability of a hash collision in such a system? Read on to find out.

**This blog entry is one in a series on deduplication. The previous post was about my podcast, and the next post is about the maturity of deduplication . **

In a recent very long thread (with 63 posts) on the NetBackup Mailing List, someone took issue with a quote of mine from my January article, "The Skinny on Deduplication." In that article, I pointed out that the odds of having two different blocks of data have the same hash (known as a hash collision) are 1:2^160, which is an astronomical number. They said that what's important is the probability of a hash collision in a given environment, and those odds increase with the size of the environment. They told me to read up on the Birthday Paradox to see what they're talking about. I did, and now I'm back. So here goes.

**Hash-only De-dupe systems**

First, whatever your thoughts on hash collisions, the concerns about them only apply to de-duplication systems that use only a hash to identify redundant data. Other vendors, who use either a secondary check to verify a match, or who don't use hashes at all, don't have to worry about hash collisions. We maintain a list of de-dupe vendors and what methods they use to identify redundant data in our Wiki.

If a vendor uses only a hash to identify redundant data, I think it's perfectly reasonable to discuss the probability of hash collisions. Trust me, they've had the question before. But just what are the odds?

**Birthday Paradox**

First, let's discuss the Birthday Paradox, which isn't really a paradox. It's just a way to illustrate how probabilities aren't quite what they may seem. With a "key space" of 365 days in a year, the probability of any two people having the same birthday are 1:365. (This is analogous to me saying that the probability of two different blocks having the same hash are 1:2^160 if you have a key space of 2^160.) But, what are the odds that two people in a room have the same birthday? The more people there are in the room, the higher the probability. There actually only needs to be 23 people in the room to increase the odds to 50%! That doesn't seem right, but read this Wikipedia entry to understand how that works. Applying that logic to the hash collision question, the odds that a given company will have a hash collision somewhere in their environment increases with the size of their environment (analogous to the number of people in a room increasing the chances that two of them will have the same birthday.)

**Using Excel**

So... if the odds of a hash collision increase with each new block in an environment, how big of an environment must we have before the odds of having a hash collision are unacceptable? The first thing we did was to convert the simple formula in the Wikipedia entry into a formula in an Excel spreadsheet. (You can download the spreadsheet I used here .) Where F8 is the number of blocks in an environment, and F5 is the number of keys in the key space, the formula for the probability of a hash collision in a given environment is 1-EXP((-(F8^2))/(2*F5)).

The number of blocks in a given environment is a function of how many TB of disk they have, divided by the average chunk size of the de-duplication system. (Such systems use chunk sizes ranging from 8KB to 128KB.) The number of keys in the key space is determined by the hashing algorithm. SHA-1 (the most common hash) is a 160-bit hash, so this number is 2^160.

Unfortunately, Excel only calculates things to 30 decimal places. In order to have anything other than 0.000000000000000000000000000000% register in the calculated odds field, you need to have 71 quadrillion (70,921,000,000,000,000) blocks. At that point, the odds will change to 0.000000000000177635683940025000%. To have that many blocks, you need 530 EB with 8 KB chunks, 1.1 ZB (Zettabytes) with 16 KB chunks, 1.6 ZB with 24 KB chunks, and 8.4 ZB with 128 KB chunks. Anything less than that, and the odds are zero.

**Use PHP**

I wanted a little more precision, so I converted the Excel formula into a PHP formula, since it calculates up to 50 decimal places:

$value=1 - exp((-(pow($blocks,2)))/(2 * $keyspace));

$value=$value * 100;

$num=number_format($value,50);

print "$num\n";

Using this formula, I discovered that I had to have at least 12.8 quadrillion blocks to get the odds to appear with 50 decimal places. Anything less than 12.8 quadrillion blocks, and your odds are 0.00000000000000000000000000000000000000000000000000%. With 12.8 quadrillion blocks, your odds are 0.00000000000001110223024625156540423631668090820313%. To have 12.8 quadrillion blocks, you need 95 EB with 8 KB chunks, 190 EB with 16 KB chunks, or 290 EB with 24 KB chunks.

In comparison, the odds of having an undetected bit error rate every time you write to an LTO tape drives are 0.000000000000000000000000100000%, or 1 in 10^(27).

**The Results**

So, there you have it. If you use the product with the smallest chunk size (making your odds worse), and you've got 95 Exabytes of data, your odds of having **a single hash collision **are slightly higher than the odds that you face every time you use an LTO tape. (And we haven't applied the Birthday Paradox to LTO, either.)

Put another way, if you have 95 EB of data, you have a 0.00000000000001110223024625156540423631668090820313% chance that you have will your system will discard a block of data that it should keep. The only thing that will effect is a restore that requires that particular block of data. Anyone want to calculate the odds that the block of data that got corrupted actually ends up being needed in a restore?

And if you have something less than 95 EB of data (and you know you do), then your odds don't appear in 50 decimal places.

I think I'm still OK with these odds.

**This blog entry is one in a series on deduplication. The previous post was about my podcast, and the next post is about the maturity of deduplication . **

## Comments

12I like the way this maths works out. I've just been thru the Dedupe VTL vendors and was concerned that rather than getting IBM or Sepaton I'm getting a, hash only, DD ( Dont they call DD appliances fast breeders ). I'm now comfortable that the massive hash collision risk I'm being exposed to is actually trillions of times less likely than the risk of the plane landing on the Data Center.

BTW Mboot the reactor shutdown sequence you need is on the back of my Coffee coaster, You have to tranlate it from French to English.

BOOM! oops you must have used a French to American dictionary.

HashKey Length is ther 128 Bytes (gives 2^1024 different possible bit combinations).

Data area length for every hashkey is typically 24 KBytes

- that means 24576 Bytes and so 2^196608 different possible bit combinations

That means behind every single hashkey there can be 2^195.584 different bitcombinations in the data area (2^196608 - 2^1024).

Looking positively to the risk: 2^1024 is a really very small chance to get a duplicate hashkey.

Looking negatively to the risk: 2^195584 is a more than astronomic small chance, that any hashkey should represent the same content.

I full agree, that in real world in nearly every case, duplicated stored or backedup data will be the reason to get same hashkeys.

But, if there is a collision, we run into "silent data loss":

- any deduped-storage will loose information for that part

- you can do hundreds of full-backups to a deduped storage and after that hundreds of tape-copies from that. That may not help: the data for that part are lost.

Any risk for multibit-errors on PTape are a different not comparable thing from my point of view: you will get a good restore result again, when you restore that item from another backup-media (which may be one day older).

There seems to be a little mistake inside that: a higher dedupe rate must result in a higher risk ("no dedupe = no risk"-).

Also, I think, we trust that the algoritm (may be MD5) is absolutely perfect for all data in this world to create a maximum of different hashes.

I am no expert about that, but I see here also an additonal risk to get much more duplicates then theoretical expected.

It would be a nice idea to give the calculation sheet to experienced mathematicians to proof it in deep.

Ignoring that all, I tested little bit round with the existing EXCEL-Calculation.

My result is, the risk can dramatically increase, depending on the hash size and data block size.

For example, if there would be a system with:

- hash size 4 Bytes

- block size 4096 Bytes

it results: nearly every GB we will have a hash key collision (99,76%).

So, it is not clear for any dedupe algorithm on the market, how high the risk is, without knowing at least the algorithm (may be MD5), the hash size and the average data block size.

It is only clear, that there is a little risk, which can be avoided, if a dedupe algorithm does a verification.

But this verification costs performance.

My personal view is:

- for data center backups with data of high worth (may be a ERP-Database)

I would like to avoid this risk (better have some less performance).

- for relatively unworthy data backups over thin wires,

I would like to accept this risk

(because it may be the only chance to have a backup at all

- and I would like to get the great benefit for the backupspeed

and low traffic over the WAN-wire).

May be a future dedupe algorithm gives us the choice to offer both (verification or not), depending on

- an adjustable backup-policy (data needs)

- or the personal character of the IT-Stuff

(Gamblers and risk-averse guys may be fine with that).

- Unfortunately there seems to be no product, which gives the customer that chance.

Such a system can be immune to hash collisions by performing an initial hash lookup. On a match, it must then do a byte-by-byte comparison of the input data against the existing block to check that they are identical.

The penalty is that the byte-by-byte check costs time and I/O, so the ingest rate of the system will suffer.

Also, remember that only some vendors are hash-only. If the odds of hash collisions bother you, don't give up on de-dupe. There are plenty of vendors that don't use hashes.

12RSS feed for comments to this post