Hash Collisions: The Real Odds

A lot of FUD has been passed around lately about the probability of a hash collision when using a hash-only de-duplication system.  What is a hash-only de-dupe system?  What is the real probability of a hash collision in such a system?  Read on to find out.

This blog entry is one in a series on deduplication.  The previous post was about my podcast, and the next post is about the maturity of deduplication.

In a recent very long thread (with 63 posts) on the NetBackup Mailing List, someone took issue with a quote of mine from my January article, "The Skinny on Deduplication."  In that article, I pointed out that the odds of two different blocks of data having the same hash (known as a hash collision) are 1:2^160, which is an astronomical number.  They said that what's important is the probability of a hash collision in a given environment, and that those odds increase with the size of the environment.  They told me to read up on the Birthday Paradox to see what they were talking about.  I did, and now I'm back.  So here goes.

Hash-only De-dupe Systems

First, whatever your thoughts on hash collisions, the concerns about them apply only to de-duplication systems that use only a hash to identify redundant data.  Vendors that either use a secondary check to verify a match or don't use hashes at all don't have to worry about hash collisions.  We maintain a list of de-dupe vendors and the methods they use to identify redundant data in our Wiki.

If a vendor uses only a hash to identify redundant data, I think it's perfectly reasonable to discuss the probability of hash collisions.  Trust me, they've had the question before.  But just what are the odds?

Birthday Paradox 

First, let's discuss the Birthday Paradox, which isn't really a paradox.  It's just a way to illustrate how probabilities aren't quite what they may seem.  With a "key space" of 365 days in a year, the probability of any two people having the same birthday is 1:365.  (This is analogous to me saying that the probability of two different blocks having the same hash is 1:2^160 if you have a key space of 2^160.)  But what are the odds that two people in a room have the same birthday?  The more people there are in the room, the higher the probability.  It actually takes only 23 people in the room to push the odds to 50%!  That doesn't seem right, but read this Wikipedia entry to understand how that works.  Applying that logic to the hash collision question, the odds that a given company will have a hash collision somewhere in their environment increase with the size of their environment (analogous to the number of people in a room increasing the chances that two of them will have the same birthday).
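
If you want to check the 23-person figure yourself, here is a quick PHP sketch (my own illustration, not something from the mailing list thread) that computes the exact shared-birthday probability with the standard product formula:

function birthday_collision_probability($people, $days = 365) {
    $p_unique = 1.0;                         // probability that all birthdays are different
    for ($i = 0; $i < $people; $i++) {
        $p_unique *= ($days - $i) / $days;   // each new person must avoid every already-taken day
    }
    return 1 - $p_unique;                    // probability of at least one shared birthday
}
printf("%.1f%%\n", 100 * birthday_collision_probability(23));   // prints 50.7%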

Using Excel 

So… if the odds of a hash collision increase with each new block in an environment, how big of an environment must we have before the odds of having a hash collision are unacceptable?  The first thing we did was to convert the simple formula in the Wikipedia entry into a formula in an Excel spreadsheet.  (You can download the spreadsheet I used here.)  Where F8 is the number of blocks in an environment and F5 is the number of keys in the key space, the formula for the probability of a hash collision in a given environment is 1-EXP((-(F8^2))/(2*F5)).

The number of blocks in a given environment is a function of how much disk capacity you have, divided by the average chunk size of the de-duplication system.  (Such systems use chunk sizes ranging from 8 KB to 128 KB.)  The number of keys in the key space is determined by the hashing algorithm.  SHA-1 (the most common hash) is a 160-bit hash, so this number is 2^160.
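
To make that conversion concrete, here is a tiny PHP sketch; the 1 PB capacity and 8 KB chunk size are just example values I picked for illustration, not figures from any particular product:

$capacity_bytes = 1 * pow(2, 50);     // example: 1 PB of disk
$chunk_bytes    = 8 * 1024;           // example: 8 KB average chunk size
$blocks = $capacity_bytes / $chunk_bytes;
printf("%.0f blocks\n", $blocks);     // about 137 billion blocks per PB at this chunk size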

Unfortunately, Excel only calculates things to 30 decimal places.  In order to have anything other than 0.000000000000000000000000000000% register in the calculated odds field, you need to have 71 quadrillion (70,921,000,000,000,000) blocks.  At that point, the odds change to 0.000000000000177635683940025000%.  To have that many blocks, you need 530 EB with 8 KB chunks, 1.1 ZB (zettabytes) with 16 KB chunks, 1.6 ZB with 24 KB chunks, or 8.4 ZB with 128 KB chunks.  Anything less than that, and Excel shows the odds as zero.

Using PHP

I wanted a little more precision, so I converted the Excel formula into PHP, which can display up to 50 decimal places:

$blocks   = 12.8e15;               // example value: 12.8 quadrillion blocks (the figure discussed below)
$keyspace = pow(2, 160);           // SHA-1 key space: 2^160 possible hash values
$value = 1 - exp((-(pow($blocks, 2))) / (2 * $keyspace));
$value = $value * 100;             // express the probability as a percentage
$num = number_format($value, 50);  // format with 50 decimal places
print "$num\n";

Using this formula, I discovered that I had to have at least 12.8 quadrillion blocks to get the odds to appear with 50 decimal places.  Anything less than 12.8 quadrillion blocks, and your odds are 0.00000000000000000000000000000000000000000000000000%.  With 12.8 quadrillion blocks, your odds are 0.00000000000001110223024625156540423631668090820313%.  To have 12.8 quadrillion blocks, you need 95 EB with 8 KB chunks, 190 EB with 16 KB chunks, or 290 EB with 24 KB chunks.

In comparison, the odds of an undetected bit error every time you write to an LTO tape drive are 0.000000000000000000000000100000%, or 1 in 10^27.

The Results 

So, there you have it.  If you use the product with the smallest chunk size (making your odds worse), and you've got 95 Exabytes of data, your odds of having a single hash collision are slightly higher than the odds that you face every time you use an LTO tape.  (And we haven't applied the Birthday Paradox to LTO, either.) 

Put another way, if you have 95 EB of data, you have a 0.00000000000001110223024625156540423631668090820313% chance that your system will discard a block of data that it should keep.  The only thing that would affect is a restore that requires that particular block of data.  Anyone want to calculate the odds that the block of data that got corrupted actually ends up being needed in a restore?

And if you have something less than 95 EB of data (and you know you do), then your odds don't appear in 50 decimal places.

I think I'm still OK with these odds.

This blog entry is one in a series on deduplication.  The previous post was about my podcast, and the next post is about the maturity of deduplication.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

13 comments
  • A few years ago I played with similar ideas and made a proof of concept for a backup program based on minimal storage. (http://www.inmultiplecontexts.org/mboot.org/ideas/amrta)

    In order to minimize the risk of hash collisions I put in two different hash functions, md5 and sha1.

    I hold the opinion that a backup program MUST be 100% foolproof in its internal assumptions. This means that even the possibility (however remote) of a restore generating different data from the original file MUST be excluded. It could be your paycheck, your medical records or the correct code to stop a nuclear explosion from happening; these things should be recoverable without question. I am too mathematically undereducated to know or calculate the risk of a hash collision, but that is beside the point. I expect my backups to be restorable at all times, without the possibility of any changes happening in the data silently and beyond my control.

    Maarten

  • Yes, but there is a risk in everything you use.

    LTO tapes (as in the post) have a risk; you can copy each tape twice or more, but that risk will never be gone.

    This is the same for any medium data is stored on. You need to make the risk small enough that it becomes acceptable.

  • And the risks of a hash collision using both md5 and sha1 are roughly the odds of a hash collision in one multiplied by the odds of a hash collision in the other.

    And he’s right about tape. People think of it as foolproof in comparison to dedupe, but that would be incorrect.

  • The one thing that does scare me about de-dupe: I won’t know I have a problem. If a tape fails I may not be happy about it, but at least I know the data is bad. If I have a hash collision, I might restore the payroll database and declare “mission accomplished”, only to have one of the chunks be invalid due to a hash collision. The risk to the data may be similar, but the faith in the backup system is out the window faster than you can say “you want fries with that?”

  • I think that’s a perfectly valid point, but I remind you that the same exact scenario exists with tape. Not all tape failures are detected or reported. I’ve never met anyone who had a hash collision, but I can tell you of more than one restore I’ve personally done from tape that said it was fine, only to have the user tell me the restored data was not usable. It happens all the time.

    Also, remember that only some vendors are hash-only. If the odds of hash collisions bother you, don’t give up on de-dupe. There are plenty of vendors that don’t use hashes.

  • I am in complete agreement, and am working to get de-dupe running in our environment (more of an acquisition cost issue than a FUD problem right now; what you own always looks cheaper than what you need to buy). I just like stirring the pot. 😈

  • I understand the comment, tburrell, and I agree with most of what you say. If you have an error on tape while writing the backup, typically you will know. If you get the best dedupe solution, you will also know if there is a problem when writing to disk. The problem with tape is that you are never certain you're safe. You can reread the tape and verify the data, but almost every tape gets rewound when done. How do you know that a problem didn't come into play at the end of the rewind? Maybe it stretches a bit, etc. At the end of the day I feel a lot safer having my data protected by a dedupe solution on disk. Just wanted to share my thoughts.

  • One thing I think gets lost, and the point I was really trying to make, is that a lot of these technologies are really at the point where the technical hurdles aren't the problem. Most of us in the field are quite aware of the limits of even the most tried-and-true technologies. The problem comes with less-conversant management types, especially outside of IT. New technology can scare them pretty easily, and the first hint of a problem sends them scurrying to the familiar, if not safer, surroundings of tape. Witness the "Presto" gizmo, which lets you print out e-mail without a computer, just a printer. Which also happens to be the most trouble-prone area of most networks I've worked with. I actually got into backups to get out of printer admin. But it's seen as "easier" because you get ink on paper, a more familiar place. Just a natural human reaction to retreat to the familiar in case of trouble. The sheer number of options in the de-dupe space is probably the best argument I see right now for those folks: if Windows Home Server can do it, it can't be rocket science anymore, right?

  • A dedupe system that uses hashes probably uses the hashes only to determine if a given block already exists. It probably does not use the hash to address the block when it is stored.

    Such a system can be immune to hash collisions by performing an initial hash lookup. On a match, it must then do a byte-by-byte comparison of the input data against the existing block to check that they are identical.

    The penalty is that the byte-by-byte check costs time and I/O, so the ingest rate of the system will suffer.

  • Right… Except none of the "big-hash" (i.e. SHA-1) vendors (which is what the blog entry was about) DO a bit-level check after the hash. There just aren’t enough computing cycles left after you’ve calculated a hash that big. If you do something less computationally expensive than a SHA-1 hash, then maybe you have enough compute cycles to do a bit-level comparison. For example, both Diligent & SEPATON do bit-level comparison — but they don’t use hashing.

  • I played around a little with the Excel sheet to calculate the hash key collision risk.

    There seems to be a little mistake in it: a higher dedupe rate must result in a higher risk ("no dedupe = no risk").

    Also, I think we trust that the algorithm (maybe MD5) is absolutely perfect at creating a maximum of different hashes for all the data in this world.
    I am no expert on that, but I see here an additional risk of getting many more duplicates than theoretically expected.

    It would be a nice idea to give the calculation sheet to experienced mathematicians to prove it in depth.

    Ignoring all that, I tested around a little bit with the existing Excel calculation.
    My result is that the risk can dramatically increase, depending on the hash size and data block size.

    For example, if there were a system with:
    – hash size 4 bytes
    – block size 4096 bytes
    the result is that we would have a hash key collision nearly every GB (99.76%).

    So, for any dedupe algorithm on the market, it is not clear how high the risk is without knowing at least the algorithm (maybe MD5), the hash size and the average data block size.

    It is only clear that there is a little risk, which can be avoided if the dedupe algorithm does a verification.

    But this verification costs performance.

    My personal view is:
    – for data center backups with data of high worth (maybe an ERP database), I would like to avoid this risk (better to have somewhat less performance);
    – for backups of relatively low-value data over thin wires, I would like to accept this risk (because it may be the only chance to have a backup at all, and I would like to get the great benefit of backup speed and low traffic over the WAN wire).

    Maybe a future dedupe algorithm will give us the choice of both (verification or not), depending on:
    – an adjustable backup policy (data needs)
    – or the personal character of the IT staff (gamblers and risk-averse guys may both be fine with that).
    Unfortunately there seems to be no product that gives the customer that choice.

  • I looked at this point again later, after getting the hash key length and data range of one dedupe solution on the market that works with the collision risk (I don't want to tell which product, or which SE gave me the numbers).

    The hash key length there is 128 bytes (which gives 2^1024 different possible bit combinations).

    The data area length for every hash key is typically 24 KB – that means 24576 bytes, and so 2^196608 different possible bit combinations.

    That means behind every single hash key there can be 2^195584 different bit combinations in the data area (2^196608 / 2^1024).

    Looking at the risk positively: 1 in 2^1024 is a really very small chance of getting a duplicate hash key.

    Looking at the risk negatively: 1 in 2^195584 is a more than astronomically small chance that any hash key should represent the same content.

    I fully agree that in the real world, in nearly every case, duplicated stored or backed-up data will be the reason for getting the same hash keys.

    But if there is a collision, we run into "silent data loss":
    – any deduped storage will lose information for that part
    – you can do hundreds of full backups to a deduped storage and after that hundreds of tape copies from it. That may not help: the data for that part is lost.

    Any risk of multi-bit errors on PTape is a different, not comparable thing from my point of view: you will get a good restore result again when you restore that item from another backup medium (which may be one day older).

  • I love the fact that LTO4 is more likely to screw me over. I use 3592’s.

    I like the way this maths works out. I've just been through the dedupe VTL vendors and was concerned that rather than getting IBM or Sepaton I'm getting a hash-only DD (don't they call DD appliances fast breeders?). I'm now comfortable that the massive hash collision risk I'm being exposed to is actually trillions of times less likely than the risk of a plane landing on the data center.

    BTW Mboot, the reactor shutdown sequence you need is on the back of my coffee coaster. You have to translate it from French to English.

    BOOM! Oops, you must have used a French-to-American dictionary.