<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Hash Collisions: The Real Odds</title>
		<description>Discuss Hash Collisions: The Real Odds</description>
		<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html</link>
		<lastBuildDate>Fri, 10 Feb 2012 21:00:52 +0000</lastBuildDate>
		<generator>JComments</generator>
		<atom:link href="http://www.backupcentral.com/component/jcomments/feed/com_content/145/10.html" rel="self" type="application/rss+xml" />
		<item>
			<title>Stephen H  Carter says:</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-1290</link>
			<description><![CDATA[I love the fact that LTO4 is more likely to screw me over. I use 3592s. I like the way this maths works out. I've just been through the dedupe VTL vendors and was concerned that, rather than getting IBM or Sepaton, I'm getting a hash-only DD (don't they call DD appliances fast breeders?). I'm now comfortable that the massive hash collision risk I'm being exposed to is actually trillions of times less likely than the risk of the plane landing on the data center. BTW Mboot, the reactor shutdown sequence you need is on the back of my coffee coaster. You have to translate it from French to English. BOOM! Oops, you must have used a French-to-American dictionary.]]></description>
			<dc:creator>Stephen H  Carter</dc:creator>
			<pubDate>Wed, 13 Jul 2011 20:23:29 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-1290</guid>
		</item>
		<item>
			<title>tape-risk is not the same as silent data loss</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-608</link>
			<description><![CDATA[I looked at this point again later, after getting the hash key length and data range of one dedupe solution on the market that lives with the collision risk (I'd rather not say which product and which SE gave me the numbers). The hash key length there is 128 bytes (which gives 2^1024 different possible bit combinations). The data area for every hash key is typically 24 KBytes, that is 24576 bytes, and so 2^196608 different possible bit combinations. That means behind every single hash key there can be 2^195584 different bit combinations in the data area (2^196608 / 2^1024). Looking at the risk positively: 1 in 2^1024 is a really very small chance of getting a duplicate hash key. Looking at the risk negatively: with 2^195584 possible contents behind each hash key, the chance that two blocks with the same hash key hold the same content is, taken purely as a count of combinations, astronomically small. I fully agree that in the real world, in nearly every case, duplicate stored or backed-up data will be the reason for identical hash keys. But if there is a collision, we run into "silent data loss": any deduped storage will lose information for that part. You can do hundreds of full backups to deduped storage and after that hundreds of tape copies from it. That may not help: the data for that part is lost. The risk of multi-bit errors on tape is a different thing and not comparable, from my point of view: you will get a good restore result again when you restore that item from another backup medium (which may be one day older).]]></description>
			<dc:creator>Dieter Unterseher</dc:creator>
			<pubDate>Fri, 08 May 2009 06:14:38 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-608</guid>
		</item>
		<item>
			<title>Hash Key Collision Risk Thoughts</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-262</link>
			<description><![CDATA[I played around a little with the Excel sheet that calculates the hash key collision risk. There seems to be a small mistake in it: a higher dedupe rate must result in a higher risk ("no dedupe = no risk"). Also, I think we are trusting that the algorithm (maybe MD5) spreads all the data in this world perfectly evenly across the maximum number of different hashes. I am no expert on that, but I see an additional risk here of getting many more duplicates than theoretically expected. It would be a nice idea to give the calculation sheet to experienced mathematicians to check it in depth. Ignoring all that, I tested a bit with the existing Excel calculation. My result is that the risk can increase dramatically, depending on the hash size and data block size. For example, for a system with a hash size of 4 bytes and a block size of 4096 bytes, the result is that we will have a hash key collision in nearly every GB (99.76%). So for any dedupe algorithm on the market, it is not clear how high the risk is without knowing at least the algorithm (maybe MD5), the hash size and the average data block size. It is only clear that there is a small risk, which can be avoided if the dedupe algorithm does a verification. But this verification costs performance. My personal view is: - for data center backups of high-value data (maybe an ERP database) I would like to avoid this risk (better to have somewhat less performance). - for relatively low-value data backed up over thin wires, I would accept this risk (because it may be the only chance to have a backup at all, and I would like to get the great benefit in backup speed and low traffic over the WAN link). Maybe a future dedupe algorithm will give us the choice of both (verification or not), depending on - an adjustable backup policy (data needs) - or the personal character of the IT staff (gamblers and risk-averse people may both be fine with that). - Unfortunately, there seems to be no product that gives the customer that choice.]]></description>
			<dc:creator>Dieter Unterseher</dc:creator>
			<pubDate>Fri, 08 May 2009 06:13:23 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-262</guid>
		</item>
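		<!--
		The 4-byte hash, 4096-byte block example in the comment above can be
		checked with the standard birthday-bound approximation. This is a minimal
		sketch, not the spreadsheet's formula (which is not shown in the thread).
		Assumptions: hashes are uniformly distributed, every block is distinct,
		and 1 GB is taken as 2^30 bytes. The name collision_probability is
		hypothetical. Under these assumptions the 32-bit case comes out above
		99.9% per GB, the same order as the 99.76% quoted, while a 128-bit hash
		(the size of an MD5 digest) gives a vanishingly small value.

```python
import math


def collision_probability(hash_bits, num_blocks):
    """Birthday-bound approximation of the chance that at least two of
    num_blocks distinct blocks share a hash value.

    Assumes the hash output is uniformly distributed over 2**hash_bits values.
    """
    space = 2 ** hash_bits
    exponent = num_blocks * (num_blocks - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-exponent)


# 1 GB (assumed to be 2**30 bytes) split into 4096-byte blocks.
blocks_per_gb = 2 ** 30 // 4096

p_32bit = collision_probability(32, blocks_per_gb)    # 4-byte hash
p_128bit = collision_probability(128, blocks_per_gb)  # MD5-sized hash

print("32-bit hash: ", p_32bit)
print("128-bit hash:", p_128bit)
```

		The approximation says nothing about how well a real algorithm spreads
		real data, which is the additional risk the comment raises.
		-->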
		<item>
			<title>Right, except....</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-187</link>
			<description><![CDATA[Right... Except none of the "big-hash" (i.e. SHA-1) vendors (which is what the blog entry was about) DO a bit-level check after the hash. There just aren't enough computing cycles left after you've calculated a hash that big. If you do something less computationally expensive than a SHA-1 hash, then maybe you have enough compute cycles to do a bit-level comparison. For example, both Diligent & SEPATON do bit-level comparison, but they don't use hashing.]]></description>
			<dc:creator>W. Curtis Preston</dc:creator>
			<pubDate>Fri, 11 Jan 2008 12:10:14 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-187</guid>
		</item>
		<item>
			<title>Dedupe can be immune to data loss from hash collis</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-186</link>
			<description><![CDATA[A dedupe system that uses hashes probably uses the hashes only to determine if a given block already exists. It probably does not use the hash to address the block when it is stored. Such a system can be immune to hash collisions by performing an initial hash lookup. On a match, it must then do a byte-by-byte comparison of the input data against the existing block to check that they are identical. The penalty is that the byte-by-byte check costs time and I/O, so the ingest rate of the system will suffer.]]></description>
			<dc:creator>geoffna</dc:creator>
			<pubDate>Fri, 11 Jan 2008 04:01:57 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-186</guid>
		</item>
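		<!--
		The lookup-then-verify scheme described in the comment above can be
		sketched as follows. This is an illustration under stated assumptions,
		not any vendor's implementation: the class VerifyingDedupeStore, its
		methods and the in-memory lists are all hypothetical, and SHA-1 is used
		only because the thread mentions it. The hash selects candidate blocks,
		and a byte-by-byte comparison decides whether the incoming data is
		really a duplicate, so a collision costs space and time but never
		returns the wrong data.

```python
import hashlib


class VerifyingDedupeStore:
    """Toy in-memory block store: the hash is a lookup hint, not the address."""

    def __init__(self, hash_fn=None):
        # hash_fn is injectable so a deliberately weak hash can force collisions.
        self.hash_fn = hash_fn or (lambda data: hashlib.sha1(data).digest())
        self.blocks = []    # block id is the index into this list
        self.buckets = {}   # digest mapped to the ids of blocks sharing it

    def put(self, data):
        """Store data and return its block id, reusing a verified duplicate."""
        candidates = self.buckets.setdefault(self.hash_fn(data), [])
        for block_id in candidates:
            # Byte-by-byte check: only an exact match counts as a duplicate.
            if self.blocks[block_id] == data:
                return block_id
        # New content, or a hash collision with different content.
        self.blocks.append(data)
        new_id = len(self.blocks) - 1
        candidates.append(new_id)
        return new_id

    def get(self, block_id):
        return self.blocks[block_id]
```

		Passing a weak hash such as hash_fn=lambda d: d[:1] makes b"abc" and
		b"abd" collide, and the store still keeps them as two separate blocks.
		The comparison is the extra time and I/O the comment refers to.
		-->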
		<item>
			<title>I should clarify</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-171</link>
			<description><![CDATA[One thing I think gets lost, and the point I was really trying to make, is that a lot of these technologies are really at the point where the technical hurdles aren't the problem. Most of us in the field are quite aware of the limits of even the most tried-and-true technologies. The problem comes with less-conversant management types, especially outside of IT. New technology can scare them pretty easily, and the first hint of a problem sends them scurrying to the familiar, if not safer, surroundings of tape. Witness the "Presto" gizmo: it lets you print out e-mail without a computer, with just a printer. Which also happens to be the most trouble-prone area of most networks I've worked with. I actually got into backups to get out of printer admin. But it's seen as "easier" because you get ink on paper, a more familiar place. It's just a natural human reaction to retreat to the familiar in case of trouble. The sheer number of options in the de-dupe space is probably the best argument I see right now for those folks: if Windows Home Server can do it, it can't be rocket science anymore, right?]]></description>
			<dc:creator>tburrell</dc:creator>
			<pubDate>Mon, 12 Nov 2007 13:12:10 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-171</guid>
		</item>
		<item>
			<title>Do you really know of a tape f</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-167</link>
			<description><![CDATA[I understand the comment, tburrell, and I mostly agree with what you say. If you have an error on tape while writing the backup, you will typically know. If you get the best dedupe solution, you will also know if there is a problem when writing to disk. The problem with tape is that you are never certain you're safe. You can reread the tape and verify the data, but almost every tape gets rewound when done. How do you know that a problem didn't come into play at the end of the rewind? Maybe it stretches a bit, etc. At the end of the day I feel a lot safer having my data protected by a dedupe solution on disk. Just wanted to share my thoughts.]]></description>
			<dc:creator>scottcorp</dc:creator>
			<pubDate>Thu, 08 Nov 2007 06:25:32 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-167</guid>
		</item>
		<item>
			<title>Don't get me wrong...</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-165</link>
			<description><![CDATA[I am in complete agreement, and am working to get de-dupe running in our environment (more of an acquisition cost issue than a FUD problem right now: what you own always looks cheaper than what you need to buy). I just like stirring the pot. :twisted:]]></description>
			<dc:creator>tburrell</dc:creator>
			<pubDate>Mon, 29 Oct 2007 15:06:58 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-165</guid>
		</item>
		<item>
			<title>Same thing happens with tape</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-163</link>
			<description><![CDATA[I think that's a perfectly valid point, but I remind you that the same exact scenario exists with tape. Not all tape failures are detected or reported. I've never met anyone who had a hash collision, but I can tell you of more than one restore I've personally done from tape that said it was fine, only to have the user tell me the restored data was not usable. It happens all the time. Also, remember that only some vendors are hash-only. If the odds of hash collisions bother you, don't give up on de-dupe. There are plenty of vendors that don't use hashes.]]></description>
			<dc:creator>cpreston</dc:creator>
			<pubDate>Sat, 27 Oct 2007 13:13:42 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-163</guid>
		</item>
		<item>
			<title>It's not the odds- it's the k</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-156</link>
			<description><![CDATA[The one thing that does scare me about de-dupe: I won't know I have a problem. If a tape fails I may not be happy about it, but at least I know the data is bad. If I have a hash collision, I might restore the payroll database and declare "mission accomplished", only to have one of the chunks be invalid due to a hash collision. The risk to the data may be similar, but the faith in the backup system is out the window faster than you can say "you want fries with that?"]]></description>
			<dc:creator>tburrell</dc:creator>
			<pubDate>Tue, 23 Oct 2007 06:59:01 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html#comment-156</guid>
		</item>
	</channel>
</rss>

