Welcome! » Log In » Create A New Profile

Interesting write-up of compare-by-hash.

Posted by Anonymous 
Interesting write-up of compare-by-hash.
January 27, 2004 06:34PM
[quote][quote][quote][quote][quote]Greg Freemyer <freemyer-ml < at > NorcrossGroup.com>
wrote the following on Fri, 02 Jan 2004 14:39:18 -0500
[/quote][/quote][/quote][/quote][/quote]
[quote]The writeup at:

http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/hash.html

in effect says that 'compare-by-hash' is not the perfect choice for
a rdiff-backup type of program. Instead, it says some kind of state
information should be added to the algorithym to ensure accuracy.

I don't know that I am concerned, but I am curious if rdiff-backup
takes any precautions against hash-collisions.
[/quote]
No, rdiff-backup takes no special precautions. As far as I can tell,
the paper makes the point that even cryptographically strong hashes
may be breakable in the future. This is a pretty good objection,
especially since librsync uses md5 instead of the stronger SHA1.

But in rdiff-backup the most an attacker could do would be to produce
a file that had the same signature, causing rdiff-backup to skip over
parts of the file. But an attacker can already do much better than
that: just use utime(), chmod, and such to make it look like your file
hasn't changed. Plus this method doesn't take years of research.

So anyway, rdiff-backup isn't intended to defeat an attacker trying to
make it look like his files haven't changed.

I did find this quote from the article interesting:

[quote]BitKeeper keeps state about each file under source control and knows
what changes have been made since the last time each tree was
synchronized. When synchronizing, it sends only the differences
since the last synchronization occurred, in compressed form. In
comparison to rsync, BitKeeper provides similar and sometimes better
bandwidth usage when simply synchronizing two trees without
resorting to compare-by-hash.
[/quote]
Does anyone know how BitKeeper can send only the differences without
using a compare-by-hash algorithm? I don't see how this is possible
unless BitKeeper has abandoned the usual edit w/arbitrary editor and
then checkin scheme.

--
Ben Escoto
Interesting write-up of compare-by-hash.
January 28, 2004 01:33AM
On Tue, Jan 27, 2004 at 06:33:57PM -0800, Ben Escoto wrote:
[quote]Does anyone know how BitKeeper can send only the differences without
using a compare-by-hash algorithm? I don't see how this is possible
unless BitKeeper has abandoned the usual edit w/arbitrary editor and
then checkin scheme.
[/quote]
IIRC BitKeeper uses a "explicit mark before editing each file" scheme
(a clean checkout marks all files as read-only), and I think it keeps
a clean local copy of the source tree as well.

That paper has some problems as well, see e.g. the general critiques
at
http://www.venge.net/monotone/docs/Hash-Integrity.html

-- Nathaniel

--
"The problem...is that sets have a very limited range of
activities -- they can't carry pianos, for example, nor drink
beer."
Interesting write-up of compare-by-hash.
January 28, 2004 11:21AM
On Tue, 2004-01-27 at 21:33, Ben Escoto wrote:
[quote][quote][quote][quote][quote][quote]Greg Freemyer <freemyer-ml < at > NorcrossGroup.com>
wrote the following on Fri, 02 Jan 2004 14:39:18 -0500
[/quote][/quote][/quote][/quote][/quote]
[quote]The writeup at:

http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/hash.html

in effect says that 'compare-by-hash' is not the perfect choice for
a rdiff-backup type of program. Instead, it says some kind of state
information should be added to the algorithym to ensure accuracy.

I don't know that I am concerned, but I am curious if rdiff-backup
takes any precautions against hash-collisions.
[/quote]
No, rdiff-backup takes no special precautions. As far as I can tell,
the paper makes the point that even cryptographically strong hashes
may be breakable in the future. This is a pretty good objection,
especially since librsync uses md5 instead of the stronger SHA1.

But in rdiff-backup the most an attacker could do would be to produce
a file that had the same signature, causing rdiff-backup to skip over
parts of the file. But an attacker can already do much better than
that: just use utime(), chmod, and such to make it look like your file
hasn't changed. Plus this method doesn't take years of research.

So anyway, rdiff-backup isn't intended to defeat an attacker trying to
make it look like his files haven't changed.

[/quote]My impression was the write-up was talking about an inadvertant
hash-collision, not hackers.

Since the hashing process is lossy (ie. non-reversable), then it is
possible that two totally different data sets could generate the same
hash, and in turn invalidate the backup checksum check.

I don't know what the odds are of this happening with rdiff-backup.

I assume that they are exceedingly small, but not zero.

[quote]
I did find this quote from the article interesting:

[quote]BitKeeper keeps state about each file under source control and knows
what changes have been made since the last time each tree was
synchronized. When synchronizing, it sends only the differences
since the last synchronization occurred, in compressed form. In
comparison to rsync, BitKeeper provides similar and sometimes better
bandwidth usage when simply synchronizing two trees without
resorting to compare-by-hash.
[/quote]
Does anyone know how BitKeeper can send only the differences without
using a compare-by-hash algorithm? I don't see how this is possible
unless BitKeeper has abandoned the usual edit w/arbitrary editor and
then checkin scheme.

[/quote]I don't how bitkeeper does it, but I am familiar with drbd. A network
based disk mirroring solution.

With drbd, if the network/cluster is up, then all writes are mirrored
immediately.

If the network is down, a dirty table is maintained. Anytime a write is
done, the dirty table is flagged.

Then, when communication is re-established only the dirty blocks are
syncronized.

LVM snapshots also maintain a dirty block table (or Copy-On-Write
table).

It is obviously very non-trivial, but the ultimate way for a
rdiff-backup tool to work would be to somehow tie into a block level
dirty table. The table would then be cleared when the block is backed
up.

As I said, LVM already has the concept of a dirty block table, so it is
not a totally crazy idea.

Anybody need a doctorial thesis project?

Greg
--
Greg Freemyer
Interesting write-up of compare-by-hash.
January 28, 2004 02:16PM
Replying to Ben Escoto:
[quote][quote]BitKeeper keeps state about each file under source control and knows
what changes have been made since the last time each tree was
synchronized. When synchronizing, it sends only the differences
since the last synchronization occurred, in compressed form. In
comparison to rsync, BitKeeper provides similar and sometimes better
bandwidth usage when simply synchronizing two trees without
resorting to compare-by-hash.
[/quote]
Does anyone know how BitKeeper can send only the differences without
using a compare-by-hash algorithm? I don't see how this is possible
unless BitKeeper has abandoned the usual edit w/arbitrary editor and
then checkin scheme.
[/quote]
I know. bk is a scm. It stores data in sequential format, change by
change. checkpoints is placed between checkins. so, on bk pulls we are
searching for changeset key in history, backwards from present, and
then applying them all in sequential order.

It should be compared with cvs rather than rsync.

--
Paul P 'Stingray' Komkoff Jr // http://stingr.net/key <- my pgp key
This message represents the official view of the voices in my head
Interesting write-up of compare-by-hash.
January 28, 2004 05:33PM
[quote]My impression was the write-up was talking about an inadvertant
hash-collision, not hackers.

Since the hashing process is lossy (ie. non-reversable), then it is
possible that two totally different data sets could generate the same
hash, and in turn invalidate the backup checksum check.

I don't know what the odds are of this happening with rdiff-backup.
[/quote]
The presence or absence of collision is a property of the hash function
used - sha1, md5, whatever - and not 'rdiff-backup'. Actually, any
hash function will have infinitely many collisions. The proof is an
exercise for the reader ...

[quote]
I assume that they are exceedingly small, but not zero.
[/quote]
Over the entire domain of a hash function, the probability is more
like one than zero. Commonly used hashes are commonly used because they
have infrequent collision over their practical domain. That's an empirical
observation, however, not established, as far as I know, by any sort
of proof.

David S.

[quote][/quote]
Interesting write-up of compare-by-hash.
January 28, 2004 06:28PM
David S. writes:
[quote][quote]My impression was the write-up was talking about an inadvertant
hash-collision, not hackers.

Since the hashing process is lossy (ie. non-reversable), then it is
possible that two totally different data sets could generate the same
hash, and in turn invalidate the backup checksum check.

I don't know what the odds are of this happening with rdiff-backup.
[/quote]
The presence or absence of collision is a property of the hash function
used - sha1, md5, whatever - and not 'rdiff-backup'. Actually, any
hash function will have infinitely many collisions. The proof is an
exercise for the reader ...

[quote]
I assume that they are exceedingly small, but not zero.
[/quote]
Over the entire domain of a hash function, the probability is more
like one than zero. Commonly used hashes are commonly used because they
have infrequent collision over their practical domain. That's an empirical
observation, however, not established, as far as I know, by any sort
of proof.

[/quote]
No, it is very small and typically easy to estimate. Most hash functions are
designed to be close to uniform, i. e. the frequency of any particular hash
value is the same as all others. The result is that for e.g. MD5 which
produces a 128-bit integer the probability that two messages chosen at random
will have the same hash is very close to 2^-128 ~ 3 * 10^-39.

--
Robert L. Knighten
Robert < at > Knighten.org
Interesting write-up of compare-by-hash.
January 30, 2004 02:10PM
[quote][quote][quote][quote][quote]Greg Freemyer <freemyer-ml < at > NorcrossGroup.com>
wrote the following on Wed, 28 Jan 2004 14:08:05 -0500
[/quote][/quote][/quote][/quote][/quote]
[quote]Since the hashing process is lossy (ie. non-reversable), then it is
possible that two totally different data sets could generate the same
hash, and in turn invalidate the backup checksum check.

I don't know what the odds are of this happening with rdiff-backup.

I assume that they are exceedingly small, but not zero.
[/quote]
For rdiff (and thus rdiff-backup) it actually depends on the number of
blocks in the file, because there is no global sha1 or md5 hash. So a
2GB file that has all 2GB changed is more likely to cause a hash
collision than a changed 1k file.

I remember asking Donovan Baarda about this on the librsync list a while
ago, so if anyone is curious for more details they can look that up. The
upshot IIRC is that (for "random data") the odds of a collision are
around 2^-50 even for fairly large files. This isn't as good as a 128
global hash, but quite reasonable for practical use.

--
Ben Escoto
Sorry, only registered users may post in this forum.

Click here to login