Subscribe to Mailing Lists     FAQFAQ    SearchSearch      Register  Log in to check your private messagesLog in to check your private messages    Log inLog in 
These forums brought to you by Backup Central, where we also have the Mr. Backup Blog, Mailing Lists, FAQs,
and Directories of Backup Software and Hardware
more info on 25gig files

 
Post new topic   Reply to topic    Backup Central Forums Forum Index -> Rdiff-Backup
View previous topic :: View next topic  
Author Message
rsync2eran
Guest





PostPosted: Wed Jul 20, 2005 6:52 pm    Post subject: more info on 25gig files Reply with quote

Hi,

On 06/07/05 23:35, Donovan Baarda wrote:

Quote:
Yeah, but switching to OpenSSL's md4 sum will make it faster, without
making it any weaker.

OpenSSL's MD4 implementation is actually 1.30 times *slower* than its
MD5 implementation on my machine.


Quote:
Quote:
Quote:
Though it someone can demonstrate that a different hash has a
significantly better distribution over random data than md4, then maybe
it would be worth considering (ie, it would avoid accidental collisions
better).

I find this unlikely.

The crucial point is that for librsync, the important part of the hash
used is it's distribution, not it's "strength" (BTW, for those wanting
to convince me otherwise, I think the two are actually related).

Hash functions are usually evaluated rather extensively for their
distribution properties (in effect, and without much formal basis,
they're expected to be good pseudorandom generators). Had there been a
bias large enough to affect rsync noticably, it wouldn't have been
unnoticed.


Quote:
I wasn't that concerned about exotic cryptanalytic attacks, more about
random whole-file checksum collisions... particularly given the fact
that we are trying to use the whole-file checksum as a way of detecting
blocksum collisions... if the whole-file checksum is actually a checksum
of those blocksums...

But the metahash uses the *untruncated* hashes of the blocks, where each
side computes them from the actual data it reads or writes.

Quote:
and that the savings are worth it.

The sender's computational overhead for a straightforward full-file hash
on a mostly unchanged file would be around 20%-50%, depending on the
hash function. The metahash elimintates it.

Eran
Back to top
rsync2eran
Guest





PostPosted: Wed Jul 20, 2005 6:52 pm    Post subject: more info on 25gig files Reply with quote

Hi,

Donovan Baarda wrote:

Quote:
Ahh, you missed the worst bit... the new file is rolled through checking
blocks at each byte boundary, so you don't have n blocks for the new
file, you have z blocks where z is the size of the file in bytes, ie
2^32, not 2^22.

Yes, I had this in mind:

Quote:
Quote:
Quote:
In practice it is not quite that bad... as each match effectively skips
over the matching block's bytes, so it is really one block per
non-matching byte...

But this is not consistent with the assumption that the files are not
simialr (e.g., compressed or encrypted). So for those files the
situation is not distrubring, but rather frightening: a
one-in-a-thousand chance of corruption for the 4GB/1K parameters.

Quote:
Quote:
Quote:
ie the critical thing is not the file size, it's
the number of changed bytes. In the case of a large database file with a
small number of changes, you are usually in the clear.

Yes, for "usually" being lower-bounded by the 1/2^21 chance of an
intra-file collision (for the 4GB/1K parameters.


Quote:
Quote:
Quote:
It has got to be worth something... even if it only counts for 4 bits,
it still gives you 16 x the confidence.

ACK.

what? you don't think 1 in 16 million is better than 1 in 1 million ?

Oh, sure, I acknowedged its being 16 times better!


Quote:
Quote:
Quote:
You may be able to leave handling of signature collisions up to the
application by simply reporting them and giving the application the
oppertunity to confirm and restart or resume the signature calculation.

That's a strange question to ask the application... And often
unanswerable (e.g., for unseekable input).

Yeah, but it's up to the application to decide if it's un-answerable.

OK, just make sure things keep working reliably if the application can't
answer.


Quote:
Quote:
If you don't check those collision within librsync (as above), maybe
it's better to just keep assuming that colliding blocks are identical,
go through with the transfer, and then check the whole-file checksum. If
it doesn't match, tell the application to redo everything with a new
random seed (and perhaps larger hashes, like rsync does automatically).

The point is, if you get a collision in the signature, you may be able
to detect it and recover at signature generation time. For certain
applications this might be much better than finding out months later
when you try to apply the delta.

Hmm, right. That's an important point. I was thinking mostly of on-line
transfers (e.g., http://tromer.org/misc/rexecsync).


Quote:
Quote:
Sounds good, especially the variable seed part. Smile

Speaking of variable seeds, the best way to implement it elegantly and
without relying on a patched MD4 (or whatever) implementation is to use
salt, i.e., by prepending the random string to the file data and running
the usual hash.


Quote:
Quote:
* "rdiff signature" on large cached file: 109MB/sec
* OpenSSL MD5, 1KB blocks: 246MB/sec
8KB blocks: 326MB/sec
* OpenSSL SHA1, 1KB blocks: 148MB/sec
8KB blocks: 174MB/sec


You are not
comparing apples with apples here; the rdiff signature includes rollsums
etc, and probably a crappy md4sum implementation.

My point exactly: it follows that switching to OpenSSL's MD5 won't slow
things down compared to the current code.


Quote:
All the arguments for md5 or sha1 vs md4 revolve around extra protection
against malicious attacks that craft hash collisions. In the case of
librsync, the biggest threat from malicious attacks does not revolve
around vulnerabilities in md4sum, but from the fixed seed. Changing to a
variable seed will solve these,

I don't think RC4 was ever evaluated in the context of variable seeds.
But it is an extremely weak hash function, so I would not be surprised
if it turns out to be susceptible even to that. If the cost in
performance is negligible and compatibility is broken anyway, why not
use something less broken?


Quote:
Though it someone can demonstrate that a different hash has a
significantly better distribution over random data than md4, then maybe
it would be worth considering (ie, it would avoid accidental collisions
better).

I find this unlikely.


Quote:
Also, the "metahash" hash of all the blocksums worries me... it might
not be very reliable.

Merkle hash trees, are quite similar and well-analyzed by the
cryptographic community. In a Merkle hash tree you have a hash function
compressing a 2n-bit string to an n-bit string, and you compute the
n-bit hash of the whole file recursively: first you hash it into half
the size by applying the hash to each aligned 2n-bit block, then to
quarter of the size, etc. It has many wonderful uses. The only essential
difference compared to our meta-hash is that the adversary can affect
the partitioning of the raw data into chuns, but that shouldn't help him
if he can't control the hash of those chunks. But if you want to be
really calm in case of exotic cryptanalytic attacks, when computing the
hash of a chunk represented by a packet, just include the chunk offset
and chunk length in the input to the hash.

Eran
Back to top
rsync2eran
Guest





PostPosted: Wed Jul 20, 2005 6:52 pm    Post subject: more info on 25gig files Reply with quote

One more thing:

Donovan Baarda wrote:
Quote:
My gut feeling is an additionalwhole file hash calculated
alongside the blocksum hashes would not be a significant cost...
particularly compared to the rollsum+hashtable lookups at every byte.

When the files mostly match, "rdiff signature" and "rdiff delta" do
similar things and indeed run almost exactly as fast. So for that case
you can use the benchmarks given earlier to estimate the overhead.

Eran
Back to top
Display posts from previous:   
Post new topic   Reply to topic    Backup Central Forums Forum Index -> Rdiff-Backup All times are GMT - 8 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group
Magic SEO URL for phpBB