 |
Page 1 of 1
|
| Author |
Message |
rsync2eran
Guest
|
 more info on 25gig files
Hi,
Donovan Baarda wrote:
There was a long discussion on the rsync lists about the best heuristic
for blocksize and blocksum size (rsync also trims the strongsum size to
make the signature smaller, and it was getting blocksum collisions on
large files). I helped figure out the formula's for blocksize and
blocksum size that rsync now uses.
There's a related issue we discussed a while ago:
One thing that's different between rsync and naive use of librsync (as
done by rdiff and, I think, rdiff-backup) is that the latter doesn't
have a strong full-size integrity check. With librsync's default 4-byte
checksums, using a small block size on a large file makes the
probability of silent corruption disturbingly non-negligible.
Eran
|
| Fri Jul 01, 2005 10:44 am |
|
 |
Donovan Baarda
Guest
|
 more info on 25gig files
On Thu, 2005-06-30 at 07:37, rsync2eran < at > tromer.org wrote:
Hi,
Donovan Baarda wrote:
There was a long discussion on the rsync lists about the best heuristic
for blocksize and blocksum size (rsync also trims the strongsum size to
make the signature smaller, and it was getting blocksum collisions on
large files). I helped figure out the formula's for blocksize and
blocksum size that rsync now uses.
There's a related issue we discussed a while ago:
One thing that's different between rsync and naive use of librsync (as
done by rdiff and, I think, rdiff-backup) is that the latter doesn't
have a strong full-size integrity check. With librsync's default 4-byte
checksums, using a small block size on a large file makes the
probability of silent corruption disturbingly non-negligible.
Actually, librsync uses 8 byte strong checksums, which combined with the
4 byte rollsum, gives you 12 bytes of checksum for each block (but you
can only really count on 8 bytes for maliciously crafted data). This
makes it comfortingly negligable unless you have huge files and a
stupidly small blocksize.
The disturbing part is that any corruption that does occur is silently
ignored.
--
Donovan Baarda <abo < at > minkirri.apana.org.au>
|
| Fri Jul 01, 2005 10:44 am |
|
 |
rsync2eran
Guest
|
 more info on 25gig files
On 30/06/05 22:46, Donovan Baarda wrote:
One thing that's different between rsync and naive use of librsync (as
done by rdiff and, I think, rdiff-backup) is that the latter doesn't
have a strong full-size integrity check. With librsync's default 4-byte
checksums, using a small block size on a large file makes the
probability of silent corruption disturbingly non-negligible.
Actually, librsync uses 8 byte strong checksums, which combined with the
4 byte rollsum, gives you 12 bytes of checksum for each block (but you
can only really count on 8 bytes for maliciously crafted data). This
makes it comfortingly negligable unless you have huge files and a
stupidly small blocksize.
Sorry, I meant 8. But as for huge and stupid, how about, say, a 4GB file
and 1K block size? Doesn't seem unreasonable if you're backing up a
large database with many small local changes. That gives you 2^22
blocks, hence 2^44 pairs of blocks, hence (for random data) a
one-in-a-million chance of a corruption.
As for the rollsum, I don't trust it even for non-malicious data. It's
too easy to believe it will be fooled by some natural structure in the
data. And as for malicious settings, well, last time we considered those
we saw corruption can occur with probability 1...
The disturbing part is that any corruption that does occur is silently
ignored.
Yes. Do you think the integrity testing should be done by librsync, or
by every client app (with a prominent notice that they should)?
Eran
|
| Fri Jul 01, 2005 10:44 am |
|
 |
rsync2eran
Guest
|
 more info on 25gig files
Hi,
On 01/07/05 02:31, Donovan Baarda wrote:
The number of pairs for n blocks is n*n/2, or 2^43, so that's a one in
2^21 chance....
You're thinking about colliding block signatures in a single file; I was
thinking about syncing two files that look independently random-looking
(e.g., compressed and/or encrypted). Then you have n blocks for the new
file times n blocks for the old file, so no factor of two.
ie if you have 2 million 4GB files, you are likely to
get a signature md4sum collision in one. That might not be comfortable,
but it is probably not disturbing...
Then it seems our disturbance thresholds are different...
To put it another way, if you have 1000 users running a daily 4GB sync
over 3 years, it's likely that at least one of them will see a
corruption. For increased effect, change 4GB to the subject of this thread.
In practice it is not quite that bad... as each match effectively skips
over the matching block's bytes, so it is really one block per
non-matching byte... ie the critical thing is not the file size, it's
the number of changed bytes. In the case of a large database file with a
small number of changes, you are usually in the clear.
That's because of the heuristic that considers first the next block in
the old file, right? Yes, I guess this does improve things on most
workloads.
As for the rollsum, I don't trust it even for non-malicious data. It's
too easy to believe it will be fooled by some natural structure in the
data. And as for malicious settings, well, last time we considered those
It has got to be worth something... even if it only counts for 4 bits,
it still gives you 16 x the confidence.
ACK.
Yeah, it all goes out the window with malicious data... the only fix for
this is a variable seed for the signature... probably random is best.
Yes, that's if you want to increase the chances of a successful transfer
in malicious settings. Though just to prevent silent corruption, a
strong whole-file hash would suffice.
Note that adding a variable seed allows you to correct a blocksum
collision in the signature... if you detect one, just re-generate the
signature with a different seed.
Note that detecting and correcting this would require seek()
functionality on the input. You can't differentiate a matching blocksum
from a genuine collision or identical block without comparing the actual
block data.
To distinguish genuine collisions from identical blocks, you can keep
stronger hashes in memory (or temporary file) during delta generation.
Requiring seekable input would be quite a loss of functionality (e.g., I
use rdiff to sync huge backup tarballs piped directly from tar).
You may be able to leave handling of signature collisions up to the
application by simply reporting them and giving the application the
oppertunity to confirm and restart or resume the signature calculation.
That's a strange question to ask the application... And often
unanswerable (e.g., for unseekable input).
If you don't check those collision within librsync (as above), maybe
it's better to just keep assuming that colliding blocks are identical,
go through with the transfer, and then check the whole-file checksum. If
it doesn't match, tell the application to redo everything with a new
random seed (and perhaps larger hashes, like rsync does automatically).
The disturbing part is that any corruption that does occur is silently
ignored.
Yes. Do you think the integrity testing should be done by librsync, or
by every client app (with a prominent notice that they should)?
I'm not sure... I think it makes sense to put it into the signature.
Whether that is done by librsync or rdiff, I don't know. In either case,
it will require a change to the signature file.
If we are going to change the signature file, we may as well do it
right, adding both a whole-file checksum and a variable seed.
Sounds good, especially the variable seed part. :-)
Since that forces a change in librsync, maybe use the opportunity to
change from MD4 to a less broken hash? Just a quick check on my 2GHz
Athlon XP:
* "rdiff signature" on large cached file: 109MB/sec
* OpenSSL MD5, 1KB blocks: 246MB/sec
8KB blocks: 326MB/sec
* OpenSSL SHA1, 1KB blocks: 148MB/sec
8KB blocks: 174MB/sec
So upgrading to MD5 is almost for free, and SHA1 won't slow things down
noticeably either. With randomized IV, both (but especially SHA1) should
be OK in this context.
BTW, a strong full-file hash would have its own computational cost: both
the sender and the receiver would need to compute it in addition to the
per-block hashes. In principle, for mostly-unchanged files this cost can
be made negligible by replacing the full-file hash by a "meta-hash": as
you transmit or receive, note the hash of the data represented by each
packet, and compute the meta-hash as the hash of all these hashes. For a
literal data packets, the packet's hash is just the hash of its literal
content. But for a copy-block-from-old-file packet, the sender takes the
packet hash to be the (untruncated) hash of the *new* block, while the
receiver takes the packet hash to be the (untruncated) hash of the
copied *old* block -- both of these were computed anyway by the
respective parties! One catch is that, in the librsync framework, during
the delta stage the receiver no longer remembers the untruncated block
hashes computed at the delta stage, and if he needs to recompute them
then he doesn't get any benefit compared to plain full-file hashes. But
the sender (which is already doing the heavy lifting) does get the
meta-hash essentially for free.
The computation of these meta-hashes can be done only inside librsync,
so even if you prefer to initially implement full-file hashes in the
straightforward way, keeping the option open seems reason enough to do
the integrity checking inside librsync.
I'm now working at google, and am seriously considering making librsync
my 20% project... hopefully this means I will do some serious work on it
soon.
Ah, congratulations! Lucky Google!
Eran
|
| Fri Jul 01, 2005 10:44 am |
|
 |
Donovan Baarda
Guest
|
 more info on 25gig files
On Thu, 2005-06-30 at 13:13, rsync2eran < at > tromer.org wrote:
On 30/06/05 22:46, Donovan Baarda wrote:
[...]
Actually, librsync uses 8 byte strong checksums, which combined with the
4 byte rollsum, gives you 12 bytes of checksum for each block (but you
can only really count on 8 bytes for maliciously crafted data). This
makes it comfortingly negligable unless you have huge files and a
stupidly small blocksize.
Sorry, I meant 8. But as for huge and stupid, how about, say, a 4GB file
and 1K block size? Doesn't seem unreasonable if you're backing up a
large database with many small local changes. That gives you 2^22
blocks, hence 2^44 pairs of blocks, hence (for random data) a
one-in-a-million chance of a corruption.
The number of pairs for n blocks is n*n/2, or 2^43, so that's a one in
2^21 chance.... ie if you have 2 million 4GB files, you are likely to
get a signature md4sum collision in one. That might not be comfortable,
but it is probably not disturbing...
Note that the probability of a corruption is not just the probability of
getting a blocksum collision in the signature. The worst part is the
rollsum walks through the file, effectively trying a block at each
byte-boundary, so on a 4G file, it tries 4G blocks and matches them
against 2^22 1K blocksums, for 2^22 * 2^32 = 2^54 comparisons. This is a
one in a thousand chance of an md4sum collision... much more scary.
In practice it is not quite that bad... as each match effectively skips
over the matching block's bytes, so it is really one block per
non-matching byte... ie the critical thing is not the file size, it's
the number of changed bytes. In the case of a large database file with a
small number of changes, you are usually in the clear.
As for the rollsum, I don't trust it even for non-malicious data. It's
too easy to believe it will be fooled by some natural structure in the
data. And as for malicious settings, well, last time we considered those
It has got to be worth something... even if it only counts for 4 bits,
it still gives you 16 x the confidence.
we saw corruption can occur with probability 1...
Yeah, it all goes out the window with malicious data... the only fix for
this is a variable seed for the signature... probably random is best.
Note that adding a variable seed allows you to correct a blocksum
collision in the signature... if you detect one, just re-generate the
signature with a different seed.
Note that detecting and correcting this would require seek()
functionality on the input. You can't differentiate a matching blocksum
from a genuine collision or identical block without comparing the actual
block data.
You may be able to leave handling of signature collisions up to the
application by simply reporting them and giving the application the
oppertunity to confirm and restart or resume the signature calculation.
The disturbing part is that any corruption that does occur is silently
ignored.
Yes. Do you think the integrity testing should be done by librsync, or
by every client app (with a prominent notice that they should)?
I'm not sure... I think it makes sense to put it into the signature.
Whether that is done by librsync or rdiff, I don't know. In either case,
it will require a change to the signature file.
If we are going to change the signature file, we may as well do it
right, adding both a whole-file checksum and a variable seed.
I'm now working at google, and am seriously considering making librsync
my 20% project... hopefully this means I will do some serious work on it
soon.
--
Donovan Baarda <abo < at > minkirri.apana.org.au>
|
| Fri Jul 01, 2005 10:44 am |
|
 |
Donovan Baarda
Guest
|
 more info on 25gig files
On Fri, 2005-07-01 at 05:44, rsync2eran < at > tromer.org wrote:
Hi,
On 01/07/05 02:31, Donovan Baarda wrote:
The number of pairs for n blocks is n*n/2, or 2^43, so that's a one in
2^21 chance....
You're thinking about colliding block signatures in a single file; I was
thinking about syncing two files that look independently random-looking
(e.g., compressed and/or encrypted). Then you have n blocks for the new
file times n blocks for the old file, so no factor of two.
Ahh, you missed the worst bit... the new file is rolled through checking
blocks at each byte boundary, so you don't have n blocks for the new
file, you have z blocks where z is the size of the file in bytes, ie
2^32, not 2^22.
ie if you have 2 million 4GB files, you are likely to
get a signature md4sum collision in one. That might not be comfortable,
but it is probably not disturbing...
Then it seems our disturbance thresholds are different...
To put it another way, if you have 1000 users running a daily 4GB sync
over 3 years, it's likely that at least one of them will see a
corruption. For increased effect, change 4GB to the subject of this thread.
Yeah... it's still pretty bad...
In practice it is not quite that bad... as each match effectively skips
over the matching block's bytes, so it is really one block per
non-matching byte... ie the critical thing is not the file size, it's
the number of changed bytes. In the case of a large database file with a
small number of changes, you are usually in the clear.
That's because of the heuristic that considers first the next block in
the old file, right? Yes, I guess this does improve things on most
workloads.
No... not that... the fact that when you do find a match, you don't
attempt to find more matches for blocks at every byte offset in that
block.
The "favor next block" behaviour is something that would help avoid
problems when you have blocksum collisions in a signature, but it's not
worth relying on. I'm also not sure that it's actually implemented....
As for the rollsum, I don't trust it even for non-malicious data. It's
too easy to believe it will be fooled by some natural structure in the
data. And as for malicious settings, well, last time we considered those
It has got to be worth something... even if it only counts for 4 bits,
it still gives you 16 x the confidence.
ACK.
what? you don't think 1 in 16 million is better than 1 in 1 million ?
You may be able to leave handling of signature collisions up to the
application by simply reporting them and giving the application the
oppertunity to confirm and restart or resume the signature calculation.
That's a strange question to ask the application... And often
unanswerable (e.g., for unseekable input).
Yeah, but it's up to the application to decide if it's un-answerable.
If you don't check those collision within librsync (as above), maybe
it's better to just keep assuming that colliding blocks are identical,
go through with the transfer, and then check the whole-file checksum. If
it doesn't match, tell the application to redo everything with a new
random seed (and perhaps larger hashes, like rsync does automatically).
The point is, if you get a collision in the signature, you may be able
to detect it and recover at signature generation time. For certain
applications this might be much better than finding out months later
when you try to apply the delta.
The disturbing part is that any corruption that does occur is silently
ignored.
Yes. Do you think the integrity testing should be done by librsync, or
by every client app (with a prominent notice that they should)?
I'm not sure... I think it makes sense to put it into the signature.
Whether that is done by librsync or rdiff, I don't know. In either case,
it will require a change to the signature file.
If we are going to change the signature file, we may as well do it
right, adding both a whole-file checksum and a variable seed.
Sounds good, especially the variable seed part. :-)
Since that forces a change in librsync, maybe use the opportunity to
change from MD4 to a less broken hash? Just a quick check on my 2GHz
Athlon XP:
* "rdiff signature" on large cached file: 109MB/sec
* OpenSSL MD5, 1KB blocks: 246MB/sec
8KB blocks: 326MB/sec
* OpenSSL SHA1, 1KB blocks: 148MB/sec
8KB blocks: 174MB/sec
I'm unconvinced that a different signature is worth it. You are not
comparing apples with apples here; the rdiff signature includes rollsums
etc, and probably a crappy md4sum implementation.
All the arguments for md5 or sha1 vs md4 revolve around extra protection
against malicious attacks that craft hash collisions. In the case of
librsync, the biggest threat from malicious attacks does not revolve
around vulnerabilities in md4sum, but from the fixed seed. Changing to a
variable seed will solve these, and I don't think another hash would do
anything other than chew more CPU.
Though it someone can demonstrate that a different hash has a
significantly better distribution over random data than md4, then maybe
it would be worth considering (ie, it would avoid accidental collisions
better).
So upgrading to MD5 is almost for free, and SHA1 won't slow things down
noticeably either. With randomized IV, both (but especially SHA1) should
be OK in this context.
BTW, a strong full-file hash would have its own computational cost: both
the sender and the receiver would need to compute it in addition to the
per-block hashes. In principle, for mostly-unchanged files this cost can
be made negligible by replacing the full-file hash by a "meta-hash": as
you transmit or receive, note the hash of the data represented by each
packet, and compute the meta-hash as the hash of all these hashes. For a
literal data packets, the packet's hash is just the hash of its literal
content. But for a copy-block-from-old-file packet, the sender takes the
packet hash to be the (untruncated) hash of the *new* block, while the
receiver takes the packet hash to be the (untruncated) hash of the
copied *old* block -- both of these were computed anyway by the
respective parties! One catch is that, in the librsync framework, during
the delta stage the receiver no longer remembers the untruncated block
hashes computed at the delta stage, and if he needs to recompute them
then he doesn't get any benefit compared to plain full-file hashes. But
the sender (which is already doing the heavy lifting) does get the
meta-hash essentially for free.
[...]
Hmm. I'll have to think about this. My gut feeling is an additional
whole file hash calculated alongside the blocksum hashes would not be a
significant cost... particularly compared to the rollsum+hashtable
lookups at every byte.
Also, the "metahash" hash of all the blocksums worries me... it might
not be very reliable.
--
Donovan Baarda <abo < at > minkirri.apana.org.au>
|
| Fri Jul 01, 2005 6:30 pm |
|
 |
Donovan Baarda
Guest
|
 more info on 25gig files
On Mon, 2005-07-04 at 08:03, rsync2eran < at > tromer.org wrote:
[...]
Speaking of variable seeds, the best way to implement it elegantly and
without relying on a patched MD4 (or whatever) implementation is to use
salt, i.e., by prepending the random string to the file data and running
the usual hash.
In all the md4sum implementations I've seen, the struct used is not
opaque, so you can pre-seed by directly setting the A B C D values. I'm
not sure what is better...
* "rdiff signature" on large cached file: 109MB/sec
* OpenSSL MD5, 1KB blocks: 246MB/sec
8KB blocks: 326MB/sec
* OpenSSL SHA1, 1KB blocks: 148MB/sec
8KB blocks: 174MB/sec
You are not
comparing apples with apples here; the rdiff signature includes rollsums
etc, and probably a crappy md4sum implementation.
My point exactly: it follows that switching to OpenSSL's MD5 won't slow
things down compared to the current code.
Yeah, but switching to OpenSSL's md4 sum will make it faster, without
making it any weaker.
All the arguments for md5 or sha1 vs md4 revolve around extra protection
against malicious attacks that craft hash collisions. In the case of
librsync, the biggest threat from malicious attacks does not revolve
around vulnerabilities in md4sum, but from the fixed seed. Changing to a
variable seed will solve these,
I don't think RC4 was ever evaluated in the context of variable seeds.
But it is an extremely weak hash function, so I would not be surprised
if it turns out to be susceptible even to that. If the cost in
performance is negligible and compatibility is broken anyway, why not
use something less broken?
I'm unconvinced that the cost in performance is negligible. I'm also not
convinced that it is "extremely weak", particularly in the context of
our application.
Though it someone can demonstrate that a different hash has a
significantly better distribution over random data than md4, then maybe
it would be worth considering (ie, it would avoid accidental collisions
better).
I find this unlikely.
The crucial point is that for librsync, the important part of the hash
used is it's distribution, not it's "strength" (BTW, for those wanting
to convince me otherwise, I think the two are actually related).
Also, the "metahash" hash of all the blocksums worries me... it might
not be very reliable.
Merkle hash trees, are quite similar and well-analyzed by the
cryptographic community. In a Merkle hash tree you have a hash function
compressing a 2n-bit string to an n-bit string, and you compute the
n-bit hash of the whole file recursively: first you hash it into half
the size by applying the hash to each aligned 2n-bit block, then to
quarter of the size, etc. It has many wonderful uses. The only essential
difference compared to our meta-hash is that the adversary can affect
the partitioning of the raw data into chuns, but that shouldn't help him
if he can't control the hash of those chunks. But if you want to be
really calm in case of exotic cryptanalytic attacks, when computing the
hash of a chunk represented by a packet, just include the chunk offset
and chunk length in the input to the hash.
I wasn't that concerned about exotic cryptanalytic attacks, more about
random whole-file checksum collisions... particularly given the fact
that we are trying to use the whole-file checksum as a way of detecting
blocksum collisions... if the whole-file checksum is actually a checksum
of those blocksums...
I guess I'd need convincing that it's a reliable whole-file checksum,
and that the savings are worth it.
--
Donovan Baarda <abo < at > minkirri.apana.org.au>
|
| Wed Jul 06, 2005 12:38 pm |
|
 |
|
|
The time now is Sat Feb 11, 2012 5:49 am | All times are GMT - 8 Hours
|
Page 1 of 1
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|
|