Feature requests questions/discussion
Post Feature requests questions/discussion 
Ben Escoto,

I would like to discuss two feature requests. I'd like your input on the matter.
It's quite some text, so I hope you'll bear with me :)

First, I would really like an option like --store-checksums so that rdiff-backup
calculates md5 hashes when doing a backup, and so that the checksum is used for
integrity checks upon restoration. But at restoration time, the check should not
be optional; it should always be done if the file being restored has hash
info. This is to prevent user mistakes. I once severely corrupted a partition on an
external HD because of USB2 transfer errors (which I haven't been able to solve,
BTW). A lot was damaged, but for every file that resided in a compressed archive I
was told about the corruption, because of the hashing usually done in
zips/gzips/etc. The rdiff-backup repository became useless, because it had no
idea the files were damaged. Such an option would of course be annoying to most
people, because it's quite slow, but most of my backups are done through cron,
so it doesn't matter for me. I think as an option, this could be very valuable.
It doesn't even have to be that slow, if it's cleverly integrated with the copy
routine.

The other thing I'd like to discuss is how rdiff-backup detects change. I noted
earlier in this list that mtime+size checking, which rdiff-backup does IIRC, is
not very reliable. Mtimes can be changed. For example, when I install a new GCC
on Gentoo, the package manager looks for hardcoded filepaths in a whole bunch of
files on the system and changes them to reflect the newly installed GCC. Portage
(the Gentoo package manager) uses the mtimes of files to determine if the file still
belongs to a package. So, when you uninstall a package and some file has a
different mtime than is stored in the meta-file, it is assumed that this is a
new file, meaning it doesn't belong to the package, and is not uninstalled. Now,
about those hardcoded filepaths. When Portage changes them, the mtimes of the
files are also changed. I don't know if Portage then restores the mtimes back
to what they were, to avoid orphaned files, but it should. And if it doesn't, it
may in the future. If it does, and the file sizes happen to stay the same, then
when you run rdiff-backup again on your system, those files are not detected as
changed and they are left alone. This is of course not desirable behaviour.

A different way of checking for change would be checking the ctimes. But, this
of course has the problem that not all filesystems have ctimes. And, when you
restore your backup to a new disk and run rdiff-backup again, the entire system
is considered as changed. This is not very ideal.

A different approach would be using the checksums feature described above.
Rdiff-backup could calculate a hash of every file (or perhaps only of those
files with unchanged mtimes, because when the mtime has changed, the file needs to
be backed up anyway) and use that for change comparison. This of course has the
disadvantage of yet more slowdown, because now even if little has changed in
what you're backing up, its contents are read completely. But, perhaps this
behaviour could also reside under an option, an option besides the
--store-checksums, like --checksum-diffs (with the latter requiring the former
to be present, for example).

Summarized, --store-checksums would calculate checksum info for integrity
checks, and --checksum-diffs would use checksums for change-detections, instead
of mtime+size.
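
To make the difference between the two detection methods concrete, here is a minimal Python sketch. The record layout (a dict with mtime, size and sha1 keys) and all function names are made up for illustration; this is not rdiff-backup's actual code, and SHA-1 is only one possible choice of hash.

```python
import hashlib
import os


def file_sha1(path, chunk_size=1 << 16):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path):
    """Hypothetical per-file record as it might be stored at backup time."""
    st = os.stat(path)
    return {
        "mtime": st.st_mtime_ns // 10**9,  # whole seconds
        "size": st.st_size,
        "sha1": file_sha1(path),
    }


def changed_by_stat(path, recorded):
    """mtime+size check: cheap, but fooled when the mtime is put back."""
    st = os.stat(path)
    current = (st.st_mtime_ns // 10**9, st.st_size)
    return current != (recorded["mtime"], recorded["size"])


def changed_by_checksum(path, recorded):
    """Content check: only hashes files that the stat check lets through."""
    if changed_by_stat(path, recorded):
        return True  # already known to be changed, no need to read it
    return file_sha1(path) != recorded["sha1"]
```

A file that is rewritten with the same length and then has its old mtime restored passes changed_by_stat() as unchanged, while changed_by_checksum() still catches it, at the price of reading the whole file.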

I'm very curious to find out if you find my requests valid.

Regards,

Wiebe Cazemier

Post Feature requests questions/discussion 
Wiebe Cazemier <halfgaar < at > gmail.com>
wrote the following on Fri, 21 Oct 2005 17:48:47 +0200

First, I would really like an option like --store-checksums so that
rdiff-backup calculates md5 hashes when doing a backup, and so that
the checksum is used for integrity checks upon restoration. But at
restoration time, the check should not be optional; it should
always be done if the file being restored has hash info. This is to
prevent user mistakes. I once severely corrupted a partition on an
external HD because of USB2 transfer errors (which I haven't been
able to solve, BTW). A lot was damaged, but for every file that
resided in a compressed archive I was told about the corruption,
because of the hashing usually done in zips/gzips/etc. The
rdiff-backup repository became useless, because it had no idea the
files were damaged. Such an option would of course be annoying to
most people, because it's quite slow, but most of my backups are
done through cron, so it doesn't matter for me. I think as an
option, this could be very valuable. It doesn't even have to be that
slow, if it's cleverly integrated with the copy routine.

It's a good idea, and one that someone else has suggested before. The
checksums would be stored in the mirror-metadata file. I don't even
think it would be hard to implement. And there could be a --verify
switch to go through the repository and make sure everything checksums
correctly.
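
As a rough idea of what such a --verify pass could look like, here is a small Python sketch. It assumes the stored checksums have already been read into a plain dict mapping relative paths to SHA-1 hex digests; that dict, and the names sha1_of and verify_tree, are hypothetical and do not reflect the real mirror-metadata format.

```python
import hashlib
import os


def sha1_of(path, chunk_size=1 << 16):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tree(root, expected):
    """Return the sorted relative paths that are missing or do not match.

    `expected` maps relative path -> recorded hex digest (an assumed layout).
    """
    bad = []
    for rel_path, want in expected.items():
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full) or sha1_of(full) != want:
            bad.append(rel_path)
    return sorted(bad)
```

An empty result means every recorded file is present and checksums correctly; anything else is a list of files to worry about.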

Summarized, --store-checksums would calculate checksum info for
integrity checks, and --checksum-diffs would use checksums for
change-detections, instead of mtime+size.

Another good suggestion I think, which has come up before. You
mentioned ctime before, I was going to add in ctime checking but there
was some complication (I forget what) and it never got in.

Does anyone else think they would use Wiebe's --checksum-diff option?


--
Ben Escoto

Post Feature requests questions/discussion 
Ben Escoto <ben < at > emerose.org> writes:

Another good suggestion I think, which has come up before. You
mentioned ctime before, I was going to add in ctime checking but there
was some complication (I forget what) and it never got in.

Does anyone else think they would use Wiebe's --checksum-diff
option?

Sounds like a very good tool for a system integrity check after a
breakin or other security problem.

\EF
--
Erik Forsberg OpenSource-based Thin Client Technology
Cendio AB Phone: +46-13-21 46 00
Web: http://www.thinlinc.com

Post Feature requests questions/discussion 
Erik Forsberg wrote:
Does anyone else think they would use Wiebe's --checksum-diff
option?


Sounds like a very good tool for a system integrity check after a
breakin or other security problem.

Actually, that would be the --verify option.

The --checksum-diff option I suggested uses the actual contents of the file to
determine if it has changed, instead of using the less reliable mtime+size
combination. In my opinion, that is a mandatory requirement for having an accurate image
of your entire OS installation. I hope others think so too.

Post Feature requests questions/discussion 
Ben Escoto wrote:
It's a good idea, and one that someone else has suggested before. The
checksums would be stored in the mirror-metadata file. I don't even
think it would be hard to implement.

Do you think it's possible to combine it with the copy syscall/API-call
rdiff-backup probably uses so that the data which is read by copying is
checksummed at the same time? If it is, it wouldn't even take more time to
backup a new file than it would without checksums.
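
What I have in mind is something like the following Python sketch, where each chunk is fed to the hash on its way to the destination so the source is read only once. The function name and the choice of SHA-1 are assumptions of mine for the example; I don't know what rdiff-backup's copy routine actually looks like.

```python
import hashlib


def copy_with_digest(src_path, dest_path, algorithm="sha1", chunk_size=1 << 16):
    """Copy src to dest and return the hex digest of the bytes copied."""
    digest = hashlib.new(algorithm)
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # hash the same buffer we are about to write
            dest.write(chunk)
    return digest.hexdigest()
```

The extra cost is then CPU time for the hash only, not a second pass over the disk.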

And there could be a --verify
switch to go through the repository and make sure everything checksums
correctly.

Indeed, I even forgot this.

You
mentioned ctime before, I was going to add in ctime checking but there
was some complication (I forget what) and it never got in.

If --checksum-diffs would make it, I wouldn't need ctime checking. But I guess
there is value for such an option (speed being a definite advantage), as a third
change-detection method. But it won't be widely usable, I guess. I wouldn't be
surprised if NFS or Samba don't supply correct ctimes. And what about FAT32?
Doing ls -lc does reveal different times than a normal ls -l, but do that FS,
and the Windows driver which controls it, properly set ctimes?

Post Feature requests questions/discussion 
Erik Forsberg wrote:
Ah. OK.

On the other hand - a filesystem verify functionality that is based on
the MD5 (or other appropriate checksum) sum would perhaps also be a
good idea, if it would run faster than the existing --verify. I don't
know how --verify works, so this may not be the case.

Existing --verify? The --verify Ben suggested would _be_ one that uses the
checksum... There is no --verify at present.

Post Feature requests questions/discussion 
Wiebe Cazemier <halfgaar < at > gmail.com> writes:

Erik Forsberg wrote:
Does anyone else think they would use Wiebe's --checksum-diff
option?
Sounds like a very good tool for a system integrity check after a
breakin or other security problem.

Actually, that would be the --verify option.

Ah. OK.

On the other hand - a filesystem verify functionality that is based on
the MD5 (or other appropriate checksum) sum would perhaps also be a
good idea, if it would run faster than the existing --verify. I don't
know how --verify works, so this may not be the case.

The --checksum-diff option I suggested uses the actual contents of the
file to determine if it has changed, instead of using the less
reliable mtime+size combination. In my opinion, that is a mandatory
requirement for having an accurate image of your entire
OS installation. I hope others think so too.

Sounds good as well.

\EF
--
Erik Forsberg OpenSource-based Thin Client Technology
Cendio AB Phone: +46-13-21 46 00
Web: http://www.thinlinc.com

Post Feature requests questions/discussion 
Wiebe Cazemier <halfgaar < at > gmail.com> writes:

Erik Forsberg wrote:
Ah. OK. On the other hand - a filesystem verify functionality that
is based on
the MD5 (or other appropriate checksum) sum would perhaps also be a
good idea, if it would run faster than the existing --verify. I don't
know how --verify works, so this may not be the case.

Existing --verify? The --verify Ben suggested would _be_ one that uses
the checksum... There is no --verify at present.

Oh. I'm sorry. Obviously I don't know what I'm talking about right
now, so just disregard me :)

Checking the manpage, I think I'm talking about the --compare option.

\EF
--
Erik Forsberg OpenSource-based Thin Client Technology
Cendio AB Phone: +46-13-21 46 00
Web: http://www.thinlinc.com

Post Feature requests questions/discussion 
On Tue, 25 Oct 2005, Ben Escoto wrote:

It's a good idea, and one that someone else has suggested before. The
checksums would be stored in the mirror-metadata file. I don't even
think it would be hard to implement. And there could be a --verify
switch to go through the repository and make sure everything checksums
correctly.

that would be cool... i'd use that.


Summarized, --store-checksums would calculate checksum info for
integrity checks, and --checksum-diffs would use checksums for
change-detections, instead of mtime+size.

Another good suggestion I think, which has come up before. You
mentioned ctime before, I was going to add in ctime checking but there
was some complication (I forget what) and it never got in.

Does anyone else think they would use Wiebe's --checksum-diff option?

in many cases for me this would cost too much on the machine being backed
up (lots of disk seeks and cpu time) ... but maybe i could use it
periodically on weekends...

hey, did you know there's actually nanosecond resolution to [acm]time on
linux 2.6? (and on several BSDs i think) i don't know if the interfaces
show up in python -- but the C structure elements are
st_atimensec/st_ctimensec/st_mtimensec, and the utimes(2) syscall can set
nanosecond resolution timestamps.

-dean

Post Feature requests questions/discussion 
Wiebe Cazemier <halfgaar < at > gmail.com>
wrote the following on Tue, 25 Oct 2005 14:00:30 +0200

Do you think it's possible to combine it with the copy
syscall/API-call rdiff-backup probably uses so that the data which
is read by copying is checksummed at the same time? If it is, it
wouldn't even take more time to backup a new file than it would
without checksums.

Yes, I just checked some patches into the unstable tree which do just
that. So right now each mirror_metadata entry for a regular file has
an "SHA1Digest" field with the 40 character hex digest in it. Only
hash writing has been added--it doesn't actually do anything with the
information yet.

I'm not sure about a speed penalty, but I just recently realized the
biggest drawback of writing hashes. Hash data is incompressible, and
so a 160 bit hash like SHA1 will add at least 20 bytes (and probably
more) per regular file to the size of the compressed mirror_metadata
file.
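
The effect is easy to reproduce with a few lines of Python. This is only an illustration using zlib on a batch of synthetic digests, not a measurement of real mirror_metadata files; the function name and the sample size are arbitrary.

```python
import hashlib
import zlib


def compressed_bytes_per_digest(n=10000):
    """Return average compressed size per SHA-1 digest, as (raw, hex)."""
    digests = [hashlib.sha1(str(i).encode()).digest() for i in range(n)]
    raw = b"".join(digests)                              # 20 bytes each
    hexed = "".join(d.hex() for d in digests).encode()   # 40 hex chars each
    return (len(zlib.compress(raw, 9)) / n,
            len(zlib.compress(hexed, 9)) / n)


if __name__ == "__main__":
    raw_cost, hex_cost = compressed_bytes_per_digest()
    print(f"raw digest: {raw_cost:.1f} bytes per file after compression")
    print(f"hex digest: {hex_cost:.1f} bytes per file after compression")
```

The 40 character hex form shrinks, since each character carries only 4 bits, but it cannot go below the 20 bytes of actual hash data per file.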

At least for my usage, this approximately triples the size of the
mirror_metadata file. I like to keep about a year's worth of backups
of my files, and I have about a million files. So adding the hashes
would turn each of my mirror_metadata files from 12MB as they are now
to 32MB+. Over a year that would cost about 8GB.
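
As a back-of-the-envelope check, here is the arithmetic in Python. The 12MB and 32MB figures are the ones above; the one-session-per-day schedule is purely an assumption made for the example, and the function name is made up.

```python
def yearly_metadata_mb(per_session_mb, sessions_per_year=365):
    """Total mirror_metadata size kept for a year of sessions, in MB."""
    return per_session_mb * sessions_per_year


# Assumption: one backup session per day, every session kept for a year.
without_hashes = yearly_metadata_mb(12)   # 4380 MB
with_hashes = yearly_metadata_mb(32)      # 11680 MB
extra = with_hashes - without_hashes      # 7300 MB, roughly 7.3 GB

if __name__ == "__main__":
    print(f"extra metadata per year: about {extra / 1000:.1f} GB")
```

Under that assumption the extra space comes to a bit over 7GB, and more if the files end up larger than 32MB.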

I was assuming before I would just turn hashing on and not expose any
other option. But with this tradeoff I think we need to give people
the option. So what do you think the default should be? Keep the
hashes and triple the size of the mirror_metadata file?


--
Ben Escoto

Post Feature requests questions/discussion 
On Thu, 27 Oct 2005, Ben Escoto wrote:

At least for my usage, this approximately triples the size of the
mirror_metadata file. I like to keep about a year's worth of backups
of my files, and I have about a million files. So adding the hashes
would turn each of my mirror_metadata files from 12MB as they are now
to 32MB+. Over a year that would cost about 8GB.

you could support increments of the (uncompressed) mirror_metadata :)

unfortunately that would suck in so many ways... i've edited my
mirror_metadata files by hand a number of times and would hate to lose
that.

-dean

Post Feature requests questions/discussion 
dean gaudet <dean-list-rdiff-backup < at > arctic.org>
wrote the following on Tue, 25 Oct 2005 11:18:58 -0700 (PDT)

hey, did you know there's actually nanosecond resolution to
[acm]time on linux 2.6? (and on several BSDs i think) i don't know
if the interfaces show up in python -- but the C structure elements
are st_atimensec/st_ctimensec/st_mtimensec, and the utimes(2)
syscall can set nanosecond resolution timestamps.

Hmm what is the significance of this? rdiff-backup currently just
rounds times to the nearest second. Even that is too fine on some
systems -- someone reported some file system where the file's time was
only reported to the nearest 2 seconds (if you kept statting it you'd
get inconsistent values rounded to the nearest second). I think they
sent in a patch which added a --time-granulatity option.

Anyway it seems unlikely we are going to miss many new files because
they get changed less than 1 second after rdiff-backup processes them.
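
For reference, current Python versions do expose the nanosecond fields as st_mtime_ns and friends on the result of os.stat(). The sketch below shows one way a coarse-granularity comparison could be written on top of that; the function names and the tolerance-based approach are illustrative assumptions, not what rdiff-backup does internally.

```python
import os

NS_PER_SECOND = 1_000_000_000


def mtime_ns(path):
    """Modification time as an integer number of nanoseconds."""
    return os.stat(path).st_mtime_ns


def same_mtime(a_ns, b_ns, granularity_s=1):
    """Treat two timestamps as equal if they differ by less than the granularity."""
    return abs(a_ns - b_ns) < granularity_s * NS_PER_SECOND
```

With granularity_s=2, a source mtime of 3.5 seconds and a stored mtime of 4 seconds compare as equal, which is the kind of slack a file system with 2-second timestamps needs.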


--
Ben Escoto

Post Feature requests questions/discussion 
dean gaudet <dean-list-rdiff-backup < at > arctic.org>
wrote the following on Wed, 26 Oct 2005 23:50:50 -0700 (PDT)

you could support increments of the (uncompressed) mirror_metadata :)

unfortunately that would suck in so many ways... i've edited my
mirror_metadata files by hand a number of times and would hate to
lose that.

The original plan was to do just that (the metadata files are named
.snapshot, so it would be easy to add .diff versions and still preserve
the convention). But then I think I either lost interest or liked the
simplicity of the current system.

We wouldn't have to use librsync to compute the diffs. Instead each
mirror_metadata diff could just be the compressed text of the records
that have changed in the next mirror_metadata file. So things could
still be hand-edited.
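
A rough Python sketch of that idea follows. It treats each mirror_metadata file as a dict from path to record text, which is a simplification made up for the example, and it leaves out the compression step and the real record syntax entirely.

```python
DELETED = None  # hypothetical marker: the path did not exist in the older session


def make_diff(older, newer):
    """Return only the records needed to turn `newer` back into `older`."""
    diff = {}
    for path, record in older.items():
        if newer.get(path) != record:
            diff[path] = record        # changed, or absent from the newer file
    for path in newer:
        if path not in older:
            diff[path] = DELETED       # created after the older session
    return diff


def apply_diff(newer, diff):
    """Rebuild the older metadata from the newer file plus the diff."""
    result = dict(newer)
    for path, record in diff.items():
        if record is DELETED:
            result.pop(path, None)
        else:
            result[path] = record
    return result
```

Unchanged records never appear in the diff, so a session where little changed costs little, and the entries that do appear are ordinary record text.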

Another problem is that we would have to rewrite the mirror_metadata
file each time. If there was a bug in rdiff-backup or a disk
corruption, and the main file was lost, we would lose all the metadata
information. I don't have any specific worry in mind here, it just
makes me uneasy.


--
Ben Escoto

Post Feature requests questions/discussion 
Ben Escoto wrote:
Yes, I just checked some patches into the unstable tree which do just
that. So right now each mirror_metadata entry for a regular file has
an "SHA1Digest" field with the 40 character hex digest in it. Only
hash writing has been added--it doesn't actually do anything with the
information yet.

I'm not sure about a speed penalty, but I just recently realized the
biggest drawback of writing hashes. Hash data is incompressible, and
so a 160 bit hash like SHA1 will add at least 20 bytes (and probably
more) per regular file to the size of the compressed mirror_metadata
file.

At least for my usage, this approximately triples the size of the
mirror_metadata file. I like to keep about a year's worth of backups
of my files, and I have about a million files. So adding the hashes
would turn each of my mirror_metadata files from 12MB as they are now
to 32MB+. Over a year that would cost about 8GB.

I was assuming before I would just turn hashing on and not expose any
other option. But with this tradeoff I think we need to give people
the option. So what do you think the default should be? Keep the
hashes and triple the size of the mirror_metadata file?

Without the hashes, the 8 GB of metadata would be a little under 3 GB. In my
opinion, that doesn't matter much, but I tend to ignore that some people just
don't have the resources to spare. But a difference of 5 GB for a year? One
question you can ask, is "is your data worth the investment of a bit of HD
space". In my opinion, this can be standard behaviour, not available under an
option.

A disadvantage of options is that it is possible to forget to supply one or two
sometimes. I do my backups from scripts, but what if someone does not? What will
happen if you forget --store-checksums? Will you have an archive with files of
which some are checksummed and some are not? Or will rdiff-backup simply enable
checksums if you're backing up to an archive which already has them?

In the end, my opinion is this: either make it standard behaviour, or when
enabling it with an option, make sure when enabled for the first backup,
subsequent backups have it enabled implicitly. Or, when first disabled and then
sometime later enabled, keep it enabled for subsequent backups as well.
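
In code, the "once on, stays on" behaviour could be as simple as the Python sketch below. The marker file name and the function are hypothetical and only illustrate the rule; I'm not suggesting this is how the repository should actually record the setting.

```python
import os

MARKER_NAME = "checksums_enabled"  # hypothetical marker file in the repository


def checksums_active(repo_dir, flag_given):
    """Return True if this session should store checksums.

    Passing the option once leaves a marker behind, so later sessions
    keep hashing even when the option is forgotten.
    """
    marker = os.path.join(repo_dir, MARKER_NAME)
    if flag_given and not os.path.exists(marker):
        with open(marker, "w"):
            pass  # an empty file is enough to remember the choice
    return os.path.exists(marker)
```

That way an archive can never end up with checksums for some sessions and not for later ones just because someone typed the command by hand.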

Perhaps these issues justify just having it enabled all the time.

BTW, what is your plan for when one upgrades to an rdiff-backup
version which enables checksums? Will an archive be backwards compatible with
older versions?

Post Feature requests questions/discussion 
Ben Escoto wrote:
Another problem is that we would have to rewrite the mirror_metadata
file each time. If there was a bug in rdiff-backup or a disk
corruption, and the main file was lost, we would lose all the metadata
information. I don't have any specific worry in mind here, it just
makes me uneasy.

This is a genuine concern. Remember I corrupted my external HD severely with USB
errors? That sort of thing can happen.

Just two days ago, I had a similar situation. My computer suddenly became very
unstable, probably because I did rmmod -f on a module which I couldn't get
un-used. At first I suspected my HD, so I made a backup as soon as I could. It
crashed as well. Later, when I diagnosed my disc and restarted my system, I ran
a new backup. I was very glad the failed one got rolled back.

I know you can save a temporary mirror_metadata as a working version, but if
rdiff-backup had crashed writing the temp version over the original one, you are
bust. Or, power can fail suddenly. Anything can happen...
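
For the crash-while-overwriting case specifically, the usual trick is to never write over the original at all: write the new version to a temporary file in the same directory and then rename it into place. A minimal Python sketch, with a made-up function name, assuming a POSIX file system where rename is atomic:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the bytes are on disk first
        os.replace(tmp_path, path)   # atomic rename on POSIX file systems
    except BaseException:
        try:
            os.unlink(tmp_path)      # clean up the half-written temp file
        except FileNotFoundError:
            pass
        raise
```

This does nothing against the disk itself corrupting data afterwards, and a fully careful version would also fsync the directory, but it does cover a crash or power failure in the middle of the rewrite.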

Anyway, it's good to guard against the unexpected and improbable.
