Re: Redundancy between backups?
Post Re: Redundancy between backups? 
David Cantrell wrote:
Nathan Rosenquist wrote:
Also, not to spread FUD, but I'm almost positive that meta information
about the files (like timestamps, maybe even permissions, etc) might get
lost if 2 identical files with different meta information are merged.

They will get lost. EVERYTHING about a file apart from its name is
stored in the inode, so if two files share an inode they have to share
all that other stuff.

That said, this works surprisingly well in many cases. For instance,
if a bunch of machines have the same distribution and packages
installed, then it's pretty likely that their times and permissions
and whatnot are the same. I also have used this when I needed to
transition services between machines. I'll either cp -al the big data
on the backup server before the first rsnapshot, or after the
rsnapshot use a script to stitch everything back together.

Rather than calling stat() n^m times, I recommend calling it n times and
using join(",", (stat($file))[0, 2..10]) as the key in a hash of
arrayrefs. This assumes you have the memory available for a Very Big
data structure.
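
The one-stat-per-file grouping described above can be sketched roughly as
follows. This is a Python illustration, not the Perl from the thread, and the
field selection (device, mode, owner, group, size, mtime) is an assumption
that is narrower than the quoted stat slice, which also covers fields such as
the link count and access time. The function name group_candidates is
hypothetical.

```python
import os
import stat
from collections import defaultdict


def group_candidates(root):
    """Return lists of paths whose metadata matches but whose inodes differ."""
    groups = defaultdict(dict)  # metadata key -> {inode: first path seen}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # exactly one stat call per file
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other non-regular files
            # Assumed key: everything that must agree before two files
            # could share an inode without losing information.
            key = (st.st_dev, st.st_mode, st.st_uid, st.st_gid,
                   st.st_size, int(st.st_mtime))
            groups[key].setdefault(st.st_ino, path)
    # Only keys that map to more than one inode are worth a content compare.
    return [sorted(by_inode.values())
            for by_inode in groups.values() if len(by_inode) > 1]
```

Each returned list is only a set of candidates: the files agree on metadata
and size, so their contents still have to be compared before linking. As the
quoted text says, the whole table lives in memory.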

A trick I've seen (and later used) was to run a find across the
filesystem, emitting the inode, size, relevant timestamps,
permissions, owner, and path, into a giant file. Then sort that file
using sort, which is pretty efficient for sorting such a huge file,
compared to anything you're likely to whip up in Perl. _Then_ feed
that file to a Perl script which scans it looking for potential
matches (different inodes, but otherwise same information) and groups
them together, then runs checksums on them to winnow things down
another level. Lastly, hardlink anything which seems likely, possibly
with a byte-for-byte compare beforehand.

It sounds convoluted, but by using the external sort, the comparison
pass doesn't really need much memory, so it's pretty fast (compared to
the single-pass "collect all the information and process" version).
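
A rough sketch of the comparison pass, assuming the listing was produced with
GNU find and sorted in byte order, for example:

  find /backups -type f -printf '%s\t%T@\t%m\t%U\t%G\t%i\t%p\n' | LC_ALL=C sort > list.txt

The tab-separated field layout, the /backups path, the choice of SHA-256, and
the function names are all assumptions made for illustration (and Python
stands in for the Perl script mentioned above). Paths containing tabs or
newlines would break this format.

```python
import hashlib
from itertools import groupby


def parse(line):
    """Split one listing line into (metadata key, inode, path)."""
    size, mtime, mode, uid, gid, inode, path = line.rstrip("\n").split("\t", 6)
    return (size, mtime, mode, uid, gid), inode, path


def sha256_of(path):
    """Checksum a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def candidate_groups(sorted_lines):
    """Yield paths (one per distinct inode) that share all metadata fields."""
    records = (parse(line) for line in sorted_lines)
    # The input is sorted, so equal keys are adjacent and only one run
    # of matching records is held in memory at a time.
    for _key, run in groupby(records, key=lambda rec: rec[0]):
        by_inode = {}
        for _, inode, path in run:
            by_inode.setdefault(inode, path)
        if len(by_inode) > 1:
            yield list(by_inode.values())


def confirmed_groups(sorted_lines):
    """Winnow candidate groups down to files with identical checksums."""
    for paths in candidate_groups(sorted_lines):
        by_digest = {}
        for path in paths:
            by_digest.setdefault(sha256_of(path), []).append(path)
        for same in by_digest.values():
            if len(same) > 1:
                yield same
```

Calling confirmed_groups(open("list.txt")) yields lists of paths that could
be hardlinked together. The linking step, and any final byte-for-byte
compare, is deliberately left out of the sketch.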

-scott


_______________________________________________
rsnapshot-discuss mailing list
rsnapshot-discuss < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss

Post Re: Redundancy between backups? 
On Wednesday, 31 August 2005 02:35, Scott Hess wrote:
David Cantrell wrote:
Nathan Rosenquist wrote:
Also, not to spread FUD, but I'm almost positive that meta information
about the files (like timestamps, maybe even permissions, etc) might
get lost if 2 identical files with different meta information are
merged.

They will get lost. EVERYTHING about a file apart from its name is
stored in the inode, so if two files share an inode they have to share
all that other stuff.

That said, this works surprisingly well in many cases. For instance,
if a bunch of machines have the same distribution and packages
installed, then it's pretty likely that their times and permissions
and whatnot are the same. I also have used this when I needed to
transition services between machines. I'll either cp -al the big data
on the backup server before the first rsnapshot, or after the
rsnapshot use a script to stitch everything back together.

I've played with the idea of writing a metainfo dumper. It would use OS- and
FS-native tools to dump the metadata into one file per directory, so that the
(hardlinked) file content and the metainfo are kept separate. Since each
filesystem handles metadata differently, we would need a plugin architecture
that is easy to extend. This is like the TRANS.TBL files on old CDs. A backup
could then be done as it is today. Restoring is done in reverse order: first
the files are restored, then a second pass adjusts the metainfo.
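
To make the idea concrete, here is a minimal sketch of such a dumper for a
single directory. It only covers the POSIX fields Python exposes (mode, owner,
group, mtime), stores them as JSON, and has no plugin architecture. The file
name .metainfo.json and both function names are hypothetical.

```python
import json
import os
import stat

META_NAME = ".metainfo.json"  # hypothetical per-directory metadata file


def dump_metainfo(directory):
    """Record mode, owner, group and mtime of each regular file in directory."""
    entries = {}
    for name in sorted(os.listdir(directory)):
        if name == META_NAME:
            continue
        st = os.lstat(os.path.join(directory, name))
        if not stat.S_ISREG(st.st_mode):
            continue  # the sketch ignores subdirectories and special files
        entries[name] = {
            "mode": stat.S_IMODE(st.st_mode),
            "uid": st.st_uid,
            "gid": st.st_gid,
            "mtime_ns": st.st_mtime_ns,
        }
    with open(os.path.join(directory, META_NAME), "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=1, sort_keys=True)
    return entries


def restore_metainfo(directory):
    """Second restore pass: reapply the recorded metadata to restored files."""
    with open(os.path.join(directory, META_NAME), encoding="utf-8") as handle:
        entries = json.load(handle)
    for name, meta in entries.items():
        path = os.path.join(directory, name)
        try:
            os.chown(path, meta["uid"], meta["gid"])
        except PermissionError:
            pass  # changing ownership usually requires root
        os.chmod(path, meta["mode"])
        os.utime(path, ns=(meta["mtime_ns"], meta["mtime_ns"]))
```

Note that applying the metadata to a file that is still hardlinked to other
copies changes all of them, so a real restore would have to copy the content
out first.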


Rather than calling stat() n^m times, I recommend calling it n times and
using join(",", (stat($file))[0, 2..10]) as the key in a hash of
arrayrefs. This assumes you have the memory available for a Very Big
data structure.

A trick I've seen (and later used) was to run a find across the
filesystem, emitting the inode, size, relevant timestamps,
permissions, owner, and path, into a giant file. Then sort that file
using sort, which is pretty efficient for sorting such a huge file,
compared to anything you're likely to whip up in Perl. _Then_ feed
that file to a Perl script which scans it looking for potential
matches (different inodes, but otherwise same information) and groups
them together, then runs checksums on them to winnow things down
another level. Lastly, hardlink anything which seems likely, possibly
with a byte-for-byte compare beforehand.

It sounds convoluted, but by using the external sort, the comparison
pass doesn't really need much memory, so it's pretty fast (compared to
the single-pass "collect all the information and process" version).

This external sort can offer huge speed improvements. You just sort by inode.
That does part of the OS's work for it, but it drastically reduces seeks and
makes good use of readahead, especially on well-used filesystems. The change
in hard disk noise was impressive: from the usual loud scratching sounds to a
quiet tok tok tok.

Unfortunately rsync is one of the few monolithic applications, so we could
only do this by feeding rsync with the sorted worklist (--files-from).
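
A sketch of how such a worklist might be produced: walk the source tree once,
sort the regular files by inode number, and write the relative paths to a file
that can be handed to rsync with --files-from. The function names are
hypothetical, and rsync may reorder the list internally, so whether the
transfer actually follows inode order is an assumption that is not verified
here.

```python
import os
import stat


def inode_sorted_worklist(root):
    """Return paths of regular files under root, relative to root, by inode."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode):
                entries.append((st.st_ino, os.path.relpath(path, root)))
    entries.sort()  # ascending inode number, path as tie-breaker
    return [rel for _inode, rel in entries]


def write_worklist(root, out_path):
    """Write one relative path per line and return the number of entries."""
    paths = inode_sorted_worklist(root)
    with open(out_path, "w", encoding="utf-8") as handle:
        for rel in paths:
            handle.write(rel + "\n")
    return len(paths)
```

The resulting file would be used along the lines of
"rsync -a --files-from=worklist.txt /source/ /dest/", where the listed paths
are interpreted relative to the source directory.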

This would nicely fit with the proposed metainfo dump. But rsnapshot would
(have to) be a different application after that.

That's enough vaporware for today.

Johannes Nieß
