> On Wed, Aug 01, 2007 at 06:56:37PM +0200, Janek Kozicki wrote:
> > this one looks promising, I'm testing it now:
> > http://hardlinkpy.googlecode.com/svn/trunk/hardlink.py
> Cool. Please let us know how well it works!
It worked fine on a small sample: about 5 GB (a few selected directories).
Next I left it overnight to work on the whole /.snapshots/ (280 GB - it would
be 2500 GB if not for the hardlinks)
went past a single hourly.0 (it started with this one). It filled
about 600 MB of RAM with accumulated data about files, and I
interrupted it with ^C.
This program works in the following way (if I understand correctly, I'm
not good at python):
1 run a single recursive loop on all dirs/files
1.1 for each file store its size (and name, and time, and permissions,
depending on options that you pass to it) in RAM
1.2 check in RAM if there was any other file scanned before with the
same size (and name and time and permissions)
1.3 if yes - compare the two files to check if they are identical,
and if so store them in RAM as candidates for hardlinking.
2 when the first loop is over: hardlink all files found in the first
loop. It's done so that the *file* whose inode has the smaller
reference count is always the one that gets deleted.
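The steps above can be sketched roughly as follows. This is only an
illustration of the description in this mail, not the actual code of
hardlink.py: the function names find_candidates and link_candidates and
the ".hardlink-tmp" suffix are made up, and only the file size is used as
the key (no name, time or permission options).

```python
import filecmp
import os
from collections import defaultdict


def find_candidates(root):
    """Step 1: walk root once, return (original, duplicate) path pairs."""
    by_size = defaultdict(list)  # size -> paths already seen with that size
    candidates = []
    for dirpath, dirs, files in os.walk(root):
        dirs.sort()  # deterministic walk order
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            for seen in by_size[st.st_size]:
                if os.path.samestat(os.stat(seen), st):
                    break  # already the same inode, nothing to do
                if filecmp.cmp(seen, path, shallow=False):
                    candidates.append((seen, path))  # step 1.3
                    break
            else:
                # No earlier file matched: remember this one (step 1.1).
                by_size[st.st_size].append(path)
    return candidates


def link_candidates(candidates):
    """Step 2: replace the file with fewer links by a hardlink to the other."""
    for a, b in candidates:
        sa, sb = os.stat(a), os.stat(b)
        if os.path.samestat(sa, sb):
            continue
        keep, drop = (a, b) if sa.st_nlink >= sb.st_nlink else (b, a)
        tmp = drop + ".hardlink-tmp"  # hypothetical temporary name
        os.link(keep, tmp)
        os.replace(tmp, drop)  # rename over the old name
```

Note that by_size and candidates grow with the number of files scanned,
which is consistent with the memory use I saw on the big tree.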
It bothers me that this program does not calculate md5sums. In point
1.3 while performing binary comparison of two files it is possible to
simultaneously calculate the md5sum (because the files are being
read). Those md5sums could be later stored in memory to eliminate
unnecessary comparisons. Of course when we approach a new file we
must compare it with at least one candidate (based on size and name
and time), but after that we have its md5sum, because a single comparison
was done, so some of the later comparisons can be eliminated.
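A minimal sketch of that idea, assuming nothing about how hardlink.py is
really structured: compare_and_hash, same_content and the digests
dictionary are hypothetical names. One trade-off to keep in mind is that
both files must be read to the end to get their md5sums, while a plain
comparison can stop at the first differing block.

```python
import hashlib


def compare_and_hash(path_a, path_b, chunk_size=1 << 16):
    """Return (identical, md5_a, md5_b), reading each file exactly once."""
    md5_a, md5_b = hashlib.md5(), hashlib.md5()
    identical = True
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if not chunk_a and not chunk_b:
                break
            if chunk_a != chunk_b:
                identical = False  # keep reading to finish both digests
            md5_a.update(chunk_a)
            md5_b.update(chunk_b)
    return identical, md5_a.hexdigest(), md5_b.hexdigest()


digests = {}  # path -> md5 hex digest learned during earlier comparisons


def same_content(path_a, path_b):
    """Compare two files, skipping the read when known digests differ."""
    da, db = digests.get(path_a), digests.get(path_b)
    if da is not None and db is not None and da != db:
        return False  # different md5sums: the files cannot be identical
    identical, digests[path_a], digests[path_b] = compare_and_hash(path_a, path_b)
    return identical
```

When both digests are known and equal, this sketch still does the full
comparison, so an md5 collision can never cause a wrong hardlink.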
Also I'm not sure but I suspect that this script does not use the
fact that if some file is found identical to another file (both having
different inodes), then in fact ALL the files that use those two
inodes are identical. And hardlinking them should affect ALL files
using one of the inodes, so that the inode is effectively deleted
(instead of just decreasing the number of references to this inode
by 1).
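Something like the following is what I have in mind, as a sketch only. The
names paths_by_inode and merge_inodes and the ".hardlink-tmp" suffix are
invented for illustration, and it can only relink names found inside the
scanned tree: a name for the same inode somewhere else would keep the old
inode alive.

```python
import os
import stat
from collections import defaultdict


def paths_by_inode(root):
    """Map (device, inode) -> every path under root that names that inode."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, etc.
            groups[(st.st_dev, st.st_ino)].append(path)
    return groups


def merge_inodes(groups, keep_key, drop_key):
    """Relink every known name of drop_key's inode to keep_key's inode."""
    keep_path = groups[keep_key][0]
    for path in groups.pop(drop_key):
        tmp = path + ".hardlink-tmp"  # hypothetical temporary name
        os.link(keep_path, tmp)
        os.replace(tmp, path)  # rename over the old name
        groups[keep_key].append(path)
```

Once the contents behind two inodes have been found identical a single
time, one call to merge_inodes handles all their names, so no further
comparisons are needed for the other files using the dropped inode.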
I'm not a python programmer, I planned to learn python, but not now.
So I'll not start hacking this script right now. Either I will later, or
one of you will do it :)
I googled a bit more, and couldn't find anything as good as this one.
--
Janek Kozicki |
_______________________________________________
rsnapshot-discuss mailing list
rsnapshot-discuss < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss
