deduplication

Posted by Anonymous 
deduplication
January 28, 2008 10:22AM
I have been using rsnapshot for over a year as an addition to my existing backup software (as I add servers, I do not buy additional licenses!).

I am planning to revamp my backup strategy in the next year and am looking to move to offsite disk instead of tape rotation. As I do this, I am going to switch to rsnapshot as my primary backup solution.

However, offsite disk space is not cheap. While rsnapshot does an awesome job of saving space in snapshots with hard links, I am wondering whether there is any way to have rsnapshot do data deduplication.

An example: I have two files (images) which are the same file, with the same name, the same data, and perhaps even the same or similar timestamps, in two different folders on my fileserver (silly users). Is it possible to have rsnapshot store only one file and link to it from the other folder?

I assume the answer is "yes, but that would be a ton of processing and take way too long; focus on getting your fileserving software to deduplicate files as it stores them, and use rsnapshot to back up the already deduplicated files." This may or may not be the answer, but I thought I would ask.

Thanks for your help!

-Paul
deduplication
January 29, 2008 01:20AM
Hello, Paul,

You (paulk) wrote on 28.01.08:

[quote]An example: I have two files (images) which are the same file, same
name, same data, perhaps even same or similar timestamps, in two
different folders on my fileserver (silly users) Is it possible to
have rsnapshot store only one file and link to it from the other
folder?
[/quote]
Run "freedup"

http://freedup.org

And run it not daily but, for example, weekly.

You can find such pairs (or larger groups), for example, with the "GPL"
file in many packages: every such package has the same file.

Or you can find those files when you install new user accounts based on
"/etc/skel". These files shouldn't be hard linked on the original
system, and rsnapshot leaves them unlinked.

Best regards!
Helmut

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
rsnapshot-discuss mailing list
rsnapshot-discuss < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rsnapshot-discuss
deduplication
January 28, 2008 11:15AM
On Mon, 2008-01-28 at 12:44 -0500, Paul Kortman wrote:
[quote]I am wondering is there any way to have rsnapshot do data
deduplication?

An example: I have two files (images) which are the same file, same
name, same data, perhaps even same or similar timestamps, in two
different folders on my fileserver (silly users) Is it possible to
have rsnapshot store only one file and link to it from the other
folder?
[/quote]
Yes. I know of two ways you could achieve this:

1. After each snapshot is taken, run Fedora's "hardlink" tool on the new
snapshot to convert duplicates to hard links. (You could even run it on
the entire snapshot root to catch duplicates that have different names
in different snapshots because the source files were renamed.) For
convenience, you can specify a "hardlink" command line as the
"cmd_postexec" in the rsnapshot configuration. "hardlink" is available
through the Fedora yum repositories and at:

http://cvs.fedora.redhat.com/viewcvs/devel/hardlink/

2. Get a copy of rsync containing the distributed patch
"link-by-hash.diff", and specify the --link-by-hash rsync option in the
rsnapshot configuration to have rsync reuse identical destination files
via a hashtable. (I have never tried this option.)
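
To make the idea behind option 1 concrete, here is a minimal sketch of what a duplicate-linking pass does: group files by size, hash the candidates, and replace later copies with hard links to the first one. It is not the implementation of Fedora's "hardlink" or of freedup. The function names and the ".dedup-tmp" suffix are made up for illustration, and the sketch assumes the whole tree lives on one filesystem.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_duplicates(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced by links.
    (Hypothetical helper; real tools also compare metadata.)
    """
    # Group by size first so that only same-sized files get hashed.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            by_size[os.path.getsize(path)].append(path)

    linked = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        first_seen = {}  # digest -> path kept as the link target
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already hard linked
            # Link under a temporary name, then rename over the duplicate,
            # so the duplicate never goes missing if a step fails.
            tmp = path + ".dedup-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            linked += 1
    return linked
```

Note that hard-linked names share one inode, so they also share owner, permissions, and timestamps. This sketch ignores those attributes. For a backup that matters, since files with identical data but different metadata would end up with the metadata of whichever copy was kept.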

Matt

deduplication
January 29, 2008 05:11PM
On Tue, 2008-01-29 at 20:35 +0100, Johannes Niess wrote:
[quote]You could pull existing hardlinks over to new backups. This reduces the need
to find identical files. When adding backups with similar files you could
seed them:

1) cp -al /snapshots/daily.0/existing /snapshots/daily.0/new
2) Add the "new" backup to rsnapshot.conf and run it.
3) Remove the seeding /snapshots/daily.1/new as it only contains hardlinks of
what is now /snapshots/daily.1/existing

Typical use cases are full backups of the same operating system on different
hardware or with different applications. Starting with an incremental backup
should be very efficient for slow network transfer.
[/quote]
Yes, this is a good approach when adding a backup point that mostly
parallels an existing one, but I think Paul's issue was with
unpredictably sprinkled identical files.
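
For reference, the seeding step quoted above (cp -al) can be sketched in Python roughly as follows. The function name is hypothetical, and the sketch assumes both directories are on the same filesystem and that the destination does not exist yet.

```python
import os
import shutil


def seed_backup_point(existing, new):
    """Create `new` as a tree of hard links to the files under `existing`."""
    # copy_function=os.link hard-links each file instead of copying its data;
    # symlinks=True recreates symbolic links rather than following them.
    shutil.copytree(existing, new, symlinks=True, copy_function=os.link)
```

Directories are created fresh while every regular file in the new tree shares its inode with the original, so the seed costs almost no extra disk space.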

Actually, when I used rsnapshot to back up two similar computers, I gave
each backup point an extra --link-dest option pointing to the previous
version of the other backup point. At the expense of some additional
stat calls, this setup saved space when I copied a file from one
computer to the same place on the other even long after initializing the
setup.
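
As an illustration of the cross-linking setup described above, the sketch below only builds an rsync command line with more than one --link-dest option (rsync accepts the option several times). The host names, paths, and function name are hypothetical, and how the extra option is passed through rsnapshot.conf is not shown here.

```python
import os


def rsync_command(source, dest, link_dests):
    """Build an rsync argument list that hard-links unchanged files
    against each directory in link_dests."""
    for path in link_dests:
        # Relative --link-dest paths are resolved against the destination
        # directory, so this sketch insists on absolute paths to avoid surprises.
        if not os.path.isabs(path):
            raise ValueError(f"link-dest must be absolute: {path}")
    command = ["rsync", "-a", "--delete", "--numeric-ids"]
    command += [f"--link-dest={path}" for path in link_dests]
    command += [source, dest]
    return command


# Hypothetical example: back up hostA, linking against the previous
# snapshot of hostA itself and of the similar machine hostB.
example = rsync_command(
    "hostA:/",
    "/snapshots/daily.0/hostA/",
    ["/snapshots/daily.1/hostA/", "/snapshots/daily.1/hostB/"],
)
```

rsync checks the link-dest directories in the order given, which is where the additional stat calls mentioned above come from.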

Matt

deduplication
January 28, 2008 03:19PM
On Mon, Jan 28, 2008 at 01:37:45PM -0500, Matt McCutchen wrote:
[quote]On Mon, 2008-01-28 at 12:44 -0500, Paul Kortman wrote:
[quote]I am wondering is there any way to have rsnapshot do data
deduplication?

An example: I have two files (images) which are the same file, same
name, same data, perhaps even same or similar timestamps, in two
different folders on my fileserver (silly users) Is it possible to
have rsnapshot store only one file and link to it from the other
folder?
[/quote]
Yes. I know of two ways you could achieve this:

1. After each snapshot is taken, run Fedora's "hardlink" tool on the new
snapshot to convert duplicates to hard links. (You could even run it on
the entire snapshot root to catch duplicates that have different names
in different snapshots because the source files were renamed.) For
convenience, you can specify a "hardlink" command line as the
"cmd_postexec" in the rsnapshot configuration.
[/quote]
Just bear in mind that cmd_postexec is run after every backup. So if
you have 3 backup lines in your rsnapshot.conf then hardlink would be
run 3 times for each rsnapshot, which is probably not what you want.

If desired you could run hardlink over your most recent snapshot
(eg: hourly.0) after each rsnapshot, and run hardlink over the
whole snapshot_root once a week.
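
A small sketch of that schedule, assuming a wrapper script decides which directory to hand to hardlink after each rsnapshot run. The function name and the way the time of the last full scan is tracked are assumptions for illustration only.

```python
import os
from datetime import datetime, timedelta


def hardlink_target(snapshot_root, last_full_scan, now,
                    newest="hourly.0", full_scan_every=timedelta(days=7)):
    """Pick the directory to deduplicate after an rsnapshot run.

    Returns the whole snapshot root when no full scan has happened yet or
    the last one is at least `full_scan_every` old; otherwise returns only
    the newest snapshot directory.
    """
    if last_full_scan is None or now - last_full_scan >= full_scan_every:
        return snapshot_root
    return os.path.join(snapshot_root, newest)
```

The caller would record the time whenever the function returns the snapshot root, so that the expensive full pass happens about once a week while the cheap pass over the newest snapshot happens every run.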

[quote]"hardlink" is available
through the Fedora yum repositories and at:

http://cvs.fedora.redhat.com/viewcvs/devel/hardlink/
[/quote]
BTW, in older versions (like FC3, FC4, RHEL4) /usr/sbin/hardlink was
included with kernel-utils.

--
___________________________________________________________________________
David Keegel <djk < at > cybersource.com.au> http://www.cyber.com.au/users/djk/
Cybersource P/L: Linux/Unix Systems Administration Consulting/Contracting
