
Parallelism and deduplication

Posted by Anonymous 
Parallelism and deduplication
April 25, 2016 03:11AM
Hi all.
I'm trying to implement a new backup server based on rsnapshot,
adding parallelism (through "parallel") and deduplication.

I just need some confirmation. This is what I have right now (and it seems
to work perfectly):

/etc/rsnapshot.d/hosts/server1.{conf,passwd}
/etc/rsnapshot.d/hosts/server2.{conf,passwd}
/etc/rsnapshot.d/hosts/server3.{conf,passwd}
/etc/rsnapshot.d/hosts/serverN.{conf,passwd}
/etc/rsnapshot.d/rsnapshot.conf

/etc/rsnapshot.d/rsnapshot.conf is the base configuration,
included from each host's config:

$ cat /etc/rsnapshot.d/hosts/server1.conf
include_conf /etc/rsnapshot.d/rsnapshot.conf
snapshot_root /var/backups/rsnapshot/server1/
logfile /var/log/rsnapshot/server1.log
lockfile /var/run/rsnapshot/server1.pid
backup rsync://rsnapshot < at > server1/everything/ ./ +rsync_long_args=--password-file=/etc/rsnapshot.d/hosts/server1.passwd

$ cat rsnapshot.conf
config_version 1.2
no_create_root 0
cmd_cp /bin/cp
cmd_rm /bin/rm
cmd_rsync /usr/bin/rsync
#cmd_ssh /usr/bin/ssh
cmd_logger /usr/bin/logger
cmd_du /usr/bin/du
#cmd_preexec /path/to/preexec/script
#cmd_postexec /path/to/postexec/script
verbose 3
loglevel 5
lockfile /var/run/rsnapshot.pid
sync_first 1
rsync_short_args -a
rsync_long_args --delete --numeric-ids --relative --delete-excluded --stats
link_dest 1
rsync_numtries 3
retain daily 15
retain weekly 2
exclude /backups/
exclude /admin_backups/
exclude /reseller_backups/
exclude /user_backups/
exclude /tmp/
exclude /proc
exclude /sys
exclude /var/cache
exclude /var/log/lastlog
exclude /var/log/rsync*
exclude /var/lib/mlocate
#exclude /var/spool
exclude /media
exclude /mnt
exclude tmp/

Up to this point, it's pretty easy.
Now I've added some parallelism, through parallel.
Just a single entry in crontab:

$ cat /etc/cron.d/rsnapshot
0 0 * * * root /usr/local/bin/parallel_rsnapshot > /dev/null 2>&1

$ cat /usr/local/bin/parallel_rsnapshot
#!/bin/bash
RSNAPSHOT_SCHEDULER="/usr/local/bin/rsnapshot_scheduler"
HOSTS_PATH="/etc/rsnapshot.d/hosts"
PARALLEL_JOBS=5
# Run rsnapshot in parallel
parallel --jobs ${PARALLEL_JOBS} "${RSNAPSHOT_SCHEDULER} {}" ::: ${HOSTS_PATH}/*.conf

And a custom rsnapshot scheduler that chooses which backup level to run
(this is simplified a bit for posting to the list):

$ cat /usr/local/bin/rsnapshot_scheduler
#!/bin/bash
RSNAPSHOT="/usr/bin/rsnapshot -v"
CONFIG=$1
HOST=$(/usr/bin/basename ${CONFIG} | sed 's/.conf$//g')
LOG_PATH="/var/log/rsnapshot"
LOG_FILE="${LOG_PATH}/$(/usr/bin/basename ${CONFIG} | sed 's/.conf$/.log/g')"

function rsnap {
${RSNAPSHOT} -c $1 $2
}

if [ "$(date +%j)" -eq "001" ]; then
rsnap ${CONFIG} yearly
fi
if [ $(date +%d) -eq 1 ]; then
rsnap ${CONFIG} monthly
fi
if [ $(date +%w) -eq 0 ]; then
rsnap ${CONFIG} weekly
fi

rsnap ${CONFIG} sync

# Check if sync was OK. Run daily only if sync is OK.
SUCCESS=$(grep -ci "$(date +%d/%b/%Y).*sync: completed successfully" ${LOG_FILE})
if [ ${SUCCESS} -ne 1 ]; then
EMAIL_SUBJECT="Backup FAILED per ${HOST}"
else
EMAIL_SUBJECT="Backup OK per ${HOST}"
rsnap ${CONFIG} daily
fi

# Send full log report
grep -i "$(date +%d/%b/%Y)" ${LOG_FILE} | mailx -s "${EMAIL_SUBJECT}"
myemail < at > mydomain.tld

Now, some questions:

1) Should I run weekly, monthly and yearly before or after the sync
process? If I run them before, I'll rotate some backups with no confirmation
that the following sync will be OK. This could lead to some missing backups
(the ones deleted by rotation). Probably the first thing to do is a
sync; if all is OK, then rotate everything else. Right?

2) On the remote server, I have a "pre-xfer" script that runs some actions
and outputs some debug text (like "echo Running xy....").
Is it possible to get that output logged by rsnapshot?

3) I would like to add some deduplication. Currently I can run
"hardlink" over the latest backup (for example, daily.0), but this will
"deduplicate" only files in the same backup pool. How can I deduplicate
common files across all pools? Is it safe to run "hardlink" across the
whole rsnapshot directory (hardlink /var/backups/rsnapshot/server1/*)?
This would save much more space.

Thanks in advance.

Parallelism and deduplication
April 29, 2016 11:10AM
No one?

On 25/04/2016 12:08, Gandalf Corvotempesta wrote:
[quote][...][/quote]

Parallelism and deduplication
April 29, 2016 05:55PM
On Mon, Apr 25, 2016 at 3:08 AM, Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:
[quote]I'm trying to implement a new backup server based on rsnapshot
by adding parallelism (through "parallel") and deduplication.
[/quote]

Meta-comment: Often I/O bandwidth or seeks are the limiting factor, and running things in parallel will convert 5 serial 10-minute jobs into 5 parallel 60-minute jobs.  Make sure you're testing for that in your case.  Especially the cp -al and rm -rf phases are unlikely to enjoy competing with each other for resources, whereas rsync phases could plausibly timeshare with each other.

[quote] Now, some questions:

1) Should I run weekly, monthly and yearly before or after the sync
process? If I run them before, I'll rotate some backups with no confirmation
that the following sync will be OK. This could lead to some missing backups
(the ones deleted by rotation). Probably the first thing to do is a
sync; if all is OK, then rotate everything else. Right?
[/quote]

Only the first level benefits from sync.  I always run the longer-period rsnapshot calls first, because when combined with use_lazy_deletes, those runs can be relatively tightly constrained in runtime (though I leave a gap to my next sync, to let all the lazy deletes clear out).
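For example, a crontab along these lines matches that ordering (just a sketch; the times and config path are placeholders, and it assumes sync_first and use_lazy_deletes are enabled in the config):

# Larger levels first, then the lowest level, then sync, with a gap so
# the lazy deletes can clear out before the next sync starts.
30 22 1 1 * root /usr/bin/rsnapshot -c /etc/rsnapshot.conf yearly
00 23 1 * * root /usr/bin/rsnapshot -c /etc/rsnapshot.conf monthly
15 23 * * 0 root /usr/bin/rsnapshot -c /etc/rsnapshot.conf weekly
30 23 * * * root /usr/bin/rsnapshot -c /etc/rsnapshot.conf daily
00 01 * * * root /usr/bin/rsnapshot -c /etc/rsnapshot.conf sync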

[quote] 2) On remote server, I have a "pre-xfer" script that run some actions
and output some debug text (like "echo Running xy....")
Is possible to get the output logged by rsnapshot ?
[/quote]

How and where?  Why not just have the script log somewhere directly?

[quote] 3) I would like to add some deduplication feature. Actually, I can run
"hardlink" over the latest backup (for example, daily.0), but this will
"deduplicate" only files in the same backup pool. How can I deduplicate
common files acroll all pools ? Is safe to run "hardlinks" across the
whole rsnapshot directory? (hardlink /var/backups/rsnapshot/server1/*) ?
This would save much more space....
[/quote]

Running hardlink every time is probably going to take a long time and not do much most of the time.  If you are frequently generating duplicate data across hosts (like from a script or something), you might either explore whether you actually need to backup that data, or whether you can dedupe in targeted directories.  I will dedupe periodically either when I'm doing some other maintenance or when I notice disk usage seems up.

I just dedupe across my entire volume.  The premise of rsnapshot means that hardlinks are fine, if they weren't the entire thing would fall over.  Just make sure you aren't deduping while rsnapshot is running.  On most systems you cannot hardlink across volumes.
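As a sketch (the lockfile and snapshot_root paths follow the layout earlier in the thread; adjust for your own setup), a dedupe wrapper could refuse to run while any rsnapshot instance still holds a lockfile:

#!/bin/bash
# Hypothetical dedupe wrapper: skip if any per-host rsnapshot lockfile
# exists, otherwise hardlink identical files across the whole volume.
SNAPSHOT_ROOT="/var/backups/rsnapshot"
if ls /var/run/rsnapshot/*.pid >/dev/null 2>&1; then
    echo "rsnapshot still running, skipping dedupe" >&2
    exit 1
fi
hardlink "${SNAPSHOT_ROOT}"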

-scott
Parallelism and deduplication
April 30, 2016 03:24AM
2016-04-30 2:52 GMT+02:00 Scott Hess <scott < at > doubleu.com>:
[quote]Meta-comment: Often I/O bandwidth or seeks are the limiting factor, and
running things in parallel will convert 5 serial 10-minute jobs into 5
parallel 60-minute jobs. Make sure you're testing for that in your case.
Especially the cp -al and rm -rf phases are unlikely to enjoy competing with
each other for resources, whereas rsync phases could plausibly timeshare
with each other.
[/quote]
This is something I hadn't considered; you are right.
What if I use "link-dest" instead of "cp -al"? It should avoid the cp phase.

My biggest issue is finishing the whole rsync phase as fast as possible, to keep
load on the backed-up servers low. I'm running rsync in the middle of the night;
running the backups serially (I have to back up 98 servers) would result in
some rsyncs still running during the day. This is not what I want.

"cp -al" is run on the backup server, so I don't care about load there.

As an improvement, I can run "rsnapshot sync" in parallel and, after all syncs,
run the daily rotation, even sequentially.

But with "link-dest" the whole "cp" phase should be avoided, right?

[quote]Only the first level benefits from sync. I always run the longer-period
rsnapshot calls first, because when combined with use_lazy_deletes, those
runs can be relatively tightly constrained in runtime (though I leave a gap
to my next sync, to let all the lazy deletes clear out).
[/quote]
OK, but rotating before sync could lead to deleted backups (based on retention)
even if the following sync is not completed.

[quote]How and where? Why not just have the script log somewhere directly?
[/quote]
Yes, that is the best solution.

[quote]Running hardlink every time is probably going to take a long time and not do
much most of the time. If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe in
targeted directories. I will dedupe periodically either when I'm doing some
other maintenance or when I notice disk usage seems up.
[/quote]
I don't want to dedup across hosts (at the moment).
I would like to dedup across multiple levels for the same host, for example:

daily.3 shares almost 90% of its files with daily.2

I've tried running "hardlink" across all backups for one host and saved 12GB.

[quote]I just dedupe across my entire volume. The premise of rsnapshot means that
hardlinks are fine, if they weren't the entire thing would fall over. Just
make sure you aren't deduping while rsnapshot is running. On most systems
you cannot hardlink across volumes.
[/quote]
So, do you dedupe across daily.1, then daily.2, and so on, or across all
backup levels at
the same time? For example:

hardlink myserver/daily.1
hardlink myserver/daily.2

or

hardlink myserver/*

Parallelism and deduplication
April 30, 2016 03:08PM
On Sat, Apr 30, 2016 at 3:21 AM, Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:
[quote]2016-04-30 2:52 GMT+02:00 Scott Hess <scott < at > doubleu.com>:
[quote]Meta-comment: Often I/O bandwidth or seeks are the limiting factor, and
running things in parallel will convert 5 serial 10-minute jobs into 5
parallel 60-minute jobs.  Make sure you're testing for that in your case.
Especially the cp -al and rm -rf phases are unlikely to enjoy competing with
each other for resources, whereas rsync phases could plausibly timeshare
with each other.
[/quote]
This is something that I&#39;ve not considered, you are right.
What if i&#39;ll use "link-dest" and not "cp -al" ? it should avoid the cp phase.
[/quote]

I/O is I/O, you can shift it around, but this won't get rid of it.
 
[quote] My biggest issue is finishing the whole rsync phase as fast as possible, to keep
load on the backed-up servers low. I'm running rsync in the middle of the night;
running the backups serially (I have to back up 98 servers) would result in
some rsyncs still running during the day. This is not what I want.
[/quote]

Do you want to run as fast as possible on the system being backed up?  Because in that case it will be faster to back up a single server at a time, so that backups don't contend with each other on I/O.

Do you want to run with as little load as possible?  Because in that case running as fast as possible is likely to run at a higher load than spreading things out.

Your last couple lines make it sound like you're having an issue where the overall backup is taking too long.  Rather than trying to parallelize 98 backups, I'd scan rsnapshot.log and look to see if there aren't five or six backups which are causing the majority of the problems.  Probably if you take those five or six and put them in a separate rsnapshot flow, you can end up with 2 parallel systems which modestly compete with each other, rather than trying to have a single system with weird performance issues.  In that case you could even consider doing things like putting one set of servers on a different volume so they aren't stepping on each other's toes I/O wise.

[quote] "cp -al" is run on backup server, thus I don&#39;t care about load.
[/quote]

The "cp -al" pass runs before the rsync, so on the backup server in isolation.  Using link-dest would push the "cp -al" I/O into the rsync itself, so the rsync will likely take longer.

[quote] As an improvement, I can run "rsnapshot sync" in parallel and, after all syncs,
run the daily rotation, even sequentially.
[/quote]

Running the sync in parallel should mostly mean running all of the rsync calls in parallel, so that sounds like what you need.
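A sketch of that split, reusing the tools from earlier in the thread (it bypasses the per-host scheduler and just separates the two phases):

#!/bin/bash
# Run all "sync" passes in parallel (remote rsync, network bound),
# then rotate each host sequentially (local cp -al / rm -rf, disk bound).
HOSTS_PATH="/etc/rsnapshot.d/hosts"
PARALLEL_JOBS=5
parallel --jobs ${PARALLEL_JOBS} "/usr/bin/rsnapshot -c {} sync" ::: ${HOSTS_PATH}/*.conf
for conf in ${HOSTS_PATH}/*.conf; do
    /usr/bin/rsnapshot -c "${conf}" daily
done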
 
[quote] But with "link-dest" the whole "cp" phase should be avoided, right ?
[/quote]

You avoid the point-in-time "cp -al" phase.  You cannot avoid the I/O overhead of creating the directory structure and populating it with hardlinks.

[quote] > Only the first level benefits from sync.  I always run the longer-period
[quote]rsnapshot calls first, because when combined with use_lazy_deletes, those
runs can be relatively tightly constrained in runtime (though I leave a gap
to my next sync, to let all the lazy deletes clear out).
[/quote]
Ok, but rotating before sync could lead to deleted backups (based on retention)
even if the following sync is not completed.
[/quote]

Don&#39;t know what you&#39;re getting at.  If sync is failing, you aren&#39;t backing up current data, but you won&#39;t lose material amounts of data until sync has been failing for weeks, don&#39;t let yourself get into that situation.

[quote] > Running hardlink every time is probably going to take a long time and not do
[quote]much most of the time.  If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe in
targeted directories.  I will dedupe periodically either when I'm doing some
other maintenance or when I notice disk usage seems up.
[/quote]
I don&#39;t want to dedup across hosts (at the moment).
I would like to dedup across multiple level for the same host, in example:

daily.3 has almost 90% of shared files with daily.2

I&#39;ve tried to run "hardlink" across all backups for 1 host and saved 12GB
[/quote]

If you have a lot of identical files between daily.3 and daily.2 and they aren't being hardlinked, then there's something wrong with your system.  The entire point of rsnapshot is that if a file doesn't change between two backups, then it should be hardlinked.
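A quick way to check (a sketch; the file path is a placeholder) is to compare inode numbers for the same unchanged file in two snapshots:

# Both paths should show the same inode number (first column) if the
# file did not change between the two backups.
ls -li daily.2/etc/hostname daily.3/etc/hostname

# Count files in daily.3 that are not shared with any other snapshot.
find daily.3 -type f -links 1 | wc -l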

[quote] > I just dedupe across my entire volume.  The premise of rsnapshot means that
[quote]hardlinks are fine, if they weren&#39;t the entire thing would fall over.  Just
make sure you aren&#39;t deduping while rsnapshot is running.  On most systems
you cannot hardlink across volumes.
[/quote]
So, do you dedupe across daily.1, then daily.2, and so on, or across all
backup levels at
the same time? For example:

hardlink myserver/daily.1
hardlink myserver/daily.2

or

hardlink myserver/*
[/quote]

I dedupe across the entire volume, so that if hosts share files (say they are comparable Ubuntu installs or something), they don't duplicate.

-scott
Parallelism and deduplication
April 30, 2016 07:33PM
On Sat, Apr 30, 2016 at 6:04 PM, Scott Hess <scott < at > doubleu.com> wrote:
[quote]On Sat, Apr 30, 2016 at 3:21 AM, Gandalf Corvotempesta
<gandalf.corvotempesta < at > gmail.com> wrote:
[/quote]
[quote][quote]"cp -al" is run on backup server, thus I don't care about load.
[/quote]

The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
[/quote]
"cp -al" is lightning fast: it's talking only to the local filesystem,
not to any seprate "rsync" process which may have to intelligently
parse and complain file systems. Don't discard this unless you have
to.

[quote][quote]As improvement, I can run "rsnapshot sync" in parallel, and after all
syncs,
I can run the daily rotation, even sequentially.
[/quote]

Running the sync in parallel should mostly mean running all of the rsync
calls in parallel, so that sounds like what you need.

[quote]
But with "link-dest" the whole "cp" phase should be avoided, right ?
[/quote]

You avoid the point-in-time "cp -al" phase. You cannot avoid the I/O
overhead of creating the directory structure and populating it with
hardlinks.
[/quote]
Which is typically significantly faster than doing sync funkiness later.

[quote][quote][quote]Only the first level benefits from sync. I always run the longer-period
rsnapshot calls first, because when combined with use_lazy_deletes,
those
runs can be relatively tightly constrained in runtime (though I leave a
gap
to my next sync, to let all the lazy deletes clear out).
[/quote]
Ok, but rotating before sync could lead to deleted backups (based on
retention)
even if the following sync is not completed.
[/quote][/quote]
Do the fast ones first. If the slow ones run long, the short ones will
be missed or delayed, and if parallelized they will overlap in I/O,
bandwidth, and CPU resources. If you do the fast ones first, they're
out of the way and the long ones can use reasonably configured lock
files to say "I haven't finished, I'll do the next update!!"

[quote]Don't know what you're getting at. If sync is failing, you aren't backing
up current data, but you won't lose material amounts of data until sync has
been failing for weeks, don't let yourself get into that situation.

[quote][quote]Running hardlink every time is probably going to take a long time and
not do
much most of the time. If you are frequently generating duplicate data
across hosts (like from a script or something), you might either explore
whether you actually need to backup that data, or whether you can dedupe
in
targeted directories. I will dedupe periodically either when I'm doing
some
other maintenance or when I notice disk usage seems up.
[/quote]
I don't want to dedup across hosts (at the moment).
I would like to dedup across multiple level for the same host, in example:

daily.3 has almost 90% of shared files with daily.2

I've tried to run "hardlink" across all backups for 1 host and saved 12GB
[/quote]

If you have a lot of identical files between daily.3 and daily.2 and they
aren't being hardlinked, then there's something wrong with your system. The
entire point of rsnapshot is that if a file doesn't change between two
backups, then it should be hardlinked.
[/quote]
Try hardlinking *one* backup of one host. You may find considerable
amounts of file duplication, such as duplicate copies of multiple
source trees for the same person working on the same host. I've had
this happen when working on kernels or other bulky source trees.

It can also be dangerous as futz to restore from this, because you can
end up with hard links *inside* source trees which *should* be independent, and
where copying to one modifies both source trees. Been there, done
that, even seen it with identical files inside MySQL databases where
multiple databases had the same MyISAM tables copied into them
directly. Hilarity ensued.....
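At restore time this matters because rsync does not preserve hard links by default: a plain restore expands every path into an independent copy, while -H recreates whatever link structure is in the snapshot, including links created by deduplication rather than by the original host. A sketch with placeholder source and destination:

# Restore, expanding hardlinks into independent files (rsync's default):
rsync -a /var/backups/rsnapshot/server1/daily.0/ root@server1:/restore/

# Restore preserving hardlinks exactly as found in the snapshot
# (use with care after deduplication):
rsync -aH /var/backups/rsnapshot/server1/daily.0/ root@server1:/restore/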

[quote][quote][quote]I just dedupe across my entire volume. The premise of rsnapshot means
that
hardlinks are fine, if they weren't the entire thing would fall over.
Just
make sure you aren't deduping while rsnapshot is running. On most
systems
you cannot hardlink across volumes.
[/quote]
So, you are dedupe across daily.1 then daily.2 and so on or across all
backup leves at
the same time ? In example:

hardlink myserver/daily.1
hardlink myserver/daily.2

or

hardlink myserver/*
[/quote]

I dedupe across the entire volume, so that if hosts share files (say they
are comparable Ubuntu installs or something), they don't duplicate.

-scott
[/quote]
See above for some of the risks. I *hope* whatever restoration
techniques you use are intelligent enough to break, or set, hardlinks
as you need, because there are dangers.

Parallelism and deduplication
April 30, 2016 09:08PM
On Sat, 30 Apr 2016 12:21:44 +0200
Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:

[quote][...][/quote]

The most important thing is to understand what truly needs to be backed
up. Looking at your exclusion lists (in a prior thread element) leads me
to believe that you might be backing up a bunch of stuff that's really
OS package data, and not your own specific unique data. Understand that
you will never restore an OS from this backup. Indeed, having any OS
data could create major issues for you during a restore operation later.

That said, if you are pulling a bunch of similar nodes to a central
host, maybe you should rethink what you really need. Adding another
disk to each box that only handles backups seriously ups the security
of the data, reduces backup time and resource requirements, while
eliminating network bandwidth consumption. From that backup, the
'previous' node numerically (current == node0008, previous == node0007)
could then pull current's backup, niced way down (bwlimit in rsync),
over the network.

Now, the data exists in three places. In this model there is no
centralized server or point of failure. 3 separate physical disks
scattered between 2 boxes would need to fail before you'd lose data.

Centralization is a double-edged sword. Sure, single point of
management, but if the central backup server craps out, everything is
at risk.

In almost every analysis of ensuring redundancy, a distributed system
wins.

--
Regards,
Christopher

Parallelism and deduplication
May 01, 2016 09:23AM
On 01/05/2016 00:04, Scott Hess wrote:
[quote]Do you want to run as fast as possible on the system being backed up?
Because in that case it will be faster to back up a single server at a
time, so that backups don't contend with each other on I/O.

Do you want to run with as little load as possible? Because in that
case running as fast as possible is likely to run at a higher load
than spreading things out.
[/quote]
In my use case, I've seen that running one backup at a time would require
the whole night and half the day.
Running 5 backups in parallel, starting at midnight, ends at about
7:30-8:00.

[quote]Your last couple lines make it sound like you're having an issue where
the overall backup is taking too long. Rather than trying to
parallelize 98 backups,
[/quote]
I don't parallelize 98 backups. I parallelize 5 to 10 backups (still
trying to find the best value).
I have 98 backups to do with this server, not 98 backups in parallel.

[quote]The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
[/quote]Running "cp -al" before the rsync would lead to a delay running rsync,
thus i'll end with backups running in the morning.
[quote]Don't know what you're getting at. If sync is failing, you aren't
backing up current data, but you won't lose material amounts of data
until sync has been failing for weeks, don't let yourself get into
that situation.
[/quote]Not really.
The standard rsnapshot process will start rotating all backups; rotation
will remove the expired ones.
I need 2 weekly and 14 daily backups.
On daily.13, when rsnapshot moves that to weekly.0, the older weekly.1 is
removed.
If the next sync fails, weekly.1 is still gone.
With my scheduler, if sync fails, all rotations are stopped and thus I
preserve the whole backup tree.
[quote]I dedupe across the entire volume, so that if hosts share files (say
they are comparable Ubuntu installs or something), they don't duplicate.
[/quote]
Good idea. 90% of my servers are Debian 7.

Parallelism and deduplication
May 01, 2016 09:34AM
On 01/05/2016 04:31, Nico Kadel-Garcia wrote:
[quote]"cp -al" is lightning fast: it's talking only to the local filesystem,
not to any seprate "rsync" process which may have to intelligently
parse and complain file systems. Don't discard this unless you have to.
[/quote]The rsnapshot man page advises using link_dest.
BTW, I can try using "cp -al" and see whether the next rsync takes less time.

[quote]Try hardlinking *one* backup of one host. You may find considerable
amounts of file duplication, such as duplicate copies of multiple
source trees for the same person working on the same host. I've had
this happen when working on kernels or other bulky source trees. It
can also be dangerous as futz to restore from this, because you can
end up with hard links *inside* source trees which *should* be independent, and
where copying to one modifies both source trees. Been there, done
that, even seen it with identical files inside MySQL databases where
multiple databases had the same MyISAM tables copied into them
directly. Hilarity ensued.....
[/quote]I don't see the issue.
If identical MyISAM tables are copied, they can be hardlinked.
On the next backup, if the tables differ, the new MyISAM files are
backed up and not hardlinked.
[quote]See above for some of the risks. I *hope* whatever restoration
techniques you use are intelligent enough to break, or set, hardlinks
as you need, because there are dangers.
[/quote]Isn't rsync smart enough to restore the file *content* and not the
hardlink?
How can I distinguish a hardlink coming from the server from a hardlink
made by rsnapshot during the backup?
A hardlink coming from the server should be restored as-is; a hardlink made
by rsnapshot should
be resolved and restored as a regular file.

On the backup server they are both hardlinks....

Parallelism and deduplication
May 01, 2016 09:46AM
On 01/05/2016 06:06, Christopher Barry wrote:
[quote]The most important thing is to understand what truly needs to be
backed up. Looking at your exclusion lists (in a prior thread element)
leads me to believe that you might be backing up a bunch of stuff
that's really OS package data, and not your own specific unique data.
Understand that you will never restore an OS from this backup. Indeed,
having any OS data could create major issues for you during a restore
operation later.
[/quote]Why?
Almost all of my servers are virtual.
In the past, when I had to restore a whole server, I cloned it from
another one and restored the whole backup, even the OS.

One of Linux's advantages is that everything is a file, thus everything
can be backed up and restored.

[quote]That said, if you are pulling a bunch of similar nodes to a central
host, maybe you should rethink what you really need. Adding another
disk to each box that only handles backups seriously ups the security
of the data, reduces backup time and resource requirements, while
eliminating network bandwidth consumption. From that backup, the
'previous' node numerically (current == node0008, previous ==
node0007) could then pull current's backup, niced way down (bwlimit in
rsync), over the network. Now, the data exists in three places. In
this model there is no centralized server or point of failure. 3
separate physical disks scattered between 2 boxes would need to fail
before you'd lose data. Centralization is a double-edged sword. Sure,
single point of management, but if the central backup server craps
out, everything is at risk. In almost every analysis of ensuring
redundancy, a distributed system wins.
[/quote]
In my environment I have 6 backup servers, not 1.
What I've posted here is a simplification, just a POC of what I'm
working on (parallelism and dedup), not an exact mirror of my environment.

I'm backing up tons of small VMs, about 80-90 for each backup server.
Most of them are almost identical (for example, varnish nodes 1,2,3,4,5,6
are all the same, DNS1,2,3,4 are identical, MX1,2,3,4 are identical, and
so on).
Backing them up in full, in my environment, is easier than choosing what
to back up and what to ignore, because I have the same VM template with
the same rsync configuration cloned multiple times, and rsnapshot
automatically detects new servers (my host list in
/etc/rsnapshot.d/hosts/*.conf is created dynamically) and runs the backup
every night.

Filtering out some servers is a waste of time. It's much easier to run
something like:
$ ping -c1 ${ip}
across my subnets once a day and create the configuration files
automatically if a host is up.
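A minimal sketch of that generator (the subnet, the template handling, and the paths are just an example):

#!/bin/bash
# Ping each address in a subnet and create a per-host rsnapshot config
# from a template (hypothetical) if the host answers.
HOSTS_PATH="/etc/rsnapshot.d/hosts"
TEMPLATE="/etc/rsnapshot.d/host.conf.template"
for i in $(seq 1 254); do
    ip="10.0.0.${i}"
    if ping -c1 -W1 "${ip}" >/dev/null 2>&1; then
        host=$(getent hosts "${ip}" | awk '{print $2}')
        [ -n "${host}" ] || continue
        [ -f "${HOSTS_PATH}/${host}.conf" ] && continue
        sed "s/__HOST__/${host}/g" "${TEMPLATE}" > "${HOSTS_PATH}/${host}.conf"
    fi
done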

Parallelism and deduplication
May 01, 2016 10:15AM
On Sun, May 1, 2016 at 9:21 AM, Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:
[quote]On 01/05/2016 00:04, Scott Hess wrote:
[quote] Do you want to run as fast as possible on the system being backed up?  Because in that case it will be faster to back up a single server at a time, so that backups don't contend with each other on I/O.

Do you want to run with as little load as possible? Because in that case running as fast as possible is likely to run at a higher load than spreading things out.
[/quote]
In my use-case, i&#39;ve seen that running 1 backup per time would require the whole night and half day
Running 5 backups in parallels, starting at midnight, will end at about 7:30-8:00[/quote]

So your goal is to get the backups done before a particular time, but you don't actually care about load on the target server if that happens?

For most reasonable backup servers, you should have more than enough CPU to run parallel backups, but you might want to pay attention to having enough I/O capacity and memory.  I'd be nervous about running parallel backups on the same spindles, unless you have something which nicely spreads the load across spindles.

[quote] [quote] The "cp -al" pass runs before the rsync, so on the backup server in isolation.  Using link-dest would push the "cp -al" I/O into the rsync itself, so the rsync will likely take longer.
[/quote] Running "cp -al" before the rsync would lead to a delay running rsync, thus i&#39;ll end with backups running in the morning.[/quote]

Then run it earlier?

Using sync_first, the "cp -al" phase happens during rotation (hourly, daily, etc), while the rsync happens during sync.
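That is, with sync_first enabled the two phases can be scheduled independently (a sketch):

/usr/bin/rsnapshot -c /etc/rsnapshot.conf daily   # rotation only: cp -al / rm -rf, local to the backup server
/usr/bin/rsnapshot -c /etc/rsnapshot.conf sync    # rsync into .sync, no rotation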

[quote] [quote] Don&#39;t know what you&#39;re getting at.  If sync is failing, you aren&#39;t backing up current data, but you won&#39;t lose material amounts of data until sync has been failing for weeks, don&#39;t let yourself get into that situation.
[/quote] Not really.
standard rsnapshot process will start rotating all backups. rotation will remove the expired ones.
I need 2 weekly and 14 daily backups.
On daily.13, when rshapshot move that to weekly.0, the older weekly.1 is removed.
If next sync will fail, weekly.1 is still gone.
With my scheduler, if sync fails, all rotation are stopped and thus i preserve the whole backup tree.[/quote]

If everything was successful, you'd have deleted the oldest backup, so obviously you don't really care about any data which is only present in that backup.  Having 14 backup directories rather than 13 isn't more successful or less successful; having sync breaking means you are not backing up current data, which is definitely a problem.

Put another way, if you have things like "rotate then sync" and the sync fails, you should raise all the alarms you have and deal with that IMMEDIATELY.  If you have things like "sync then rotate" and the sync fails, you should raise all the alarms you have and deal with that IMMEDIATELY.  There's really not much difference here, IMHO.  If someone wants to restore something from yesterday and you find that you have a complete set of 14 daily backups that are from 6 weeks ago, nobody is going to be happy about that.

-scott
Parallelism and deduplication
May 01, 2016 10:39AM
On 01/05/2016 19:12, Scott Hess wrote:
[quote]So your goal is to get the backups done before a particular time, but
you don't actually care about load on the target server if that happens?
[/quote]
Yes and no.
Having backups run until 07:00 in the morning results in low
load on the servers and almost zero I/O contention with other services.
Running during the day, even with "nice 19" and "ionice idle", would
result in backups taking the whole day and still causing high load.
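By "nice 19" and "ionice idle" I mean wrapping the invocation like this (just a sketch, using one of my host configs as an example):

nice -n 19 ionice -c3 /usr/local/bin/rsnapshot_scheduler /etc/rsnapshot.d/hosts/server1.conf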

[quote]For most reasonable backup servers, you should have more than enough
CPU to run parallel backups, but you might want to pay attention to
having enough I/O capacity and memory. I'd be nervous about running
parallel backups on the same spindles, unless you have something which
nicely spreads the load across spindles.
[/quote]I'm using hardware RAID-10
I/O is spread across 12 disks (6+6)

[quote]Then run it earlier?
[/quote]
From 08:00 to 23:00 I can't.
[quote]
Put another way, if you have things like "rotate then sync" and the
sync fails, you should raise all the alarms you have and deal with
that IMMEDIATELY. If you have things like "sync then rotate" and the
sync fails, you should raise all the alarms you have and deal with that
IMMEDIATELY. There's really not much difference, here, IMHO. If
someone wants to restore something from yesterday and you find that
you have a complete set of 14 daily backups that are from 6 weeks ago,
nobody is going to be happy about that.
[/quote]Let's assume I'm on vacation or unable to deal immediately with the
issue for whatever reason.
In the first case, I'm still losing restore points: rotation still
happens, older backups are deleted, and so on.
In the second case, new backups are not made because of the issue, but the
older ones are still there.

But yes, due to the nature of rsnapshot, having 13 days or 14 doesn't
make any difference.

One question: is daily.13 a "standalone" backup? Can I move it to
another server, to a DVD, or somewhere else and still have the whole
backup available? Can I remove everything up to daily.13
(daily.0...12) and still have daily.13 available and complete?

In Bacula, removing the "full" backup means breaking everything else
after it. Probably this is why I'm trying to stop the rotation if a backup
fails. Rotating/purging a Bacula volume could leave all backups
broken, for example:

full
incr
incr
incr
incr
incr
differential
incr
incr
incr
incr
full not executed properly and backup stopped here for some days

the last full is not completed, but Bacula still expires the older
ones based on the retention period.
When the first full is purged, you have lost ALL backups for that server.
If you made a mistake in the rotation period, you could lose all backups
with just one failure. Probably this can't happen with rsnapshot, as all
backups are hardlinked and not based on the previous one.

Parallelism and deduplication
May 01, 2016 11:33AM
On Sun, May 1, 2016 at 10:35 AM, Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:
[quote]On 01/05/2016 19:12, Scott Hess wrote:
[quote] For most reasonable backup servers, you should have more than enough CPU to run parallel backups, but you might want to pay attention to having enough I/O capacity and memory.  I'd be nervous about running parallel backups on the same spindles, unless you have something which nicely spreads the load across spindles.
[/quote] I'm using hardware RAID-10.
I/O is spread across 12 disks (6+6).
[/quote]

Read I/O across 12, write I/O across 6, that probably gives you headroom to run a certain number of parallel rsync processes without too much backpressure between them.  You may want to monitor timestamps on your .sync directories, though, to make certain they don't suddenly start rising all at the same time.
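A quick way to watch that (a sketch; adjust the snapshot root):

# Show when each host's .sync directory was last modified.
find /var/backups/rsnapshot -maxdepth 2 -name .sync \
    -printf '%TY-%Tm-%Td %TH:%TM  %p\n' | sort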

[quote][quote]Then run it earlier?
[/quote]
[quote]From 08:00 to 23:00 I can't.[/quote]
[/quote]
You can&#39;t run the sync pass (the one full of rsyncs), or you can&#39;t run the rotation (cp -al and rm -rf)?

If this is driven by a need to maintain 14 days of backups, then why not run the rotation a little earlier and retain 15 days?  When using sync_first, the rotation only operates on the backup server, no rsync to remote servers.

[quote][quote] Put another way, if you have things like "rotate then sync" and the sync fails, you should raise all the alarms you have and deal with that IMMEDIATELY.  If you have things like "sync then rotate" and the sync fails, you should raise all the alarms you have and deal with that IMMEDIATELY.  There's really not much difference here, IMHO.  If someone wants to restore something from yesterday and you find that you have a complete set of 14 daily backups that are from 6 weeks ago, nobody is going to be happy about that.
[/quote] Let's assume I'm on vacation or unable to deal immediately with the issue for whatever reason.
In the first case, I'm still losing restore points: rotation still happens, older backups are deleted, and so on.
In the second case, new backups are not made because of the issue, but the older ones are still there.
But yes, due to the nature of rsnapshot, having 13 days or 14 doesn't make any difference.
[/quote]

I guess in the end the easiest way to have the system grind to a stop if sync fails is the way you should go with.

BUT, keep in mind that with that many systems to back up, it seems likely that sync is going to fail sometimes for some systems, so the question is what to do when one system's sync is failing but the rest are fine?

[quote] One question: is daily.13 a "standalone" backup? Can I move it to another server, to a dvd, or something else and still have the whole backup available ? Can I remove everything up to daily.13 (daily.0....12) and still have the daily.13 available and complete?
[/quote]

Each snapshot should be a complete copy of the directory structure, with all unique files uniquely present, and all files shared with the previous backup hardlinked.  So in my snapshot root, I can do:

# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls

They&#39;re all the same file (other refs are in 4x hourly.?, 5x weekly.?, and .sync).  I could copy or move daily.3 to a different disk, and the files in all the others would still be present.  There are no incrementals.

-scott
Parallelism and deduplication
May 01, 2016 12:38PM
On Sun, 1 May 2016 18:44:40 +0200
Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:

[quote]Il 01/05/2016 06:06, Christopher Barry ha scritto:
[quote]The most important thing is to understand what truly needs to be
backed up. Looking at your exclusion lists (in a prior thread
element) leads me to believe that you might be backing up a bunch of
stuff that's really OS package data, and not your own specific
unique data. Understand that you will never restore an OS from this
backup. Indeed, having any OS data could create major issues for you
during a restore operation later.
[/quote]Why?
[/quote]
because you're unwittingly asking for pain.

[quote]Almost all of my server are virtual.
In the past, when I had to restore the whole server, I did a clone
from another, and restored the whole backup, even the OS.

One of Linux advantages is that everything is a file, thus everything
could be backupped and restored.
[/quote]
Open files will bite you. Versions will bite you.

[quote]

[quote]That said, if you are pulling a bunch of similar nodes to a central
host, maybe you should rethink what you really need. Adding another
disk to each box that only handles backups seriously ups the
security of the data, reduces backup time and resource requirements,
while eliminating network bandwidth consumption. From that backup,
the 'previous' node numerically (current == node0008, previous ==
node0007) could then pull current's backup, niced way down (bwlimit
in rsync), over the network. Now, the data exists in three places.
In this model there is no centralized server or point of failure. 3
separate physical disks scattered between 2 boxes would need to fail
before you'd lose data. Centralization is a double-edged sword.
Sure, single point of management, but if the central backup server
craps out, everything is at risk. In almost every analysis of
ensuring redundancy, a distributed system wins.
[/quote]
In my environment I have 6 backup server, not 1.
[/quote]
and it sounds like you need them, given your use case and methodology.

[quote]What I've posted here is a simplification, just a POC on what I'm
working on: parallelism and dedup, not an exact mirror of my
environemnt.

I'm backing up tons of small VM, about 80-90 for each backup servers.
Most of them are almost identical (for example, varnish node
1,2,3,4,5,6 are all the same, DNS1,2,3,4 are identical, MX1,2,3,4 are
identical and so on).
Backupping them totally, in my environment, is easier than choose what
to backup and what to ignore, because I have the same VM template with
the same rsync configuration cloned multiple times and rsnapshot
automatically detect the new server (my hostlist in
/etc/rsnapshot.d/hosts/*.conf is created dynamically) and run the
backup every night.
[/quote]
Just a very 'sledgehammer' approach IMHO.

Initially I assumed we were dealing with a lot of bare-metal servers.
Either way, my point is your approach is not really scalable.

Consider your use of VMs here. You copy the same template OS image a
bunch of times, then change the configuration data inside each copy to
'personalize' them. Sans data, these images are likely 99.95%
identical. Apparently, you're also storing the node's data inside these
image files as well. All you really care about is what makes this node
unique.

A better way to have a bunch of identical varnish, or dns, or mail, or
whatever servers would be to PXE boot a single read-only generic vm
image as many times as required, likely using iPXE[1], with a writable
disk file as an overlay fs for configuration and/or data per node. This
writable overlay is the node's 'personality' if you will. You'd get the
node's identity via dhcp of course. You might also attach to and/or
mount additional needed unique data space from elsewhere using iscsi or
nfs, or some other method. The writable config images and other mounted
space is the data you'd backup with rsnapshot, never *any* of the OS
image itself this way.

You'd backup that base image just once somewhere. Update the image,
deploy it, and you've just updated *all* vms after they reboot. Try out
a completely new image with one of the existing config images, and
easily roll back if there's an issue. The bootable primary image is
read-only, so to recover from a compromise requires only a simple clean
reboot. Any nefarious code or data is easily viewable in the config
overlay image for forensics.

One good use of centralization though is a log server to catch all the
nodes' logs in a single place. That makes log analysis simple for all
nodes, and keeps the config images light.

You need parallel and de-duplication now because your fundamental
understanding of how best to deploy a lot of similar VM nodes is
insufficient. My suggestion was for you to rethink the underlying
problem, not how to best make up for a flawed design later. I'm not
trying to be confrontational Gandalf, just sharing hard-won knowledge
from a lifetime of systems administration. You are designing a fix to a
problem that is really a symptom of a non-scalable, non-optimal
methodology. I am merely trying to help you view this from a different
perspective. One that I will agree can be difficult to see while you're
embroiled in keeping your head above water.

I wish you only good luck in however you choose to solve your problems.

-C

[1] http://ipxe.org/

[quote]
Filtering out some servers is a waste of time. It's much easier to run
something like:
$ ping -c1 ${ip}
across my subnets once a day and create the configuration file
automatically if host is up

[/quote]
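
(For reference, a minimal sketch of that kind of ping-driven config
generation; the template, subnet and output path here are placeholders,
not the actual script:)

#!/bin/bash
TEMPLATE=/etc/rsnapshot.d/host.conf.template
OUTDIR=/etc/rsnapshot.d/hosts
for ip in 10.0.0.{1..254}; do
    # one ping, one second timeout; host answers -> write a config
    if ping -c1 -W1 "${ip}" >/dev/null 2>&1; then
        sed "s/@HOST@/${ip}/g" "${TEMPLATE}" > "${OUTDIR}/${ip}.conf"
    fi
done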

--
Regards,
Christopher

Parallelism and deduplication
May 01, 2016 08:27PM
On Sun, May 1, 2016 at 3:35 PM, Christopher Barry
<christopher.r.barry < at > gmail.com> wrote:
[quote]On Sun, 1 May 2016 18:44:40 +0200
Gandalf Corvotempesta <gandalf.corvotempesta < at > gmail.com> wrote:

[quote]Il 01/05/2016 06:06, Christopher Barry ha scritto:
[quote]The most important thing is to understand what truly needs to be
backed up. Looking at your exclusion lists (in a prior thread
element) leads me to believe that you might be backing up a bunch of
stuff that's really OS package data, and not your own specific
unique data. Understand that you will never restore an OS from this
backup. Indeed, having any OS data could create major issues for you
during a restore operation later.
[/quote]Why?
[/quote]
because you're unwittingly asking for pain.
[/quote]
I've done it, but it took real caution. Database files that may be in
the midst of an atomic transaction, and not yet written to the
filesystem, are a real risk, for relational databases in particular.

Parallelism and deduplication
May 02, 2016 12:35AM
2016-05-01 4:31 GMT+02:00 Nico Kadel-Garcia <nkadel < at > gmail.com>:
[quote][quote]The "cp -al" pass runs before the rsync, so on the backup server in
isolation. Using link-dest would push the "cp -al" I/O into the rsync
itself, so the rsync will likely take longer.
[/quote][/quote]
I don't think it's lightning fast.
I tried yesterday and "cp -al" took more than 2 hours, plus the
rsync phase (about 1 hour).
rsync with link-dest took 1 hour and 20 minutes.

I'll see tonight; yesterday's backup ran "cp -al" from daily.0 to
.sync and, after everything else, cp -al from .sync to daily.0.

Probably because, using link-dest, there was no .sync directory.
Let's see tonight.
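
If I understand it right, link_dest just folds the hard-linking into
the rsync call itself, something roughly shaped like this (paths are
placeholders, not my real config):

# instead of a separate "cp -al" pass beforehand, unchanged files get
# hard-linked against the previous snapshot during the transfer
rsync -a --delete --numeric-ids \
      --link-dest=/snapshot_root/daily.0/ \
      rsync://backupuser@somehost/module/ \
      /snapshot_root/.sync/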

Parallelism and deduplication
May 02, 2016 12:39AM
2016-05-01 21:35 GMT+02:00 Christopher Barry <christopher.r.barry < at > gmail.com>:
[quote]open files will bite you. versions will bite you.
[/quote]
If I have to restore a whole system, there are no open files to restore.
Obviously I restore using an external system, for example a boot CD with
my (empty) disks mounted on /mnt.

[quote]You need parallel and de-duplication now because your fundamental
understanding of how best to deploy a lot of similar VM nodes is
insufficient. My suggestion was for you to rethink the underlying
problem, not how to best make up for a flawed design later. I'm not
trying to be confrontational Gandalf, just sharing hard-won knowledge
from a lifetime of systems administration. You are designing a fix to a
problem that is really a symptom of a non-scalable, non-optimal
methodology. I am merely trying to help you view this from a different
perspective. One that I will agree can be difficult to see while you're
embroiled in keeping your head above water.
[/quote]
Your approach is absolutely correct, but you're missing some points:

1) the current architecture wasn't planned by me but by someone else
who is no longer working here.
2) my CEO would not agree to change the whole architecture (running
stably for many, many years) just to use a backup software.
3) based on point 2, it would be easier for the CEO to change the backup
software, or maybe the current sysadmin (me), than to rethink
the whole environment.

Parallelism and deduplication
May 02, 2016 12:46AM
2016-05-01 20:30 GMT+02:00 Scott Hess <scott < at > doubleu.com>:
[quote]I guess in the end the easiest way to have the system grind to a stop if
sync fails is the way you should go with.

BUT, keep in mind that with that many systems to backup, it seems likely
that sync is going to fail sometimes for some systems, so the question is
what to do when one system's sync is failing but the rest are fine?
[/quote]
If you read the code that I've posted, you can see that I'm looking
for "sync: completed successfully" for every
host. In case of failure, only that host is "frozen". That's why I'm
using a wrapper script to run rsnapshot in parallel
rather than running rsnapshot directly.

After each backup (the script is run for each host), I look for
"sync: completed successfully" in the log file.
If found, I start the rotations. If not, I exit and send an email,
but just for that single host.

[quote]Each snapshot should be a complete copy of the directory structure, with all
unique files uniquely present, and all files shared with the previous backup
hardlinked. So in my snapshot root, I can do:

# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls

They're all the same file (other refs are in 4x hourly.?, 5x weekly.?, and
.sync). I could copy or move daily.3 to a different disk, and the files in
all the others would still be present. There are no incrementals.
[/quote]
That's clear.
What I would like to know (or better, what I would like to have
confirmed) is:

Can I copy "daily.3" somewhere else? In that case, daily.3 would be
a complete copy of my backed-up server, like a "full" backup, right?
Can I delete EVERYTHING except daily.3 and still have daily.3
fully available?

Like my example with Bacula: deleting a full backup in Bacula makes
all subsequent backups unavailable. With hardlinks this should
never happen, as
hardlinked files are removed only when the link count reaches 0. Deleting
everything but one directory leaves the link count >= 1, so
the whole backup is still available.
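
I guess I could verify it myself on a scratch copy, reusing the path
from your ls example, with something like:

$ stat -c '%h %n' daily.?/nbackup/bin/ls    # link count per snapshot
$ rm -rf daily.0 daily.1 daily.2            # drop some snapshots
$ stat -c '%h %n' daily.3/nbackup/bin/ls    # still readable, just fewer links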

Parallelism and deduplication
May 02, 2016 12:48AM
2016-05-02 5:25 GMT+02:00 Nico Kadel-Garcia <nkadel < at > gmail.com>:
[quote]I've done it, but it took real caution. Database files that may be in
the midst of an atomic transation, and not yet written to the
filesystem, are a real risk, for relational databases in particular.
[/quote]
I'm using a pre-xfer script to run a mysqldump on each host.
I don't restore relational DBs directly from the files, but from an SQL
dump (taken with a read lock and flush tables).
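
Something along these lines, on the rsync daemon side (module name,
script name and paths are placeholders, not my real config):

$ cat /etc/rsyncd.conf   (excerpt)
[everything]
    path = /
    pre-xfer exec = /usr/local/bin/pre_xfer_mysqldump.sh

$ cat /usr/local/bin/pre_xfer_mysqldump.sh
#!/bin/bash
# --lock-all-tables holds a global read lock for the whole dump
mysqldump --all-databases --lock-all-tables > /var/backups/mysql/all-databases.sql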

Parallelism and deduplication
May 04, 2016 06:48AM
On Mon, May 2, 2016 at 3:43 AM, Gandalf Corvotempesta
<gandalf.corvotempesta < at > gmail.com> wrote:
[quote]2016-05-01 20:30 GMT+02:00 Scott Hess <scott < at > doubleu.com>:
[quote]I guess in the end the easiest way to have the system grind to a stop if
sync fails is the way you should go with.

BUT, keep in mind that with that many systems to backup, it seems likely
that sync is going to fail sometimes for some systems, so the question is
what to do when one system's sync is failing but the rest are fine?
[/quote]
If you read the code that I've posted, you can see that I'm looking
for "sync: completed successfully" for every
host. In case of failure, only that host is "frozen". That's why I'm
using a wrapper script to run rsnapshot in parallel
rather than running rsnapshot directly.

After each backup (the script is run for each host), I look for
"sync: completed successfully" in the log file.
If found, I start the rotations. If not, I exit and send an email,
but just for that single host.

[quote]Each snapshot should be a complete copy of the directory structure, with all
unique files uniquely present, and all files shared with the previous backup
hardlinked. So in my snapshot root, I can do:

# ls -li daily.?/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.0/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.1/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.2/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.3/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.4/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.5/nbackup/bin/ls
564308 -rwxr-xr-x 17 root root 110080 Mar 10 11:10 daily.6/nbackup/bin/ls

They're all the same file (other refs are in 4x hourly.?, 5x weekly.?, and
.sync). I could copy or move daily.3 to a different disk, and the files in
all the others would still be present. There are no incrementals.
[/quote]
That's clear.
What I would like to know (or better, what I would like to have
confirmed) is:

Can I copy "daily.3" somewhere else? In that case, daily.3 would be
a complete copy of my backed-up server, like a "full" backup, right?
Can I delete EVERYTHING except daily.3 and still have daily.3
fully available?
[/quote]
Yup. If you're a weasel, you can even move it aside and put it back
later, as long as you put it back in the right order with any newly
rotated backups.

Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you. This is why
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC compatible YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.

[quote]Like my example with Bacula: deleting a full backup in Bacula makes
all subsequent backups unavailable. With hardlinks this should
never happen, as
hardlinked files are removed only when the link count reaches 0. Deleting
everything but one directory leaves the link count >= 1, so
the whole backup is still available.
[/quote]
Not a problem with rsnapshot and hardlinked copies.

Parallelism and deduplication
May 04, 2016 08:26AM
On 4 May 2016 at 14:45, Nico Kadel-Garcia <nkadel < at > gmail.com> wrote:
[quote]Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you. This is why
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC compatible YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.

[/quote]

That would be a great idea. I always have to stop and remember if daily.0 is the most recent or the least recent. And having a timestamp would add useful information that's otherwise buried in log files.

poc
Parallelism and deduplication
May 04, 2016 09:32AM
Hello, Patrick,

you wrote on 04.05.16:

[quote][quote]Copying a large daily.3 snapshot aside, and making sure it's
consistent, can be tricky if the copy rotates under you. This is why
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433",
to use full UTC compatible YYYYMMDDhhmmss date stamped names. But
I've never gotten the traction to write and submit a patch.

[/quote][/quote]
[quote]That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
[/quote]
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.

If you can't tell your file manager to show this time stamp, then a
copy of the snapshot directory (made with hard links) takes very little
space and doesn't disturb the "rsnapshot" way of rotating backups.

Best regards!
Helmut

Parallelism and deduplication
May 04, 2016 09:42AM
On Wednesday, 4 May 16 at 15:45:10, "Nico Kadel-Garcia"
<nkadel < at > gmail.com> wrote:
[quote]...
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC compatible YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.
[/quote]
For that I wrote do-timed-folders and do-del-timed-folders...

http://www.heise.de/download/do-rsnapshots-1184971.html

--
Warm regards!
Rolf Muth
My addresses may not be used for advertising!
OpenPGP Public Key:
http://pgp.uni-mainz.de:11371/pks/lookup?op=index&search=0x5544C89A

Parallelism and deduplication
May 04, 2016 09:53AM
On 4 May 2016 at 17:27, Helmut Hullen <Hullen < at > t-online.de> wrote:
[quote][quote]That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
[/quote]
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
[/quote]

Where?

poc
Parallelism and deduplication
May 04, 2016 10:01AM
On Wed, May 4, 2016 at 9:46 AM, Patrick O'Callaghan <pocallaghan < at > gmail.com> wrote:
[quote]
On 4 May 2016 at 17:27, Helmut Hullen <Hullen < at > t-online.de> wrote:
[quote][quote]That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
[/quote]
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
[/quote]

Where?

[/quote]

On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed

-scott
 
Parallelism and deduplication
May 04, 2016 10:03AM
On Wednesday, 4 May 16 at 15:45:10, "Nico Kadel-Garcia"
<nkadel < at > gmail.com> wrote:
[quote]...
I've long wanted to change the numbering scheme from "daily.0",
"daily.1", etc. to "daily.20160401010203", "daily.20160402113433", to
use full UTC compatible YYYYMMDDhhmmss date stamped names. But I've
never gotten the traction to write and submit a patch.
[/quote]
For that purpose I wrote do-timed-folders and do-del-timed-folders ...

http://www.heise.de/download/do-rsnapshots-1184971.html

--
Warm regards!
Rolf Muth
My addresses may not be used for advertising!
OpenPGP Public Key:
http://pgp.uni-mainz.de:11371/pks/lookup?op=index&search=0x5544C89A

Parallelism and deduplication
May 04, 2016 10:15AM
On 4 May 2016 at 17:58, Scott Hess <scott < at > doubleu.com> wrote:
[quote]On Wed, May 4, 2016 at 9:46 AM, Patrick O'Callaghan <pocallaghan < at > gmail.com> wrote:
[quote]
On 4 May 2016 at 17:27, Helmut Hullen <Hullen < at > t-online.de> wrote:
[quote][quote]That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
[/quote]
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
[/quote]

Where?

[/quote]

On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed

[/quote]

I thought you might say that. Unfortunately it's *not* a timestamp. It's the latest access (or, with other options, modification) time of a set of files, as seen from the rsnapshot server. There are some problems with this:

1) It's only visible to users with access to the server's filesystem (via login or remote mount).

2) It doesn't indicate when the backup was run, only when the most recent file modification happened (and that's assuming you can access all the files and not just your own). They are not the same thing.

poc
Parallelism and deduplication
May 04, 2016 10:34AM
On Wed, May 4, 2016 at 10:12 AM, Patrick O'Callaghan <pocallaghan < at > gmail.com> wrote:
[quote]On 4 May 2016 at 17:58, Scott Hess <scott < at > doubleu.com> wrote:
[quote]On Wed, May 4, 2016 at 9:46 AM, Patrick O'Callaghan <pocallaghan < at > gmail.com> wrote:
[quote]On 4 May 2016 at 17:27, Helmut Hullen <Hullen < at > t-online.de> wrote:
[quote][quote]That would be a great idea. I always have to stop and remember if
daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried in log
files.
[/quote]
Sorry - the backups _have_ a time stamp. They don't need the same (or
another) date string in the directory name.
[/quote]

Where?

[/quote]

On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed

[/quote]

I thought you might say that. Unfortunately it's *not* a timestamp. It's the latest access (or, with other options, modification) time of a set of files, as seen from the rsnapshot server. There are some problems with this:

1) It's only visible to users with access to the server's filesystem (via login or remote mount).

[/quote]

How do users get access to the in-the-directory-name timestamps if they don't have access to the filesystem where those directories live?

[quote]2) It doesn't indicate when the backup was run, only when the most recent file modification happened (and that's assuming you can access all the files and not just your own). They are not the same thing.

[/quote]

Rsnapshot sets the directory timestamp using touch, I think as the last thing it does.  It's intentional, not some sort of side effect.  I'm not sure how renaming the directory would somehow be stronger than that.  If the actual timestamp changes after that, it's because someone is messing around with the files, which in my experience should be avoided at all cost.

For my rsnapshot root, the timestamps are ~40s after the time when the cron job is scheduled to run.  They were ~20m until I changed to sync&&hourly, with a sync pass run about an hour ahead of time.

Personally, I wouldn't mind having a .timestamp file in the snapshot, which would contain a printed form of the timestamp rsnapshot used for the touch, and perhaps with the same timestamp as the directory using touch -r.  That file would never need to be renamed after the snapshot is taken, so it would be like a two-line patch.  Something like that would be more reliable for scripting, as the script could either use the filesystem timestamp or parse the printed timestamp, and the printed timestamp could include high precision without needing config options.
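
By hand, the effect would be roughly this, run from the snapshot root right after a snapshot finishes (illustrative only, not what rsnapshot does today):

# record a printed timestamp inside the snapshot, then give the file
# the same mtime as the snapshot directory itself
date -u '+%Y-%m-%d %H:%M:%S UTC' > .sync/.timestamp
touch -r .sync .sync/.timestamp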

-scott
Parallelism and deduplication
May 04, 2016 03:02PM
On 4 May 2016 at 18:30, Scott Hess <scott < at > doubleu.com> wrote:
[quote][quote]I thought you might say that. Unfortunately it's *not* a timestamp. It's the latest access (or, with other options, modification) time of a set of files, as seen from the rsnapshot server. There are some problems with this:

1) It's only visible to users with access to the server's filesystem (via login or remote mount).

[/quote]

How do users get access to the in-the-directory-name timestamps if they don't have access to the filesystem where those directories live?
[/quote]

That's a fair point. Nevertheless, being able to list a set of directory names requires far fewer rights than listing all files with their access times. The only requirement is read access to the top-level directory where the backups live.
 
[quote][quote]2) It doesn't indicate when the backup was run, only when the most recent file modification happened (and that's assuming you can access all the files and not just your own). They are not the same thing.

[/quote]

Rsnapshot sets the directory timestamp using touch, I think as the last thing it does.  It's intentional, not some sort of side effect.  I'm not sure how renaming the directory would somehow be stronger than that.  If the actual timestamp changes after that, it's because someone is messing around with the files, which in my experience should be avoided at all cost.
[/quote]

Perhaps, but that behaviour is not documented as far as I know.
 
[quote]For my rsnapshot root, the timestamps are ~40s after the time when the cron job is scheduled to run.  They were ~20m until I changed to sync&&hourly, with a sync pass run about an hour ahead of time.

Personally, I wouldn't mind having a .timestamp file in the snapshot, which would contain a printed form of the timestamp rsnapshot used for the touch, and perhaps with the same timestamp as the directory using touch -r.  That file would never need to be renamed after the snapshot is taken, so it would be like a two-line patch.  Something like that would be more reliable for scripting, as the script could either use the filesystem timestamp or parse the printed timestamp, and the printed timestamp could include high precision without needing config options.
[/quote]

That would work, though with slightly more access rights than already mentioned (+x permission on the backup directories and +r on the .timestamp files).

Of course if user backup directories are generally readable by their owners, the permissions question is not an issue. That may be the normal case.

poc
Parallelism and deduplication
May 04, 2016 04:22PM
On Wed, 4 May 2016 10:30:58 -0700
Scott Hess <scott < at > doubleu.com> wrote:

[quote]On Wed, May 4, 2016 at 10:12 AM, Patrick O'Callaghan
<pocallaghan < at > gmail.com> wrote:

[quote]On 4 May 2016 at 17:58, Scott Hess <scott < at > doubleu.com> wrote:

[quote]On Wed, May 4, 2016 at 9:46 AM, Patrick O'Callaghan <
pocallaghan < at > gmail.com> wrote:

[quote]On 4 May 2016 at 17:27, Helmut Hullen <Hullen < at > t-online.de> wrote:

[quote][quote]That would be a great idea. I always have to stop and remember
if daily.0 is the most recent or the least recent. And having a
timestamp would add useful information that's otherwise buried
in log files.
[/quote]
Sorry - the backups _have_ a time stamp. They don't need the same
(or another) date string in the directory name.

[/quote]
Where?

[/quote]
On Linux, ls -latr is what you want.
-l long output
-a all files (including .sync)
-t sort by time
-r reversed

[/quote]
I thought you might say that. Unfortunately it's *not* a timestamp.
It's the latest access (or, with other options, modification) time
of a set of files, as seen from the rsnapshot server. There are some
problems with this:

1) It's only visible to users with access to the server's filesystem
(via login or remote mount).

[/quote]
How do users get access to the in-the-directory-name timestamps if they
don't have access to the filesystem where those directories live?

[quote]2) It doesn't indicate when the backup was run, only when the most
recent file modification happened (and that's assuming you can access
all the files and not just your own). They are not the same thing.

[/quote]
Rsnapshot sets the directory timestamp using touch, I think as the last
thing it does. It's intentional, not some sort of side effect. I'm
not sure how renaming the directory would somehow be stronger than
that. If the actual timestamp changes after that, it's because
someone is messing around with the files, which in my experience
should be avoided at all cost.

For my rsnapshot root, the timestamps are ~40s after the time when the
cron job is scheduled to run. They were ~20m until I changed to
sync&&hourly, with a sync pass run about an hour ahead of time.

Personally, I wouldn't mind having a .timestamp file in the snapshot,
which would contain a printed form of the timestamp rsnapshot used for
the touch, and perhaps with the same timestamp as the directory using
touch -r. That file would never need to be renamed after the snapshot
[/quote]
in cron, do a:

date > /etc/rsnapshot/.timestamp && rsnapshot <increment>

where /etc/rsnapshot is one of the backed up directories in every
backup.

and bada-bing, Bob's your Uncle. :)

[quote]is taken, so it would be like a two-line patch. Something like that
would be more reliable for scripting, as the script could either use
the filesystem timestamp or parse the printed timestamp, and the
printed timestamp could include high precision without needing config
options.

-scott
[/quote]

--
Regards,
Christopher
