BackupPC_nightly taking very long
Post BackupPC_nightly taking very long 
On Mon, 2004-04-05 at 08:16, Craig Barratt wrote:

Please either comment out the line

system("$BinDir/BackupPC_sendEmail");

in BackupPC_nightly, or run $BinDir/BackupPC_sendEmail manually as
the BackupPC user.

$ time ./BackupPC_sendEmail

real 3m33.118s
user 3m30.019s
sys 0m2.384s

Craig

Regards,

--
Guus Houtzager Email: guus < at > houtzager.net
PGP fingerprint = 5E E6 96 35 F0 64 34 14 CC 03 2B 36 71 FB 4B 5D
"A)bort, R)etry, I)nfluence with large hammer."

Post BackupPC_nightly taking very long 
Hello,

sorry for not speaking up earlier, but time and wanting to check out
some things first prevented me from doing so. (Thanks Guus!)

Some months ago (about October/November), my colleague (Douglas
Thomson) and I were working on the problem of BackupPC_nightly running
for so long. We decided that the cause was all the disk seeking needed
to visit each file in turn, since the files are spread all over the disk.

My colleague wrote a C program, and patched BackupPC_nightly, and our
overnight run went from over 8 hours, to 10 minutes on an ext3 filesystem.

The way it works is to scan the pool directory (directories only!), and
then process each file in inode number order. The presumption is that
inodes are grouped together on the disk, and so less disk seeking is needed.

Attached are a C program, an example perl program, and a patched
BackupPC_nightly (without all the recent speed-ups pasted on this list
or cvs).

The C program just reads directory info and spits it out, but that dir
info includes the inode number. A couple of shell commands sorts things
into inode order, and the simple find function just visits each file in
turn with the standard visit function.
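
A minimal Python sketch of the inode-order idea described above. It is not the attached C program: the function names are made up for illustration, and it relies on os.scandir() reporting the inode from the directory entry, which on most Linux filesystems avoids a stat() per file while the list is being built.

```python
import os


def scan_inodes(top):
    """Return (inode, path) pairs for regular files under top, sorted by inode.

    The inode comes from the directory entry (entry.inode()), so building
    the list normally needs no stat() call per file on Linux.
    """
    pairs = []
    stack = [top]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    pairs.append((entry.inode(), entry.path))
    pairs.sort()  # tuples sort by inode number first
    return pairs


def visit_in_inode_order(top, visit):
    """Call visit(path) for every file under top, lowest inode first."""
    for _inode, path in scan_inodes(top):
        visit(path)
```

The visit callback plays the role of the standard visit function mentioned above; it is where the per-file stat() and any unlink would happen, now in an order that should keep the disk head moving mostly in one direction.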

However, I feel this is really only a temporary hack.

I think the proper way of handling this is to do reference counting.
If a database of the files in the pool was kept (probably db, or maybe
sqlite), a simple db query would give a list of the files to clean up.

Maintaining consistency between the filesystem and the db would probably
be the next trick. But it should be possible to do that in parallel to
normal operations, rather than locking everything up. (Scan filesystem
for candidate files for deletion in parallel, then check them against
the db.)
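
To make the reference-counting idea concrete, here is a small Python sketch using the standard sqlite3 module. The table name pool_refs, its columns, and the helper functions are all hypothetical; nothing like this exists in BackupPC, and keeping the table in step with the filesystem is left out entirely.

```python
import sqlite3


def open_pool_db(path=":memory:"):
    """Open (or create) a database with one reference count per pool file."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pool_refs ("
        " pool_file TEXT PRIMARY KEY,"
        " refs INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def add_ref(db, pool_file):
    """Record one more backup file pointing at pool_file."""
    with db:  # both statements commit together as one transaction
        db.execute(
            "INSERT OR IGNORE INTO pool_refs (pool_file) VALUES (?)",
            (pool_file,),
        )
        db.execute(
            "UPDATE pool_refs SET refs = refs + 1 WHERE pool_file = ?",
            (pool_file,),
        )


def drop_ref(db, pool_file):
    """Record that one backup file pointing at pool_file was deleted."""
    with db:
        db.execute(
            "UPDATE pool_refs SET refs = refs - 1"
            " WHERE pool_file = ? AND refs > 0",
            (pool_file,),
        )


def unreferenced(db):
    """The 'simple db query': pool files that no backup refers to any more."""
    rows = db.execute(
        "SELECT pool_file FROM pool_refs WHERE refs = 0 ORDER BY pool_file"
    )
    return [row[0] for row in rows]
```

With something like this, the nightly job would ask unreferenced() for its work list instead of walking the whole pool.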

Another way of looking at it is to see it as a kind of garbage
collection problem. Languages seem to have that under control, so there
may be something to learn from them.

SiMoN



#!/usr/bin/perl
#============================================================= -*-perl-*-
#
# BackupPC_nightly: Nightly cleanup & statistics script.
#
# DESCRIPTION
# BackupPC_nightly performs several administrative tasks:
#
# - monthly aging of per-PC log files
#
# - pruning files from pool no longer used (ie: those with only one
# hard link).
#
# - sending email to users and administrators.
#
# AUTHOR
# Craig Barratt <cbarratt < at > users.sourceforge.net>
#
# COPYRIGHT
# Copyright (C) 2001 Craig Barratt
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
#========================================================================
#
# Version 2.0.2, released 6 Oct 2003.
#
# See http://backuppc.sourceforge.net.
#
#========================================================================

use strict;
no utf8;
use lib "/usr/share/backuppc/lib";
use BackupPC::Lib;
use BackupPC::FileZIO;

use File::Find;
use File::Path;
use Data::Dumper;

die("BackupPC::Lib->new failed\n") if ( !(my $bpc = BackupPC::Lib->new) );
my $TopDir = $bpc->TopDir();
my $BinDir = $bpc->BinDir();
my %Conf = $bpc->Conf();

$bpc->ChildInit();

my $err = $bpc->ServerConnect($Conf{ServerHost}, $Conf{ServerPort});
if ( $err ) {
    print("Can't connect to server ($err)\n");
    exit(1);
}
my $reply = $bpc->ServerMesg("status hosts");
$reply = $1 if ( $reply =~ /(.*)/s );
my(%Status, %Info, %Jobs, @BgQueue, @UserQueue, @CmdQueue);
eval($reply);

###########################################################################
# When BackupPC_nightly starts, BackupPC will not run any simultaneous
# BackupPC_dump commands. We first do things that contend with
# BackupPC_dump, eg: aging per-PC log files etc.
###########################################################################

#
# Do per-PC log file aging
#
my($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
if ( $mday == 1 ) {
    foreach my $host ( keys(%Status) ) {
        my $lastLog = $Conf{MaxOldPerPCLogFiles} - 1;
        unlink("$TopDir/pc/$host/LOG.$lastLog")
                if ( -f "$TopDir/pc/$host/LOG.$lastLog" );
        unlink("$TopDir/pc/$host/LOG.$lastLog.z")
                if ( -f "$TopDir/pc/$host/LOG.$lastLog.z" );
        for ( my $i = $lastLog - 1 ; $i >= 0 ; $i-- ) {
            my $j = $i + 1;
            if ( -f "$TopDir/pc/$host/LOG.$i" ) {
                rename("$TopDir/pc/$host/LOG.$i", "$TopDir/pc/$host/LOG.$j");
            } elsif ( -f "$TopDir/pc/$host/LOG.$i.z" ) {
                rename("$TopDir/pc/$host/LOG.$i.z",
                       "$TopDir/pc/$host/LOG.$j.z");
            }
        }
        #
        # Compress the log file LOG -> LOG.0.z (if enabled).
        # Otherwise, just rename LOG -> LOG.0.
        #
        BackupPC::FileZIO->compressCopy("$TopDir/pc/$host/LOG",
                                        "$TopDir/pc/$host/LOG.0.z",
                                        "$TopDir/pc/$host/LOG.0",
                                        $Conf{CompressLevel}, 1);
        open(LOG, ">", "$TopDir/pc/$host/LOG") && close(LOG);
    }
}

###########################################################################
# Get statistics on the pool, and remove files that have only one link.
###########################################################################

my $fileCnt;       # total number of files
my $dirCnt;        # total number of directories
my $blkCnt;        # total block size of files
my $fileCntRm;     # total number of removed files
my $blkCntRm;      # total block size of removed files
my $blkCnt2;       # total block size of files with just 2 links
                   # (ie: files that only occur once among all backups)
my $fileCntRep;    # total number of file names containing "_", ie: files
                   # that have repeated md5 checksums
my $fileRepMax;    # worst case number of files that have repeated checksums
                   # (ie: max(nnn+1) for all names xxxxxxxxxxxxxxxx_nnn)
my $fileLinkMax;   # maximum number of hardlinks on a pool file
my $fileCntRename; # number of renamed files (to keep file numbering
                   # contiguous)
my %FixList;       # list of paths that need to be renamed to avoid
                   # new holes
for my $pool ( qw(pool cpool) ) {
    $fileCnt       = 0;
    $dirCnt        = 0;
    $blkCnt        = 0;
    $fileCntRm     = 0;
    $blkCntRm      = 0;
    $blkCnt2       = 0;
    $fileCntRep    = 0;
    $fileRepMax    = 0;
    $fileLinkMax   = 0;
    $fileCntRename = 0;
    %FixList       = ();
    # Original find call
    # find({wanted => \&GetPoolStats, no_chdir => 1}, "$TopDir/$pool");
    inodefind( \&GetPoolStats, "$TopDir/$pool");
    my $kb   = $blkCnt / 2;
    my $kbRm = $blkCntRm / 2;
    my $kb2  = $blkCnt2 / 2;

    #
    # Now make sure that files with repeated checksums are still
    # sequentially numbered
    #
    foreach my $name ( sort(keys(%FixList)) ) {
        my $rmCnt = $FixList{$name} + 1;
        my $new = -1;
        for ( my $old = -1 ; ; $old++ ) {
            my $oldName = $name;
            $oldName .= "_$old" if ( $old >= 0 );
            if ( !-f $oldName ) {
                #
                # We know we are done when we have missed at least
                # the number of files that were removed from this
                # base name, plus a couple just to be sure
                #
                last if ( $rmCnt-- <= 0 );
                next;
            }
            my $newName = $name;
            $newName .= "_$new" if ( $new >= 0 );
            $new++;
            next if ( $oldName eq $newName );
            rename($oldName, $newName);
            $fileCntRename++;
        }
    }
    print("BackupPC_stats = $pool,$fileCnt,$dirCnt,$kb,$kb2,$kbRm,$fileCntRm,"
        . "$fileCntRep,$fileRepMax,$fileCntRename,"
        . "$fileLinkMax\n");
}

###########################################################################
# Tell BackupPC that it is now ok to start running BackupPC_dump
# commands. We are guaranteed that no BackupPC_link commands will
# run since only a single CmdQueue command runs at a time, and
# that means we are safe.
###########################################################################
printf("BackupPC_nightly lock_off\n");

###########################################################################
# Send email
###########################################################################
system("$BinDir/BackupPC_sendEmail");

sub GetPoolStats
{
    my($name) = $File::Find::name;
    my($baseName) = "";
    my(@s);

    return if ( !-d && !-f );
    $dirCnt += -d;
    $name = $1 if ( $name =~ /(.*)/ );
    @s = stat($name);
    if ( $name =~ /(.*)_(\d+)$/ ) {
        $baseName = $1;
        if ( $s[3] != 1 ) {
            $fileRepMax = $2 + 1 if ( $fileRepMax <= $2 );
            $fileCntRep++;
        }
    }
    if ( -f && $s[3] == 1 ) {
        $blkCntRm += $s[12];
        $fileCntRm++;
        unlink($name);
        #
        # We must keep repeated files numbered sequential (ie: files
        # that have the same checksum are appended with _0, _1 etc).
        # There are two cases: we remove the base file xxxx, but xxxx_0
        # exists, or we remove any file of the form xxxx_nnn. We remember
        # the base name and fix it up later (not in the middle of find).
        #
        $baseName = $name if ( $baseName eq "" );
        $FixList{$baseName}++;
    } else {
        $fileCnt += -f;
        $blkCnt += $s[12];
        $blkCnt2 += $s[12] if ( -f && $s[3] == 2 );
        $fileLinkMax = $s[3] if ( $fileLinkMax < $s[3] );
    }
}

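# Doug's special find command for looking up files in inode order.
# The pipeline implies that inodefind prints "<inode> <path>" lines:
# sort -n orders them by inode number and cut keeps only the path.
sub inodefind {
    my ($wanted, $dir) = @_;
    open(FILE, "/usr/local/bin/inodefind $dir | /usr/bin/sort -n | /usr/bin/cut -d' ' -f2|")
        || die("Can't run inodefind on $dir: $!\n");
    while (defined(my $name = <FILE>)) {
        chomp($name);
        $File::Find::name = $name;
        stat($name);
        $_ = $name;     # GetPoolStats applies -d and -f to $_
        &$wanted;
    }
    close(FILE);
}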

Post BackupPC_nightly taking very long 
Hi,

On Wed, 2004-04-07 at 14:32, Simon Strack wrote:
Hello,

sorry for not speaking up earlier, but time and wanting to check out
some things first prevented me from doing so. (Thanks Guus!)

:)

Some months ago (about October/November), my colleague (Douglas
Thomson) and I were working on the problem of BackupPC_nightly running
for so long. We decided that the cause was all the disk seeking needed
to visit each file in turn, since the files are spread all over the disk.

My colleague wrote a C program, and patched BackupPC_nightly, and our
overnight run went from over 8 hours, to 10 minutes on an ext3 filesystem.

It went on my setup from about 7 hours to about 50 minutes (2.6.4
kernel, reiserfs v3, noatime).

2004/4/7 01:55:55 Running BackupPC_nightly (pid=5529)
2004/4/7 01:55:56 Pool nightly clean removed 0 files of size 0.00GB
2004/4/7 01:55:56 Pool is 0.00GB, 0 files (0 repeated, 0 max chain, 2 max links), 1 directories
2004/4/7 02:00:00 Next wakeup is 2004/4/7 03:00:00
2004/4/7 02:46:40 Cpool nightly clean removed 51945 files of size 5.04GB
2004/4/7 02:46:40 Cpool is 80.74GB, 2680108 files (115 repeated, 11 max chain, 60611 max links), 4369 directories

<snip>

SiMoN



______________________________________________________________________
#!/usr/bin/perl
#============================================================= -*-perl-*-
#
# BackupPC_nightly: Nightly cleanup & statistics script.
#

<snip>

###########################################################################
# Send email
###########################################################################
system("$BinDir/BackupPC_sendEmail");

sub GetPoolStats
{
    my($name) = $File::Find::name;
    my($baseName) = "";
    my(@s);

I needed this to get it working right:

sub GetPoolStats
{
    my($name) = $File::Find::name;
    $_ = $name;

    my($baseName) = "";
    etc etc

<snip>

Thanks for your time and effort!

Regards,

--
Guus Houtzager Email: guus < at > houtzager.net
PGP fingerprint = 5E E6 96 35 F0 64 34 14 CC 03 2B 36 71 FB 4B 5D
"A)bort, R)etry, I)nfluence with large hammer."





-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
BackupPC-users mailing list
BackupPC-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/backuppc-users
http://backuppc.sourceforge.net/

Post BackupPC_nightly taking very long 
Guus Houtzager wrote:

Hi,

On Wed, 2004-04-07 at 14:32, Simon Strack wrote:


______________________________________________________________________
#!/usr/bin/perl
#============================================================= -*-perl-*-
#
# BackupPC_nightly: Nightly cleanup & statistics script.
#



<snip>



###########################################################################
# Send email
###########################################################################
system("$BinDir/BackupPC_sendEmail");

sub GetPoolStats
{
    my($name) = $File::Find::name;
    my($baseName) = "";
    my(@s);

I needed this to get it working right:

sub GetPoolStats
{
    my($name) = $File::Find::name;
    $_ = $name;

    my($baseName) = "";
    etc etc



I moved "$_ = $name;" into the inodefind function, but it doesn't really
matter where it goes.

SiMoN




Post BackupPC_nightly taking very long 
Simon Strack writes:

sorry for not speaking up earlier, but time and wanting to check out
some things first prevented me from doing so. (Thanks Guus!)

Some months ago (about October/November), my colleague (Douglas
Thomson) and I were working on the problem of BackupPC_nightly running
for so long. We decided that the cause was all the disk seeking needed
to visit each file in turn, since the files are spread all over the disk.

My colleague wrote a C program, and patched BackupPC_nightly, and our
overnight run went from over 8 hours, to 10 minutes on an ext3 filesystem.

The way it works is to scan the pool directory (directories only!), and
then process each file in inode number order. The presumption is that
inodes are grouped together on the disk, and so less disk seeking is needed.

Attached are a C program, an example perl program, and a patched
BackupPC_nightly (without all the recent speed-ups pasted on this list
or cvs).

The C program just reads directory info and spits it out, but that dir
info includes the inode number. A couple of shell commands sorts things
into inode order, and the simple find function just visits each file in
turn with the standard visit function.

Very impressive improvement! Unfortunately the readdir() function
in perl doesn't return the inode number, just the file name, so
you are correct that an external C program is needed.

For 2.1.0beta1 I have made three improvements to BackupPC_nightly:

  * Improved stat() usage in BackupPC_nightly, plus some other cleanup,
    giving a significant performance improvement. Patch submitted by
    Wayne Scott.

  * Allow several BackupPC_nightly processes to run in parallel based
    on the new $Conf{BackupPCNightlyJobs} setting. This speeds up the
    traversal of the pool, reducing the overall run time for
    BackupPC_nightly.

  * Allow BackupPC_nightly to split the pool traversal across several
    nightly runs. This improves the running time per night, at the expense
    of a slight increase in disk storage as unused pool files might not
    be deleted for a couple of days. Controlled by the new config setting
    $Conf{BackupPCNightlyPeriod}.

In combination these should allow BackupPC_nightly to run in a
reasonable time.

Because of these changes, your patch will be a little harder to
incorporate, but it's not too bad. To allow the pool scanning to
be broken up (either across multiple processes or multiple nights),
the find() is now called on each of the 256 subdirectories (0/0
through f/f). A given BackupPC_nightly (depending upon the
config settings) will now call find() on a subset of these
directories.
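
As a rough illustration of how the 256 subdirectories could be divided up, here is a Python sketch. The round-robin scheme and the names pool_subdirs and subdirs_for are assumptions made for this example only; the actual assignment in 2.1.0beta1 may well differ.

```python
def pool_subdirs():
    """The 256 first-level pool subdirectories, "0/0" through "f/f"."""
    hexdigits = "0123456789abcdef"
    return [f"{a}/{b}" for a in hexdigits for b in hexdigits]


def subdirs_for(night, period, job, jobs):
    """Subdirectories one process should scan on one night.

    night  -- running count of nights (0, 1, 2, ...)
    period -- number of nights over which the whole pool is covered
    job    -- index of this process, 0 <= job < jobs
    jobs   -- number of processes running in parallel each night
    """
    dirs = pool_subdirs()
    # First pick tonight's share of the pool, then split it among the jobs.
    tonight = [d for i, d in enumerate(dirs) if i % period == night % period]
    return [d for i, d in enumerate(tonight) if i % jobs == job]
```

For example, with a period of 4 nights and 2 jobs, each process scans 32 subdirectories per night, and every subdirectory is visited exactly once in each 4-night cycle.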

Your patch will provide the biggest improvement when only a
single BackupPC_nightly is running.

When I release 2.1.0beta1 it would be great for you and Doug to think
about how to include your sorted-inode idea. I'm happy to discuss
this further. Also, I assume deleting a backup tree would be faster
if done in inode order too (although I don't see how to traverse just
the directories without stat()ing every file, which kind of defeats
the purpose). Perhaps this could be turned into a more general,
optional accelerator for BackupPC_nightly, BackupPC_trashClean, and
maybe even filling of backups.

However, I feel this is really only a temporary hack.

I think the proper way of handling this is to do reference counting.
If a database of the files in the pool was kept (probably db, or maybe
sqlite), a simple db query would give a list of the files to clean up.

Maintaining consistency between the filesystem and the db would probably
be the next trick. But it should be possible to do that in parallel to
normal operations, rather than locking everything up. (Scan filesystem
for candidate files for deletion in parallel, then check them against
the db.)

I've thought about doing all of this in an SQL database, but the
amount of data seems quite large. The existing file system, and
hardlink reference counts, provides an efficient and high performance
solution. If the reference counting moved to a database then there
would be no need for BackupPC_nightly to scan the pool. SQL provides
the relevant atomic operations to avoid the race conditions that
currently require BackupPC_nightly to run while no other backups are
running (in particular, BackupPC_nightly could be deleting a pool
file with 1 link, just at the same time BackupPC_dump is adding a
link to it).
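
A small Python/sqlite3 sketch of the kind of atomic operation meant here. The pool_refs table and both function names are hypothetical, invented for this example. The point is that the check ("no references left") and the claim ("this file is mine to delete") happen in a single statement, so a concurrent writer that has already raised the count cannot be overtaken.

```python
import sqlite3


def make_refs_table(db):
    """Create the hypothetical reference-count table used in this sketch."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pool_refs ("
        " pool_file TEXT PRIMARY KEY,"
        " refs INTEGER NOT NULL)"
    )


def claim_for_deletion(db, pool_file):
    """Remove the row only if nothing references the file.

    Returns True when the row was removed, meaning the caller may now
    unlink the pool file. Returns False when the file is still referenced
    or was already claimed by someone else.
    """
    with db:  # commit on success, roll back on error
        cur = db.execute(
            "DELETE FROM pool_refs WHERE pool_file = ? AND refs = 0",
            (pool_file,),
        )
        return cur.rowcount == 1
```

A dump process wanting to add a link would do the mirror image: increment refs in its own transaction and treat a missing row as "pool file gone, store a fresh copy".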

Another way of looking at it is to see it as a kind of garbage
collection problem. Languages seem to have that under control,
so there may be something to learn from them.

When a backup is deleted, BackupPC_trashClean knows what files are
being deleted. So just those inodes are candidates for the reference
count check. But, unfortunately, BackupPC_trashClean cannot easily
determine the pool file name from the backup file name, without
recomputing the MD5 checksum, which is expensive.

Craig



Post BackupPC_nightly taking very long 
From: Craig Barratt <cbarratt < at > users.sourceforge.net>
Very impressive improvement! Unfortunately the readdir() function
in perl doesn't return the inode number, just the file name, so
you are correct that an external C program is needed.

Agreed.

Does anyone know if deleting files in inode order helps
reiserfs? I know that inodes are "made up" on reiser, but I suspect
they are related to the file hash so it still might help since the
files are indexed by an ordered tree.

I was hoping that 'ls -i' or find(1) might be used to find the inodes
without stat'ing every file, but it seems that neither utility has
that optimization. So yes, it needs to be a C function. It makes
packaging BackupPC very annoying if only a small part of it needs C.

-Wayne

Post BackupPC_nightly taking very long 
On Fri, 2004-04-09 at 17:03, Wayne Scott wrote:
From: Craig Barratt <cbarratt < at > users.sourceforge.net>
Very impressive improvement! Unfortunately the readdir() function
in perl doesn't return the inode number, just the file name, so
you are correct that an external C program is needed.

Agreed.

Does anyone know if deleting files in inode order helps
reiserfs? I know that inodes are "made up" on reiser, but I suspect
they are related to the file hash so it still might help since the
files are indexed by an ordered tree.

Yes it works on reiser. It went from 7 hours to about 55 minutes on my
setup.

-Wayne

Regards,

Guus
