locating 'existing' files... or just how to report all hardlinked files
Post locating existing files... or just how to report all hardlin 
Dear All,

My first thought was to ask all of you whether such a feature is possible
or already available in BackupPC.

I love BackupPC's ability to detect duplicate files and save the space they
would use by hard-linking them to entries in the pool...

What I want is the ability to see, for each file in a backup or in the
pool, where else that file appears. That way I could track the space wasted
on the machines by duplicate copies of the same files.

I couldn't find that feature in BackupPC, and it seems that BackupPC on its
own doesn't record which files an entry in cpool/ corresponds to...
everything must be done by tracking the number of hardlinks (I'm sorry if
I'm mistaken).

So to do what I want, I decided to adapt a beautiful Linux tool, locate
(or, to be more precise, its updatedb script), so that it records the inode
of each file and makes it possible to query by inode number, giving me the
list of all hardlinks to the same inode...
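
For readers who want to see the idea in code, here is a minimal Python sketch of such an inode index. It is not the modified updatedb script described above; the function names and the one-line-per-file "INODE PATH" text format are assumptions made purely for illustration.

```python
import os


def write_inode_index(root, index_file):
    """Walk root once and write one "INODE PATH" line per file.

    Assumptions: everything under root lives on one filesystem (inode
    numbers are only unique per filesystem) and no file name contains
    a newline.
    """
    with open(index_file, "w", encoding="utf-8", errors="surrogateescape") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # lstat describes the directory entry itself, never a symlink target
                out.write(f"{os.lstat(path).st_ino} {path}\n")


def query_inode_index(index_file, path):
    """Return every indexed path that shares the inode number of path."""
    wanted = str(os.lstat(path).st_ino)
    hits = []
    with open(index_file, encoding="utf-8", errors="surrogateescape") as fh:
        for line in fh:
            inode, _, indexed_path = line.rstrip("\n").partition(" ")
            if inode == wanted:
                hits.append(indexed_path)
    return sorted(hits)
```

The expensive walk happens once in write_inode_index; each later lookup only reads the index file, which is the same trade-off locate makes with its database.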

Now it is just a question of putting it into the BackupPC CGI, which
shouldn't be a problem. But I decided to ask first: does anyone find such a
feature useful? Is there a better solution?



--
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers
Office (973) 353-5440 x263
Ph.D. Student CS Dept. NJIT
Key http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8



-------------------------------------------------------
This SF.Net email is sponsored by OSTG. Have you noticed the changes on
Linux.com, ITManagersJournal and NewsForge in the past few weeks? Now,
one more big change to announce. We are now OSTG- Open Source Technology
Group. Come see the changes on the new OSTG site. www.ostg.com
_______________________________________________
BackupPC-users mailing list
BackupPC-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/backuppc-users
http://backuppc.sourceforge.net/

Post locating existing files... or just how to report all hardlin 
I think this is what you are looking for:
http://www.geocities.com/fcheck2000/finddups.html

-Wayne



Post locating existing files... or just how to report all hardlin 
That is a good one, but we don't really need CRC or md5 checks to find all
the hardlinks within one filesystem... Also, locate later gives me a
blazingly fast report on a specific file, so I don't need to run finddups
again. Sure, I could store the result of finddups and then just search
within it, but I believe finddups on a 0.5TB drive would take quite a long
time, which is unnecessary for the BackupPC application.
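
To make the distinction concrete, here is a small Python sketch (the function names are made up for illustration). Two hard links can be recognised from file metadata alone, while a content comparison has to read both files in full.

```python
import hashlib
import os


def same_inode(a, b):
    """True if a and b are hard links to one file (metadata lookup only)."""
    sa, sb = os.lstat(a), os.lstat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


def md5_of(path, chunk_size=1 << 20):
    """Hash the whole file; the cost grows with the file size."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(a, b):
    """True if a and b hold identical bytes, hard-linked or not."""
    return md5_of(a) == md5_of(b)
```

A separate copy of a file passes same_content but fails same_inode. Inside the BackupPC tree identical files are already hard-linked to the pool, so the cheap inode test is enough there.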


On Thu, Aug 05, 2004 at 01:48:51PM -0500, Wayne Scott wrote:

> I think this is what you are looking for:
> http://www.geocities.com/fcheck2000/finddups.html
>
> -Wayne

--
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers
Office (973) 353-5440 x263
Ph.D. Student CS Dept. NJIT
Key http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8




Post locating existing files... or just how to report all hardlin 
Yaroslav Halchenko writes:

> What I wanted to have is the ability to see for each file in the backup
> or in the pool, where else that file appears? This way I would be able
> to track waste of the space on the machines due to duplicate copies of
> the same files.
>
> So I couldn't find that feature in backuppc and it seems that backuppc
> on its own doesn't save information on to which files the entry in
> cpool/ corresponds... everything must be done via tracking of
> number of hardlinks (I'm sorry if I'm mistaken).
>
> So to do what I want I decided to adjust beautiful Linux tool --
> locate, or to be more precise - updatedb script, so it does include
> information about inode of each file, and then make it possible to
> perform query on inode number of the file, thus giving me the list of
> all hardlinks to the same node...
>
> Now the question is just of putting it in CGI of backuppc which
> shouldn't be a problem. But I've decided to ask first - does anyone find
> such feature useful? is there a better solution??

Unfortunately the current pool/hardlink structure makes it
very expensive to find all files that have the same inode
(ie: identical files). I don't know of a faster approach
than finding the inode of the file and then searching the
entire pc directory tree for that inode:

ls -li /data/BackupPC/pc/HOST/123/fx/fy/fz
find /data/BackupPC/pc -inum 123456789 -print

where 123456789 is the inode number displayed by ls -li.
That's pretty expensive.

Craig
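
For reference, the two commands above amount to the following Python sketch: look up the inode of one file, then scan the whole tree for directory entries carrying the same number. The function names are assumptions for illustration. Like find -inum, the scan compares inode numbers only, so it assumes the tree sits on a single filesystem.

```python
import os


def find_by_inode(root, inode):
    """Yield every non-directory path under root with the given inode number."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.inode() == inode:
                    yield entry.path


def paths_sharing_inode(root, path):
    """Roughly: ls -li path, then find root -inum <that number>."""
    return sorted(find_by_inode(root, os.lstat(path).st_ino))
```

Every call visits every directory entry under root, which is why this gets slow on a large pc tree.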



Post locating existing files... or just how to report all hardlin 
On Thu, Aug 05, 2004 at 11:12:55PM -0700, Craig Barratt wrote:
> Unfortunately the current pool/hardlink structure makes it
> very expensive to find all files that have the same inode
> (ie: identical files). I don't know of a faster approach
> than finding the inode of the file and then searching the
> entire pc directory tree for that inode:
>
> ls -li /data/BackupPC/pc/HOST/123/fx/fy/fz
> find /data/BackupPC/pc -inum 123456789 -print
>
> where 123456789 is the inode number displayed by ls -li.
> That's pretty expensive.

Yes, that is expensive, and that is why I suggested the 'locate' approach:
index all such finds once, then use the existing locate program plus a
wrapper to do such lookups in 1-7 seconds. I've tested it on our pool
repository of 200GB from 8 nodes with around 5 backups each... works like a
charm ;-)


backuppc < at > sink:~$ time ./ilocate cpool/3/b/f/3bf2ac4a1407590a535b63cce0935185
5017508 cpool/3/b/f/3bf2ac4a1407590a535b63cce0935185
5017508 pc/ravana/10/f%2fraid/fhome/fmatsuka/fmatwork/fs2all.mat
5017508 pc/ravana/10/f%2fraid/fresearch/fhaxby/fhaxbydata/fs2all.mat
5017508 pc/ravana/3/f%2fraid/fhome/fmatsuka/fmatwork/fs2all.mat
5017508 pc/ravana/3/f%2fraid/fresearch/fhaxby/fhaxbydata/fs2all.mat

real 0m5.835s
user 0m2.697s
sys 0m0.068s

The first number is the inode number... And now I know that the guy probably
copied that huge file from /research to his home directory...
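
A report like the one above can be turned back into client-side paths with a short Python sketch. It assumes BackupPC's name mangling works the way these paths suggest: an "f" prefix on every component and %xx hex escapes, so f%2fraid is the share /raid. The function names are hypothetical and not part of ilocate or BackupPC.

```python
import re
from collections import defaultdict


def _unmangle(component):
    """Undo the assumed mangling of one path component."""
    if component.startswith("f"):
        component = component[1:]
    return re.sub(r"%([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)), component)


def original_path(pc_path):
    """Turn pc/HOST/NUM/fSHARE/fdir/ffile into (host, backup number, client path)."""
    parts = pc_path.split("/")
    if len(parts) < 5 or parts[0] != "pc":
        raise ValueError(f"not a pc/ backup path: {pc_path!r}")
    host, backup = parts[1], int(parts[2])
    share, *rest = [_unmangle(p) for p in parts[3:]]
    return host, backup, share.rstrip("/") + "/" + "/".join(rest)


def client_locations(report_lines):
    """Group the distinct client-side paths of an "INODE PATH" report by host."""
    by_host = defaultdict(set)
    for line in report_lines:
        _inode, _, path = line.strip().partition(" ")
        if path.startswith("pc/"):  # skip the cpool/ entry itself
            host, _backup, client_path = original_path(path)
            by_host[host].add(client_path)
    return {host: sorted(paths) for host, paths in by_host.items()}
```

Fed the five report lines above, client_locations collapses the two backups (3 and 10) into two distinct locations on ravana: /raid/home/matsuka/matwork/s2all.mat and /raid/research/haxby/haxbydata/s2all.mat.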


--
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers
Office (973) 353-5440 x263
Ph.D. Student CS Dept. NJIT
Key http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8

Post locating existing files... or just how to report all hardlin 
I'm not sure about applying it to BackupPC_nightly, but here is what
I've got :-) I've sketched a webpage just because it might be useful
for someone else as well.

http://www.onerussian.com/Linux/ilocate/ilocate.phtml

Sincerely
Yarik


On Fri, Aug 06, 2004 at 10:33:22AM -0400, Leon Letto wrote:
> That is pretty cool. Better to be 8 seconds than 20 minutes. Can you point
> us to any pages on the internet that tell how to set up the indexes and any
> other parameters needed to get this to work?
>
> If it's easy to set up or can be scripted, it might be a great way to get
> BackupPC_nightly to run a lot faster.
>
> Thanks,
>
> Leon
>
> -----Original Message-----
> From: Yaroslav Halchenko [mailto:yoh < at > psychology.rutgers.edu]
> Sent: Friday, August 06, 2004 8:31 AM
> To: Craig Barratt
> Cc: backuppc-users < at > lists.sourceforge.net
> Subject: Re: [BackupPC-users] locating 'existing' files... or just how to
> report all hardlinked files


Post locating existing files... or just how to report all hardlin 
Yaroslav Halchenko writes:

> I'm not sure about applying it for BackupPC_nightly but here is what
> I've got :-) I've sketched a webpage just because it might be useful
> for someone else as well
>
> http://www.onerussian.com/Linux/ilocate/ilocate.phtml

Now that's clever! Thanks for sharing.

I'll add this to my "FAQ todo" list.

Craig


