| Author |
Message |
Arbogast, Warren K
Guest
|
 Slow backup
There is a Linux fileserver here that serves web content. It has 21 million files in one filesystem named /ip. There are over 4,500 directories at the second level of the filesystem. The server is running the 6.3.0.0 client, and has 2 virtual cpus and 16 GB of RAM. Resourceutilization is set to 10, and currently there are six client sessions running. I am looking for ways to accelerate the backup of this server since currently it never ends.
The filesystem is NFS mounted so a journal based backup won't work. Recently, we added four proxy agents, and are splitting up the one big filesystem among them using include/exclude statements. Here is one of the agents' include/exclude files.
exclude /ip/[g-z]*/.../*
include /ip/[a-f]*/.../*
Since we added the proxies the proxy backups are copying many thousands of files, as if this were the first backup of the server as a whole. Is that expected behavior?
Recently, the TSM server database is growing faster than it usually does, and I'm wondering whether there could be any correlation between the ultra long running backup, many thousands of files copied, and the faster pace of the database growth.
The four proxies haven't made a big difference in the run time of the backup. Could something else be done to speed it up?
Thank you,
Keith Arbogast
Indiana University
|
| Mon Aug 06, 2012 12:15 pm |
|
 |
Huebner,Andy,FORT WORT...
Guest
|
 Slow backup
You said the file system is NFS mounted. Does that mean there is a NAS server?
To go fast with millions of files you either have to journal or do a block level backup.
If the file system is owned by a dedicated NAS device NDMP may be the answer. If the file system is owned by an OS that can journal then that is an option. If the file system is owned by a virtual server then an image backup from the host is an option.
Andy Huebner
|
| Mon Aug 06, 2012 12:59 pm |
|
 |
Skylar Thompson
Guest
|
 Slow backup
If you look at the scheduler logs on the proxy agents, do you see a lot
of file retries? If you do, it might help to use a non-shared copy
serialization (static or dynamic, rather than shared static or shared
dynamic) on your copy groups. That'll save a bunch of
stat(2) calls from dsmc. stat over NFS is very expensive, because many
implementations will trigger a cache flush on the server for that file
to get an accurate block count. Switching the copy serialization reduced
some of our proxy node-based backup times by 10x.
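A quick way to check for retries is to count them in each agent's scheduler log. This is only a sketch: count_retries is a made-up helper name, and it assumes that retried objects are logged on lines containing the word "Retry", so check that wording (and the log location) against your own dsmsched.log before trusting the number.

```shell
# count_retries LOGFILE: print how many lines in LOGFILE mention a retry.
# Assumption: the client logs each retried object on a line containing "Retry".
count_retries() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps that from aborting a script that runs under "set -e".
    grep -c 'Retry' "$1" || true
}
```

Run it as count_retries followed by the path to the scheduler log on each proxy agent; a count that is large relative to the number of files backed up suggests the serialization change is worth trying.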
-- Skylar Thompson (skylar2 < at > u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
|
| Mon Aug 06, 2012 2:12 pm |
|
 |
Allen S. Rout
Guest
|
 Slow backup
On 08/06/2012 04:12 PM, Arbogast, Warren K wrote:
> The filesystem is NFS mounted so a journal based backup won't
> work. Recently, we added four proxy agents, and are splitting up the
> one big filesystem among them using include/exclude statements. Here
> is one of the agent's include/exclude files.
> exclude /ip/[g-z]*/.../*
> include /ip/[a-f]*/.../*
... You say "proxy agents", but it's not clear what you mean by this.
> Since we added the proxies the proxy backups are copying many
> thousands of files, as if this were the first backup of the server
> as a whole. Is that expected behavior?
I see two possible arrangements you might have implemented.
Possibility one: where you had BIGFS-NODE which was taking a long
time, you now have BIGFS-NODEAF with the config above, and
BIGFS-NODEGL which is defined to include G through L, etc.
In this case, each BIGFS subnode will need to re-back-up its initial
incremental, and thereafter you should see normal change rates.
I will note that this is unlikely to accelerate your wall-clock time;
if you've got resourceutilization 10, you've probably got 5+ threads
walking the FS, you've probably moved your bottleneck to IOPS on your
NAS as it tries to pull the metadata to satisfy the FS walk. 20
threads won't do that faster.
Possibility two: you have four agents with different include/exclude
statements but using the same BIGFS node.
In this case you are running a charlie-foxtrot formation. One process
is backing stuff up while another is expiring the very same stuff. If
you're doing this, you should stop it immediately, go back to one
agent, and complete an incremental, because you have your backups in
an indeterminate state.
> Recently, the TSM server database is growing faster than it
> usually does, and I'm wondering whether there could be any
> correlation between the ultra long running backup, many thousands of
> files copied, and the faster pace of the database growth.
This symptom is what makes me think you're doing the latter. If one
process is adding stuff while another throws it away, the result is a
rapidly growing tail of inactive versions, capped by (I think)
VERDELETED. I don't know off the top of my head if an excluded file
is retained at VEREXISTS or VERDELETED. Interesting.
- Allen S. Rout
|
| Tue Aug 07, 2012 6:57 am |
|
 |
Richard Sims
Guest
|
 Slow backup
Keith - Just a few thoughts you may have entertained…
In that it's a network file system, network throughput may be a factor, where gigabit ethernet and jumbo frames may help. Network statistics may reveal periodic packet losses and retransmissions which slow things down: the 'nfsstat' command may illuminate issues. Various NFS performance guides may help, such as http://nfs.sourceforge.net/nfs-howto/ar01s05.html.
The file system architecture may be a contributor, depending upon file system type (ext2, ext3, ext4) as well as RAID configuration. File system architectures with flat directory data structures and high populations can be a drag. (File system design evolution has made for performance improvements.) Try some file system traversals, natively on the hosting system and then comparatively over NFS, to see if there's an issue there. Heavy file system activity by applications may cause contention, and can be more problematic if that access is over NFS as well. You might also try traversals within various file system sections to possibly identify a sluggish area where there are very large directory populations or heavy activity.
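One rough way to make that comparison is to time the same walk once on the host that owns the file system and once on the NFS client. The helper below is a sketch, not a tuned benchmark: time_walk is a hypothetical name, and date +%s only gives whole seconds, which is coarse but adequate when a walk takes minutes or hours.

```shell
# time_walk DIR: walk DIR without crossing mount points and
# print "<entries> <elapsed seconds>".
time_walk() {
    start=$(date +%s)
    count=$(find "$1" -xdev | wc -l)
    end=$(date +%s)
    # $((count)) strips the padding some wc implementations add.
    echo "$((count)) $((end - start))"
}
```

Point it at one subtree at a time (for example a single directory under /ip) to look for sluggish sections. A second run over the same tree is usually faster because of caching, so compare first runs with first runs.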
Review your TSM accounting records or ANE session summary records in the Activity Log to get numbers on what's most contributing to TSM database activity and thus growth therein which may be slowing backups.
Note that excluding directories from backups is far more efficient than trying to exclude by all files in the directory.
Richard Sims
|
| Tue Aug 07, 2012 7:07 am |
|
 |
Arbogast, Warren K
Guest
|
 Slow backup
Allen,
Thank you for the suggestions.
By 'proxy agent' I mean they are authorized to do backups on behalf of the target server.
We are doing possibility one, in your set of cases, with four agents. I kept the example simple for readability, but perhaps some clarity was lost.
It seems we may just need to endure slow backups until the entire filesystem has been copied again by the proxy agents.
You say, "I will note that this is unlikely to accelerate your wall-clock time;
if you've got resourceutilization 10, you've probably got 5+ threads
walking the FS, you've probably moved your bottleneck to IOPS on your
NAS as it tries to pull the metadata to satisfy the FS walk. 20
threads won't do that faster."
Are you saying that reducing Resourceutilization would likely improve the throughput of the backup? Or that the "backup by proxy" plan itself is ill conceived for some other reason?
Thank you,
Keith
|
| Tue Aug 07, 2012 7:19 am |
|
 |
Allen S. Rout
Guest
|
 Slow backup
On 08/07/2012 11:10 AM, Arbogast, Warren K wrote:
> By 'proxy agent' I mean they are authorized to do backups on behalf
> of the target server.
> We are doing possibility one, in your set of cases, with four
> agents. I kept the example simple for readability, but perhaps some
> clarity was lost.
I'm being pedantic here, but your choices of vocabulary ("...on behalf
of the target server") still leave me concerned you may be
unintentionally telling one machine to store what another is
discarding. If you've got four machines with e.g.
grant proxynode target=BIGFS agent=BIGFS_AF
grant proxynode target=BIGFS agent=BIGFS_GL
and BIGFS_AF is backing up "on behalf of the target server" with the
include/excludes you've mentioned, then you are in category two.
BIGFS_AF is backing up /ip/a* (among other things) and assigning it to a
filespace named '/ip' associated with a node named BIGFS.
BIGFS_GL will be throwing away that same data, and attempting to send
'/ip/g*', which BIGFS_AF (or other siblings) will attempt to slaughter.
If you could include TSM option files (.opt; and .sys if it's there;
is this a unix or windows setup?) for 'the target server' and two of
the proxies it would completely disambiguate these cases.
> You say, "I will note that this is unlikely to accelerate your wall-clock time;
> if you've got resourceutilization 10, you've probably got 5+ threads
> walking the FS, you've probably moved your bottleneck to IOPS on your
> NAS as it tries to pull the metadata to satisfy the FS walk. 20
> threads won't do that faster."
> Are you saying that reducing Resourceutilization would likely
> improve the throughput of the backup? Or, that the "backup by proxy"
> plan itself is ill conceived for some other reason?
'Ill conceived' is too strong a term. I'd only go so far as "Possibly
not helping you any".
The key point is identifying your bottleneck, and then determining
whether your contemplated measures affect the bottleneck.
Gedankenexperiment with me:
For most large filesystem installations, i.e. millions of files, the
performance bottleneck for conventional TSM guest-level incrementals
is the turnaround time reading metadata off the filesystem to ask 'Has
this file changed?' (in your case) 21 million times. Plus a few tens
of thousands for directories.
So why is that a bottleneck? Usually it's because large filesystems
are not exceptionally high-performance stores, and are consequently
stored on biggish RAIDs of cheapish disk. Let's say you're on EMC
disk which IIRC suggests 8-spindle RAID groups, and let's further go
with the 'cheapish' disk: SATA with 80-100 IOPS.
So off a RAID group of 8 disks, you'll get 80 to 100 reads a second.
Say your 21M files occupy 21TB; you're using smallish 1TB drives, so
you've got 3 RAID groups. If your ducks are totally in a row, you can
process 300 IOPS. If you're using newer 3TB disks, then
you're down to 100. Yow.
I'll handwave over whether an IOP is required for each file; there,
we're beyond my statistical envelope-back. But if it is, then the
_expected_ wall clock time to simply ask the questions about all the
files is about 19 hours (21M reads at 300 IOPS is 70,000 seconds;
that's the 3xraid of 1T drives case).
And somewhere in there, you might want to also read some data. Also,
customers might want to use the file store for something; in the way
as always.
OK. So I fantasize that your slow performance is because of some
situation grossly similar to this.
Do you see how adding another backup reader with its own 5 FS walking
threads doesn't affect the problem? You'll still take 19 hours to do
the 21M IOPS, and you might generate more contention.
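The arithmetic behind that estimate is easy to redo with your own numbers. This sketch keeps the same simplifying assumption as the text (exactly one I/O per file, nothing else competing for the disks); walk_hours is a made-up helper name, and the IOPS figures are the guesses from this post, not measurements.

```shell
# walk_hours FILES IOPS: hours needed to issue one metadata read per file.
# Assumption: one I/O per file and no other load on the disks.
walk_hours() {
    awk -v files="$1" -v iops="$2" 'BEGIN { printf "%.1f\n", files / iops / 3600 }'
}

walk_hours 21000000 300   # three RAID groups at ~100 IOPS each: prints 19.4
walk_hours 21000000 100   # one RAID group of larger disks: prints 58.3
```

Note that the number of backup threads does not appear anywhere in the formula, which is the point: more walkers do not raise the IOPS ceiling.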
- Allen S. Rout
|
| Tue Aug 07, 2012 8:05 am |
|
 |
Roger Deschner
Guest
|
 Slow backup
We've got a similar beast. The problem is the sheer number of files,
which the client must keep a list of, in order to decide which files
need to be backed up. This is the whole idea behind TSM's Progressive
Incremental backup model. We have found that as file counts reach the
many millions, backup (and restore!) performance degrades exponentially.
One fact you must face is that a filespace with 21,000,000 files almost
cannot be restored. We tried it with 18,000,000 and it counted and
sorted its list for about a day before restoring any files - slowly. The
restore would have taken many days when we stopped it to reconsider.
There is no restore equivalent to MEMORYEFFICIENTBACKUP YES or
MEMORYEFFICIENTBACKUP DISKCACHEMETHOD. The ability to restore is
the whole point of backup, so take restore issues seriously.
The only workable answer we have found, without drastically changing the
thing you are backing up, is Virtual Mount Points. The idea is to keep
the file count in each Virtual Mount Point low enough that the client
can work efficiently.
Roger Deschner University of Illinois at Chicago rogerd < at > uic.edu
Academic Computing & Communications Center
======I have not lost my mind -- it is backed up on tape somewhere.=====
|
| Tue Aug 07, 2012 8:56 am |
|
 |
Allen S. Rout
Guest
|
 Slow backup
On 08/07/2012 12:46 PM, Roger Deschner wrote:
> There is no restore equivalent to MEMORYEFFICIENTBACKUP YES or
> MEMORYEFFICIENTBACKUP DISKCACHEMETHOD. The ability to restore is
> the whole point of backup, so take restore issues seriously.
Doesn't no-query restore help some with that? It's no panacea, to be
sure. But...
- Allen S. Rout
|
| Tue Aug 07, 2012 9:51 am |
|
 |
Arbogast, Warren K
Guest
|
 Slow backup
Allen,
I see your point, finally, that BIGFS_AF will copy files, but BIGFS_GL will remove them. Clearly, splitting the backup of one filesystem among multiple proxy nodes via include/exclude statements is futile. There are 4500+ directories under /ip, so virtualmountpoints aren't workable either.
Andy, Skylar, Richard and Roger,
Your responses are full of good advice that will take me a few days to absorb and implement. I am on a different track now than before.
With many thanks and best wishes,
Keith
|
| Tue Aug 07, 2012 11:38 am |
|
 |
Allen S. Rout
Guest
|
 Slow backup
On 08/07/2012 03:26 PM, Arbogast, Warren K wrote:
> There are 4500+ directories under /ip, so virtualmountpoints aren't
> workable either.
... Why? It's not hard, it's just big.
Envision this:
find /ip -mindepth 1 -maxdepth 1 -type d | awk '{ print "virtualmountpoint ",$1}' > /var/tmp/dsm.sys.vmp
cat /var/tmp/dsm.sys.the_rest_of_my_stuff /var/tmp/dsm.sys.vmp > /[...]path/dsm.sys
Rejigger that for however you maintain your configuration data; but the
key point is that you can mechanically produce the list of VMPs, and
separate them from the maintenance of the rest of your TSM config.
You're not going to thrash the TSM server by making 4500 mountpoints,
you'll just make your list of filespaces be REALLY LONG.
If your files are fairly evenly divided between these first-level dirs,
then you've turned a 21M file problem into 4,500 problems of a few
thousand files each, which is to say "No problem".
... Just don't collocate by filespace.
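Whether the files really are "fairly evenly divided" is worth checking before committing to this. A minimal sketch, with files_per_dir as a hypothetical helper name; it walks every first-level directory, so on a tree this size it costs about as much as one backup-style traversal, and the shell glob it uses skips dot-directories.

```shell
# files_per_dir ROOT: print "<file count> <directory>" for each first-level
# directory under ROOT, largest first.
files_per_dir() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue          # skips the unexpanded glob if ROOT has no subdirs
        n=$(find "$d" -xdev -type f | wc -l)
        echo "$((n)) ${d%/}"             # $((n)) strips wc padding; ${d%/} drops the trailing slash
    done | sort -rn
}
```

Piping the output of files_per_dir /ip through head shows the heaviest directories; any that still hold millions of files may need to be split further or handled separately.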
- Allen S. Rout
|
| Wed Aug 08, 2012 5:51 am |
|
 |
Arbogast, Warren K
Guest
|
 Slow backup
Hi Allen,
This is quite impressive. I will discuss it with the client admin.
Thank you,
Keith
|
| Wed Aug 08, 2012 6:39 am |
|
 |
Skylar Thompson
Guest
|
 Slow backup
If you want to limit the number of filespaces per node, you can also
assign some of the filespaces to other nodes, either using proxy nodes or
creating separate SERVERNAME stanzas in your dsm.sys/opt files. We do
this for some of our fileservers that are jointly owned by different
groups, so that we can charge back backup usage to the individual groups.
--
-- Skylar Thompson (skylar2 < at > u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S048, (206)-685-7354
-- University of Washington School of Medicine
|
| Wed Aug 08, 2012 9:33 am |
|
 |
|
|
All times are GMT - 8 Hours