Jim Weill
Backups fail on specific directory
February 20, 2018 11:59AM
We've had a single specific Linux server which fails its "big" backups
(level 3, 5, and full) on a regular basis because a specific directory
fills up faster than the networker daemon can keep up.  At least,
that's the error we get: "size grew during save".

I've inserted a .nsr file at the base of that directory with "Skip: *.* "
inside of that to try to get it to skip that directory altogether.
Over the weekend, during its level 5 backup, it gave the message that it
had parsed the .nsr file but still also gave a failure.  My questions
are:  how can I tell if the backup actually failed?  And is there any
other way to get it to skip that directory and return a successful
backup?  Thanks in advance.

jim


--
This list is hosted as a public service at Temple University by Stan Horwitz
If you wish to sign off this list or adjust your subscription settings, please do so via http://listserv.temple.edu/archives/emc-dataprotection-l.html
If you have any questions regarding management of this list, please send email to owner-emc-dataprotection-l@listserv.temple.edu
This message was imported via the External PhorumMail Module
Re: Backups fail on specific directory
February 20, 2018 02:59PM
In regard to: [EMC-DataProtection-L] Backups fail on specific directory,...:

> We've had a single specific Linux server which fails its "big" backups
> (level 3, 5, and full) on a regular basis because a specific directory
> fills up faster than the networker daemon can keep up. At least,
> that's the error we get "size grew during save".

"Size grew during save" isn't something that I've ever seen cause an
entire backup to fail. Usually, that's a warning that means that the
file(s) in question are changing while the backup is going on, which means
that a restore of just those files may not correctly recover them. The
rest of the backup is normally still recoverable.

> I've inserted a .nsr file at the base of that directory with "Skip: *.* "
> inside of that to try to get it to skip that directory altogether. Over the
> weekend, during its level 5 backup, it gave the message that it had
> parsed the .nsr file but still also gave a failure.

What failure? Can you provide the error message(s)?

Is what you show above as the contents of the .nsr file accurate?
I don't know if "Skip" (with a capital S) will match the "skip" ASM,
documented in uasm(8).

To understand directives, you probably want to read man pages in this order:

    nsr_directive(5)    # first, to get an intro to directives
    nsr(5)              # a lot better documentation & examples
    uasm(8)             # for the list of ASMs available and what they do

> My questions are:
> how can I tell if
> the backup actually failed?

Well, the NetWorker Administration GUI's "Monitoring" tab doesn't keep
historical information; it only has info for the most recent backups. But
if you catch it right after the backup group has run, it usually has
the log messages you're looking for.

Assuming a pretty vanilla configuration, NetWorker is probably generating
a few log files in the /nsr/logs directory on your NetWorker server. If
there's a file named "messages" in there, it probably has the summary log
messages for every client, possibly going back to the dawn of time if
you aren't doing anything to rotate or truncate that file. If you look
through that file, there should be log messages for the date & client
in question.

If there's a log file named 'daemon.raw', you can run that through the
'nsr_render_log' command-line tool to output the raw log entries into
something that's a bit closer to a traditional log file.
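
As a rough sketch of that last step: the function below renders a raw log
and keeps only the lines that mention one client. The function name
render_for_client and the NSR_RENDER_LOG override are made up for this
illustration; only nsr_render_log itself and the daemon.raw file name come
from the discussion above, and the client name is a placeholder.

```shell
#!/bin/sh
# Sketch: render NetWorker's raw daemon log and keep one client's lines.
# NSR_RENDER_LOG is a hypothetical override (handy for dry runs); when it
# is unset, the real nsr_render_log tool is used.
render_for_client() {
    raw_log=$1    # e.g. /nsr/logs/daemon.raw on the NetWorker server
    client=$2     # client name as it appears in the log (placeholder)
    "${NSR_RENDER_LOG:-nsr_render_log}" "$raw_log" | grep -F -- "$client"
}

# Example, run on the NetWorker server:
#   render_for_client /nsr/logs/daemon.raw your.client.name.here > /tmp/client-daemon.txt
```

grep -F treats the client name as a fixed string, so dots in a fully
qualified name are not interpreted as wildcards.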

> And is there any other way to get it to skip that
> directory and return a successful backup?

Sure, but it would be best to understand what the actual failure is before
giving some advice that could potentially cause you problems in the
future.

At this point, I'm still not convinced that "size grew during save" is
what's really causing the problem.

Tim
--
Tim Mooney Tim.Mooney@ndsu.edu
Enterprise Computing & Infrastructure 701-231-1076 (Voice)
Room 242-J6, Quentin Burdick Building 701-231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164


Jim Weill
Re: Backups fail on specific directory
February 20, 2018 04:59PM
I also do not believe the "size grew during save" message is the real
failure, but it's the closest thing I have to work with.  Our big backups
happen on the weekend, so I don't know that I'll ever be around when the
details are saved in the monitoring tab of the GUI.  I have the
following for the logs, though:

The messages file is 0 bytes and looks to have been gzipped/archived
back in 2015.

daemon.raw gives the same "size grew during save" message I've been
seeing, plus a pointer to look in
/nsr/logs/sg/savegroup/[number-filename].  Those files only show more of
the same:

66135:save: NSR directive file (/path/to/logs/.nsr) parsed
66135:save: NSR directive file (/path/to/logs/fqdn-client-name/.nsr) parsed

We get the savegroup logs emailed out nightly after each job finishes
cloning.  The night in question provided this:

--- Unsuccessful Save Sets ---

* fqdn-client-name:/path 66135:save: NSR directive file
(/path/to/logs/.nsr) parsed
* fqdn-client-name:/path 66135:save: NSR directive file
(/path/to_archive/logs/fqdn-client-name/.nsr) parsed
* fqdn-client-name:/path --- Job Indications ---
  fqdn-client-name:/path: retried 1 times.


Re: Backups fail on specific directory
February 20, 2018 04:59PM
In regard to: Re: [EMC-DataProtection-L] Backups fail on specific...:

> We get the savegroup logs emailed out nightly after each job finishes
> cloning.

OK, good. Not everyone does that, but the emailed backup output will
have the same info.

> --- Unsuccessful Save Sets ---
>
> * fqdn-client-name:/path 66135:save: NSR directive file (/path/to/logs/.nsr)
> parsed
> * fqdn-client-name:/path 66135:save: NSR directive file
> (/path/to_archive/logs/fqdn-client-name/.nsr) parsed
> * fqdn-client-name:/path --- Job Indications ---
>   fqdn-client-name:/path: retried 1 times.

Yeah, not much to go on there.

In your situation, I would try running a verbose backup probe of the
client in question, to see if that turns anything up.

On your backup server, it would be something like

    # as root or via 'sudo'
    savegrp -p -vvv -l 5 -c your.client.name.here 'the backup group name' 2>&1 | tee /tmp/savegrp-p-vvv.txt

You'll need to know the name of the backup group the client is in. The
'-l 5' says this should be a level 5, but the '-p' says just
probe/preview.

That will hopefully give you the additional information needed to
determine why the backup is failing.

Only do this if this particular client does NOT have any pre- or
post-client backup commands (it's not using 'savepnpc'), as they are
run even when you do a preview backup, and that might cause you problems
if done during production business hours.

Back in the old days, we used to have problems with backups failing
because of timeouts -- the save was taking a long time to walk the
filesystem and wasn't sending any data back to the server, so the
NetWorker server would abort the backup. That's more likely to happen
on incrementals though (less data changed) than on fulls or a level 5,
and it's not a phenomenon we've seen in years.

After you run the preview backup, if there isn't anything useful in the
output from savegrp, another thing you should probably do is check the
logs on the client. There should be a log file under /nsr/logs on the
client too and it may have something relevant as to why this is failing.

Tim
--
Tim Mooney Tim.Mooney@ndsu.edu
Enterprise Computing & Infrastructure 701-231-1076 (Voice)
Room 242-J6, Quentin Burdick Building 701-231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164


Jim Weill
Re: Backups fail on specific directory
February 20, 2018 04:59PM
I think this is where I point out that this particular server fails at
specific times of day -- typically the interval between midnight and
4am.  That's when the bulk of the disk i/o happens, and coincides with
the chosen time of savesets getting sent to staging to provide the least
disruption to the user experience. Running a backup during regular hours
always succeeds. Is there a way to run verbose on the saveset itself
rather than manually?

jim


Re: Backups fail on specific directory
February 20, 2018 05:59PM
In regard to: Re: [EMC-DataProtection-L] Backups fail on specific...:

> I think this is where I point out that this particular server fails at
> specific times of day -- typically the interval between midnight and 4am. 
> That's when the bulk of the disk i/o happens, and coincides with the chosen
> time of savesets getting sent to staging to provide the least disruption to
> the user experience. Running a backup during regular hours always
> succeeds. Is there a way to run verbose on the saveset itself rather
> than manually?

Sure, you can customize what command is run, per client, to do the actual
backup. It defaults to "save", but you can create a wrapper (on the
client) named something like "save-verbose.sh" that just invokes save with
all the same arguments save-verbose.sh received *plus* '-vvv'.

Check out the "Customizing the Backup Command" section of
"EMC NetWorker Version 8.2SP1 and later Administration Guide". It should
be in the first chapter, and Example 2 is probably closer to what
you want.

Keep in mind that once you have your script in place on the client,
in the same directory as the existing "save" command, you need to update
the "Backup command" setting of the client. It's on the "Apps and
Modules" tab of the client config within the NetWorker Administrator GUI.
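
A minimal sketch of such a wrapper, under a few assumptions: the script is
installed as "save-verbose.sh" next to the real "save" binary, and save
accepts repeated -v flags. The SAVE_BIN variable and the run_verbose_save
function name are invented for this illustration and are not part of
NetWorker. Check it against the Administration Guide's example before
relying on it.

```shell
#!/bin/sh
# save-verbose.sh: hypothetical wrapper that passes every argument it
# receives on to the real "save" command, with -vvv added in front.
run_verbose_save() {
    # SAVE_BIN is an illustrative override; by default, use the "save"
    # binary that sits in the same directory as this script.
    save_bin=${SAVE_BIN:-$(dirname "$0")/save}
    "$save_bin" -vvv "$@"
}

# Only act as the backup command when invoked under the wrapper's own name.
case ${0##*/} in
    save-verbose.sh) run_verbose_save "$@" ;;
esac
```

Because the case statement is the last command, the script exits with
whatever status save returned, so the server still sees a failed save as
a failure.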

Tim

--
Tim Mooney Tim.Mooney@ndsu.edu
Enterprise Computing & Infrastructure 701-231-1076 (Voice)
Room 242-J6, Quentin Burdick Building 701-231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164


Jim Weill
Re: Backups fail on specific directory
March 05, 2018 08:59AM
I turned on verbosity for the savegroup within the NMC instead. And I
think my problem was putting the .nsr file *inside* the directory to
skip all the files, rather than one level up to skip the whole
directory.  This weekend's level 3 backup did not fail, and the verbose
logs show me that the logs path did not get traversed.  Will have to
check this next weekend to be sure, but I think it might be working now.

jim
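
For anyone finding this thread later, the placement described above can be
sketched as follows. The helper name write_skip_directive is made up, and
the lowercase "skip: name" directive form is an assumption based on the
nsr(5) and uasm(8) man pages mentioned earlier in the thread, so verify it
against your NetWorker version.

```shell
#!/bin/sh
# Sketch: put the directive file in the PARENT of the directory to skip,
# naming that directory, instead of putting "skip: *" inside it.
write_skip_directive() {
    parent_dir=$1    # directory that contains the one to skip
    skip_name=$2     # name of the subdirectory to skip, e.g. "logs"
    # Note: this overwrites any existing .nsr file in parent_dir.
    printf 'skip: %s\n' "$skip_name" > "$parent_dir/.nsr"
}

# Example with the placeholder paths from this thread:
#   write_skip_directive /path/to logs    # writes /path/to/.nsr
```

A verbose run, as used above, is the easiest way to confirm that the
skipped path is no longer traversed.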

