Server lost connection problem
Post Server lost connection problem 
About two weeks ago, I upgraded my NetWorker server from 7.6.1 to NetWorker 7.6.3.4.Build.879. This server backs up 336 clients (mostly Windows and Linux). All of the clients back up to a Data Domain system using Boost and a few are cloned nightly to LTO-5 tape. I have 11 Boost devices configured for direct use on the server and each Boost device has its max sessions set to a value of 10. No storage nodes are involved in this data zone.

After we upgraded our DD system to the latest OS, throughput improved for backups of the larger servers, but for the past few days I have been seeing an unusual number of backup failures across several groups of clients, both Linux and Windows, including some that are also running NetWorker 7.6.3. The error in the savegroup report is always the same: "connection dropped."

There do not appear to be any problems with network connectivity, and in most cases these clients do not back up through a firewall. I do not see any errors on the clients or the NetWorker server when I use "netstat -i." Incremental backups of the same clients also work without issue.

I reviewed the NetWorker tuning guide on PowerLink, but I haven't yet run any of the uasm tests it recommends. It did contain a recommendation to increase client parallelism to 12, which I applied a few minutes ago. Most of the clients had the default setting of 4 for their parallelism.

If anyone has any ideas on how to investigate this problem, please let me know. I am skeptical that doing tests with uasm will bring forth any enlightenment on this issue.

Post Server lost connection problem 
Hello,
At first glance you would think it'd be a timeout issue that adjusting the keepalive values would fix. Since you just updated to a new DD OS, you may want to see whether any errors are being reported on the Data Domain side. Also, see whether the errors happen at a particular time. For example, are there clients that kick off at 5pm, the error happens at 6pm, and all the client backups that were running at that time cancel with the "connection dropped" error? You didn't mention anything about the Data Domain devices going offline, but if the NetWorker server were having trouble talking to the Data Domain, your devices should go offline. If you have "auto media management" enabled on the DD devices, though, NetWorker will attempt to bring them back online.
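One way to check whether the failures cluster at a particular time is to bucket them by hour of day. The sketch below is only an illustration: the client names and timestamps are invented, and in practice the records would have to be copied by hand (or parsed) from the savegroup completion reports.

```python
from collections import Counter
from datetime import datetime

# Hypothetical failure records: (client name, time the "connection dropped"
# error was reported). These values are made up for illustration only.
failures = [
    ("client-a", "2012-06-18 18:02"),
    ("client-b", "2012-06-18 18:05"),
    ("client-c", "2012-06-19 02:41"),
    ("client-d", "2012-06-18 18:07"),
]

def failures_by_hour(records):
    """Count failures per hour of day (0-23)."""
    hours = Counter()
    for _client, stamp in records:
        hours[datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour] += 1
    return hours

by_hour = failures_by_hour(failures)
for hour, count in sorted(by_hour.items()):
    print(f"{hour:02d}:00  {count} failure(s)")
```

A tall spike in one hour would point toward something scheduled (a group start, a network job, a maintenance window). An even spread would fit better with a per-connection cause.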

I would think that increasing the client parallelism would add to the problem instead of helping. Increasing the client parallelism will cause you to have more streams going to the dd box, which may slow your backup down. If any parallelism needs to be adjusted, I would do it at the group level and not the client level. Is it possible you have your group parallelism set to 0 and some of the clients are waiting on resources and timing out?
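To put rough numbers on the parallelism concern, here is a back-of-the-envelope calculation using the figures from the original post. It is a deliberately simplified sketch: it assumes every client actually runs at its full parallelism, and it ignores server parallelism and group parallelism, both of which also limit concurrency in a real data zone.

```python
# Figures quoted in the original post.
boost_devices = 11
max_sessions_per_device = 10
old_client_parallelism = 4
new_client_parallelism = 12

# Upper bound on concurrent sessions the Boost devices will accept.
device_session_capacity = boost_devices * max_sessions_per_device

def clients_to_fill(capacity, client_parallelism):
    """Clients running at full parallelism needed to use up the capacity (rounded up)."""
    return -(-capacity // client_parallelism)

before = clients_to_fill(device_session_capacity, old_client_parallelism)
after = clients_to_fill(device_session_capacity, new_client_parallelism)

print(f"device session capacity: {device_session_capacity}")
print(f"clients needed to fill it at parallelism {old_client_parallelism}: {before}")
print(f"clients needed to fill it at parallelism {new_client_parallelism}: {after}")
```

Under these assumptions, far fewer simultaneously active clients are needed to occupy every device session at the higher setting, so the remaining clients would be more likely to sit waiting for a free session.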

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER < at > LISTSERV.TEMPLE.EDU] On Behalf Of Stanley R. Horwitz
Sent: Tuesday, June 19, 2012 10:11 AM
To: NETWORKER < at > LISTSERV.TEMPLE.EDU
Subject: [Networker] Server lost connection problem

Post Server lost connection problem 
Hi Chester,

This seems to occur at different times of the day and night. I agree that increasing client parallelism doesn't make much sense, but perhaps it is one of those counterintuitive situations. The savegroup parallelism is set to 10 for each savegroup. Auto media management is not enabled on my DD Boost devices, but nothing in the logs on the NetWorker server suggests a problem in that regard. I am going to ask my SAN manager to look at the DD system to ascertain whether it is in good health, but the daily health report emails I get from it do not indicate any sort of problem.

On 06 19, 2012, at 1:39 PM, Chester Martin wrote:

Post Server lost connection problem 
If it happens multiple times during the night and the communication between NetWorker and the DD is not going down, that would point to a client timeout issue. But if this started after upgrading the DD OS, I would think NetWorker has a problem talking with the new DD OS. How's the health of the NetWorker server? Backing up that many clients and handling their indexes puts a fair load on it if it's not a beefy box.

I also noticed something in my last post that I need to clear up. When I said "Increasing the client parallelism will cause you to have more streams going to the dd box, which may slow your backup down," that is not entirely true the way I worded it. I didn't mean that having more streams going to the dd box will slow your backup down, but that having more streams coming out of the client will slow your backup down. My fingers can't type what my mind is telling them. :)

-----Original Message-----
From: Stanley R. Horwitz [mailto:stan < at > temple.edu]
Sent: Tuesday, June 19, 2012 12:58 PM
To: EMC NetWorker discussion; Chester Martin
Subject: Re: [Networker] Server lost connection problem

Post Server lost connection problem 
This issue turned out to be the result of many clients being mistaken for intruders as a result of a recent update in our intrusion detection system.
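For anyone chasing a similar symptom, a quick TCP connection probe run from an affected client can help show whether something in the path is blocking it. This is a minimal sketch using only Python's standard library and is not a NetWorker tool. The demonstration runs against a throwaway listener on the loopback interface so that it is self-contained.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Self-contained demonstration against a temporary local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = can_connect("127.0.0.1", demo_port, timeout=2.0)
listener.close()
reachable_after_close = can_connect("127.0.0.1", demo_port, timeout=2.0)

print(reachable_while_listening, reachable_after_close)
```

In real use you would call something like can_connect("backup-server.example.com", 7937) from a client that is failing. The hostname here is hypothetical, and 7937 is only assumed to be the port your NetWorker server listens on, so confirm it against your own configuration. A successful probe does not rule out an intrusion detection system, since one may allow the connection to open and drop it mid-transfer, but a failed probe from an otherwise healthy client is a strong hint.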
