Mysteriously failing jobs
Post Mysteriously failing jobs 
A couple of weeks ago, a problem started cropping up. Jobs started failing
with what look like network errors:

02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
append.c:259 Network error on data channel. ERR=Input/output error
02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate =
4.157 M bytes/second
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read
expected 65536 got 16384 from client:130.215.39.18:36643
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network
error with FD during Backup: ERR=No data available
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job
status returned from FD.
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3
(06Mar07): 02-Jun-2007 01:10:40

However, I can find no evidence of any actual network problem between the
machine running the fd and the one running both the sd and dir:

- The network monitoring system shows no outages, and none of the switches
and routers in between show anything out of the ordinary in the logs.

- There is no external firewall between the two systems. Both ends are Linux
2.6 with iptables, with non-stateful rules for all Bacula traffic.

- IP flow logs show that both ends of the FD -> SD TCP connection
ungracefully closed down the stream with a RST after a very short idle period
of about 10 seconds.

- I've already tried swapping to a different NIC on the server to rule out a
dying network card.

- The failure occurs on different machines, ruling out something specific to
one client, though it usually appears to affect the same one. More
specifically, it always seems to die around the same time - about ten minutes
after the batch of nightly jobs starts. I have things configured to run four
concurrent jobs, and the failures will cancel anywhere from one to four jobs.
When multiple jobs die, they all do so at the same time. I can influence
which clients get picked on by shuffling around priorities.

- Rerunning the failed job - either by itself or queued up with a bunch of
other ones - always appears to work as expected.
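
One way to check the observation that concurrent jobs die at the same moment is to group the job log by timestamp. The following is a minimal Python sketch, assuming each log message has first been joined onto a single line (the excerpt above is wrapped by the mail client) and that the line layout matches the sample shown. `LINE_RE` and `failures_by_minute` are illustrative names, not part of Bacula.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt above:
# "02-Jun 01:10 lorien-sd: <job name> Fatal error: <text>"
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}-\w{3} \d{2}:\d{2}) "
    r"(?P<daemon>[\w.-]+): "
    r"(?P<job>\S+) "
    r"(?P<level>Fatal error|Error): "
    r"(?P<text>.*)$"
)

def failures_by_minute(lines):
    """Map each timestamp to the set of job names that logged an error then."""
    grouped = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            grouped[match.group("stamp")].add(match.group("job"))
    return dict(grouped)

# Lines from the post, re-joined by hand; the second one is not an error line.
SAMPLE = [
    "02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: "
    "append.c:259 Network error on data channel. ERR=Input/output error",
    "02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, "
    "Transfer rate = 4.157 M bytes/second",
    "02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: "
    "No Job status returned from FD.",
]
```

Calling `failures_by_minute(SAMPLE)` groups both error lines under the single `02-Jun 01:10` timestamp. On a night where several jobs fail, more than one job name under the same timestamp would support the "they all die together" pattern.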

The part *really* driving me bonkers is that I can find no evidence of any
changes that coincide with the problem starting. Bacula version, kernel
version, hardware, network - nothing was changed.

If anyone has any suggestions where I could start looking, I'd love to hear them.

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC

Post Mysteriously failing jobs 
Hi,

On 6/2/2007 7:43 AM, Frank Sweetser wrote:
A couple of weeks ago, a problem started cropping up. Jobs started failing
with what look like network errors:

02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
append.c:259 Network error on data channel. ERR=Input/output error
02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate =
4.157 M bytes/second
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read
expected 65536 got 16384 from client:130.215.39.18:36643
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network
error with FD during Backup: ERR=No data available
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job
status returned from FD.
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3
(06Mar07): 02-Jun-2007 01:10:40


However, I can find no evidence of any actual network problem between the
machine running the fd and the one running both the sd and dir:

- The network monitoring system shows no outages, and none of the switches
and routers in between show anything out of the ordinary in the logs.

- There is no external firewall between the two systems. Both ends are Linux
2.6 with iptables, with non-stateful rules for all Bacula traffic.

- IP flow logs show that both ends of the FD -> SD TCP connection
ungracefully closed down the stream with a RST after a very short idle period
of about 10 seconds.

- I've already tried swapping to a different NIC on the server to rule out a
dying network card.

- The failure occurs on different machines, ruling out something specific to
one client, though it usually appears to affect the same one. More
specifically, it always seems to die around the same time - about ten minutes
after the batch of nightly jobs starts. I have things configured to run four
concurrent jobs, and the failures will cancel anywhere from one to four jobs.
When multiple jobs die, they all do so at the same time. I can influence
which clients get picked on by shuffling around priorities.

Well, this one looks difficult.

I suggest monitoring the memory usage of your server. I have experienced
problems with (usually) the DIR or (seldom) the SD using up all
available memory, which might leave the kernel unable to
allocate memory for the network stuff.

You should have something in the system log files then, I suppose.

A workaround would be to not start all your jobs at once but run them
in batches. Lowering job concurrency will not work as a job waiting for
an available slot to run will also use memory.

Also, you could try upgrading to the current development version as I
believe Kern worked on that problem. You should check the change log.

Hope you get this fixed,

Arno


--
IT-Service Lehmann al < at > it...
Arno Lehmann http://www.its-lehmann.de

Post Mysteriously failing jobs 
Arno Lehmann wrote:
Well, this one looks difficult.

At least it's not just me, then =)

I suggest monitoring the memory usage of your server. I have experienced
problems with (usually) the DIR or (seldom) the SD using up all
available memory, which might leave the kernel unable to
allocate memory for the network stuff.

That would explain why nothing visibly changed. One or two jobs simply pushed
some internal resource over the magic threshold, and triggered the memory
consumption.

You should have something in the system log files then, I suppose.

I didn't find anything that appeared related in the log files. I have a quick
and dirty system in place to monitor the memory usage of the dir and sd that
I'll run through tonight's jobs, so we'll see how that looks.

A workaround would be to not start all your jobs at once but run them
in batches. Lowering job concurrency will not work as a job waiting for
an available slot to run will also use memory.

Also, you could try upgrading to the current development version as I
believe Kern worked on that problem. You should check the change log.

I think I might at least wait until Kern releases an official beta before
trying that one out =)

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC

Post Mysteriously failing jobs 
Hi,

On 6/4/2007 6:17 PM, Frank Sweetser wrote:
Arno Lehmann wrote:
Well, this one looks difficult.

At least it's not just me, then =)

I suggest monitoring the memory usage of your server. I have experienced
problems with (usually) the DIR or (seldom) the SD using up all
available memory, which might leave the kernel unable to
allocate memory for the network stuff.

That would explain why nothing visibly changed. One or two jobs simply pushed
some internal resource over the magic threshold, and triggered the memory
consumption.

Quite possible, in my experience.

You should have something in the system log files then, I suppose.

I didn't find anything that appeared related in the log files. I have a quick
and dirty system in place to monitor the memory usage of the dir and sd that
I'll run through tonight's jobs, so we'll see how that looks.

If you need a minimal Nagios plugin - I wrote some shell script for that
purpose once :-)

A workaround would be to not start all your jobs at once but run them
in batches. Lowering job concurrency will not work as a job waiting for
an available slot to run will also use memory.

Also, you could try upgrading to the current development version as I
believe Kern worked on that problem. You should check the change log.

I think I might at least wait until Kern releases an official beta before
trying that one out =)

2.1.10 IS kind of a released beta version :-) but the next one is due
soon...

Arno

--
IT-Service Lehmann al < at > it...
Arno Lehmann http://www.its-lehmann.de

Post Mysteriously failing jobs 
Arno Lehmann wrote:
If you need a minimal Nagios plugin - I wrote some shell script for that
purpose once :-)

Oddly enough, nothing actually crashes - a handful of jobs fail, but all
subsequent ones go through just fine.

A workaround would be to not start all your jobs at once but run them
in batches. Lowering job concurrency will not work as a job waiting for
an available slot to run will also use memory.

Also, you could try upgrading to the current development version as I
believe Kern worked on that problem. You should check the change log.
I think I might at least wait until Kern releases an official beta before
trying that one out =)

2.1.10 IS kind of a released beta version :-) but the next one is due
soon...

I'll definitely give that a try if no other solutions pop up...

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC

Post Mysteriously failing jobs 
Well, I had a failure last night while I was monitoring memory usage. I had a
script snagging the output of ps -o rss for both bacula-sd and bacula-dir
every 60 seconds. Based on that, memory usage for both jumped only by a few
megs when the jobs started. The dir was around 20M, and the sd around 13M.
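
The original monitoring script was not posted. A rough Python sketch of the same idea is shown here, assuming a Linux procps `ps` (needed for the `-C` selection by command name). The function names are invented for illustration.

```python
import subprocess
import time

def parse_rss_kib(ps_output):
    """Sum the RSS values (KiB) from `ps -o rss=` output, one number per line."""
    return sum(int(token) for token in ps_output.split())

def sample_rss_kib(process_name):
    """Return the total RSS in KiB of all processes with this command name."""
    # `-o rss=` prints only the RSS column, with no header line.
    # `-C name` selects processes by command name (procps ps on Linux).
    result = subprocess.run(
        ["ps", "-o", "rss=", "-C", process_name],
        capture_output=True, text=True, check=False,
    )
    return parse_rss_kib(result.stdout)

def monitor(names=("bacula-sd", "bacula-dir"), interval=60, samples=3):
    """Print one timestamped RSS line per daemon, `samples` times in total."""
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name in names:
            print(f"{stamp} {name} rss_kib={sample_rss_kib(name)}")
        if i < samples - 1:
            time.sleep(interval)
```

Running `monitor(samples=600)` from a terminal or cron wrapper would cover ten hours at the 60-second interval mentioned above. A daemon that is not running simply reports 0, because `ps` prints nothing for it.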

I'll try to see if I can capture a failure with debug options at least on the
FD cranked up...

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC

Post Mysteriously failing jobs 
Hello,

On 6/6/2007 3:38 PM, Frank Sweetser wrote:
Well, I had a failure last night while I was monitoring memory usage. I had a
script snagging the output of ps -o rss for both bacula-sd and bacula-dir
every 60 seconds. Based on that, memory usage for both jumped only by a few
megs when the jobs started. The dir was around 20M, and the sd around 13M.

Quite sane numbers. Well, that quite certainly rules out the idea of
memory consumption causing your problems.

I'll try to see if I can capture a failure with debug options at least on the
FD cranked up...

I'm not a good debugger user, but strace might be the next thing to
try... like capturing all socket operations, or something. Perhaps you
get to know if the error is caused by the OS on one end.

Or, alternatively, using tcpdump to find if the sequence numbers get out
of sync somewhere, which would cause a RST on both ends.

Arno
--
IT-Service Lehmann al < at > it...
Arno Lehmann http://www.its-lehmann.de

Post Mysteriously failing jobs 
Arno Lehmann wrote:
I'm not a good debugger user, but strace might be the next thing to
try... like capturing all socket operations, or something. Perhaps you
get to know if the error is caused by the OS on one end.

Knowing how verbose strace can be, I'm a little hesitant to jump right to that.
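
The verbosity concern can be reduced by restricting strace to network-related system calls. The Python sketch below only builds the command line; the PID and output path are placeholders, and `strace_command` is a made-up helper name. One caveat: `trace=network` covers calls such as `connect`, `send` and `recv`, but not plain `read`/`write` on a socket, so a daemon that uses those would need them added to the filter.

```python
def strace_command(pid, output_path):
    """Build an strace argv that attaches to `pid` and logs network syscalls."""
    return [
        "strace",
        "-f",                   # follow child processes and threads
        "-tt",                  # timestamp each call with microseconds
        "-e", "trace=network",  # only network-related system calls
        "-p", str(pid),         # attach to an already running process
        "-o", output_path,      # write the trace to a file
    ]
```

The resulting list can be passed to `subprocess.run()` as root, with the PID of the running bacula-fd or bacula-sd process.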

Or, alternatively, using tcpdump to find if the sequence numbers get out
of sync somewhere, which would cause a RST on both ends.

I'll try getting a headers-only tcpdump from both ends. Hopefully that, along
with -d100 on the FD, will produce something insightful.
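
A headers-only capture of this kind could be started with a command along the following lines. This is a Python sketch that only assembles the argv; the interface name, snap length and output file are assumptions, and port 9103 is Bacula's default Storage daemon port, which may have been changed in a given setup. `tcpdump_command` is a hypothetical helper.

```python
def tcpdump_command(interface, peer_host, port, pcap_path, snaplen=96):
    """Build a tcpdump argv that saves truncated packets for one host and port."""
    return [
        "tcpdump",
        "-i", interface,      # capture interface
        "-n",                 # do not resolve addresses to names
        "-s", str(snaplen),   # keep only the first bytes of each packet
        "-w", pcap_path,      # write raw packets to a file
        "host", peer_host, "and", "port", str(port),
    ]
```

For example, `tcpdump_command("eth0", "130.215.39.18", 9103, "fd-sd.pcap")` limits the capture to the client address seen in the error log. Running the same capture on both ends makes it possible to compare what each side sent and received.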

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC

Post Mysteriously failing jobs 
Arno Lehmann wrote:

Or, alternatively, using tcpdump to find if the sequence numbers get out
of sync somewhere, which would cause a RST on both ends.

Okay, I got a tcpdump and logfile of -d1000 on the fd. I'm a little rusty
debugging TCP issues by hand, but I couldn't find anything that looked too out
of the ordinary.
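
Rather than reading the capture by hand, the RST segments can be pulled out programmatically. The sketch below uses only the Python standard library and makes several assumptions: the file is a classic pcap (not pcapng) with microsecond timestamps, the link type is Ethernet without VLAN tags, and only unfragmented IPv4 is of interest. `find_resets` is an illustrative name.

```python
import socket
import struct

TCP_RST = 0x04  # RST bit in the TCP flags byte

def find_resets(pcap_bytes):
    """Return (time, src, sport, dst, dport, seq) for each TCP segment with RST."""
    if len(pcap_bytes) < 24:
        raise ValueError("too short to be a pcap file")
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    (linktype,) = struct.unpack(endian + "I", pcap_bytes[20:24])
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    resets = []
    pos = 24
    while pos + 16 <= len(pcap_bytes):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[pos:pos + 16])
        frame = pcap_bytes[pos + 16:pos + 16 + incl_len]
        pos += 16 + incl_len
        # Need Ethernet (14) + minimal IPv4 (20) and an IPv4 ethertype.
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        ip = frame[14:]
        ihl = (ip[0] & 0x0F) * 4
        # Protocol 6 is TCP; 14 bytes of TCP header reach the flags byte.
        if ihl < 20 or ip[9] != 6 or len(ip) < ihl + 14:
            continue
        tcp = ip[ihl:]
        sport, dport, seq = struct.unpack("!HHI", tcp[:8])
        if tcp[13] & TCP_RST:
            resets.append((
                ts_sec + ts_usec / 1e6,
                socket.inet_ntoa(ip[12:16]), sport,
                socket.inet_ntoa(ip[16:20]), dport,
                seq,
            ))
    return resets
```

Typical use would be `find_resets(open("fd-sd.pcap", "rb").read())`, with the file name being a placeholder. Because it reads only the captured length of each packet, it also works on headers-only captures. Comparing the first RST on each side against the last data segment before it shows which end gave up first.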

In the logfile, the only thing that looked strange to me were these messages
(extra linebreaks added for readability):

ivanova-fd: backup.c:876 Send data to SD len=65536

ivanova-fd: message.c:606 Enter dispatch_msg type=4 msg=ivanova-fd: ERROR in
openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL
routines:SSL3_READ_BYTES:sslv3 alert bad record mac

ivanova-fd: message.c:768 DIRECTOR for following msg: ivanova-fd: ERROR in
openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL
routines:SSL3_READ_BYTES:sslv3 alert bad record mac

ivanova-fd: heartbeat.c:90 Got BNET_SIG 0 from SD

ivanova-fd: heartbeat.c:95 wait_intr=1 stop=1

ivanova-fd: backup.c:876 Send data to SD len=65536
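
The OpenSSL error string in these messages can be decoded further. The Python sketch below assumes the packed error layout used by OpenSSL releases before 3.0 (8 bits of library, 12 bits of function, 12 bits of reason), and that SSL reason codes of 1000 and above are TLS alert numbers offset by 1000. `decode_openssl_error` is an illustrative name, not an OpenSSL or Bacula function.

```python
import re

# "error:<hex code>:<library>:<function>:<reason text>"
ERR_RE = re.compile(
    r"error:(?P<code>[0-9A-Fa-f]{8}):(?P<lib>[^:]*):(?P<func>[^:]*):(?P<reason>.*)"
)

def decode_openssl_error(text):
    """Split a pre-3.0 OpenSSL error string into its numeric and text fields."""
    match = ERR_RE.search(text)
    if match is None:
        return None
    code = int(match.group("code"), 16)
    reason = code & 0xFFF
    return {
        "library": (code >> 24) & 0xFF,
        "function": (code >> 12) & 0xFFF,
        "reason": reason,
        "library_name": match.group("lib"),
        "function_name": match.group("func"),
        "reason_text": match.group("reason").strip(),
        # Assumption: SSL reasons >= 1000 encode a TLS alert number + 1000.
        "alert": reason - 1000 if reason >= 1000 else None,
    }

SAMPLE = ("openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL "
          "routines:SSL3_READ_BYTES:sslv3 alert bad record mac")
```

For the message above, this yields library 20, reason 1020 and alert 20, which is `bad_record_mac` in the TLS alert numbering. Since the error is raised while reading, it would suggest that the alert was received from the peer, that is, the SD rejected a record sent by the FD. That would point at data being altered between the FD's encryption and the SD's verification, though the log alone does not say where.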

The tcpdump and log files are at http://erwin.wpi.edu/~fs/bacula-crash/ if
anyone wants to take a closer look and see if I've missed anything. They're
about 14M total.

Anyone have any other ideas, or do I need to file a bug report on this one?

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
