Question on Replication: unsuccessful replication due to sessions terminated by admin

Posted by Bjørn Nachtwey 
Hi all,

we plan to switch from COPYPOOL to replication for keeping a second
copy of the data. Therefore we bought a new server that should become
the primary TSM/ISP server, with the old one then holding the replicas.

what we did:

we started by exporting the nodes, which worked well. But as even the
"incremental" exports took some time, we set up a replication from the
old server "A" to the new one "B". For all nodes already exported we set
up the replication the other way round: TSM "B" replicates them to TSM "A".

well, the replication jobs did not finish; some data and files were
missing as long as we replicated using a node group. Now we replicate
each node individually and it works -- for most of them :-(

When replicating the "bad" nodes from "TSM A" to "TSM B", the sessions
first hang for many minutes, sometimes even hours, and then get
"terminated - forced by administrator" (ANR0483W), e.g.:

05/13/2019 15:23:16    ANR2017I Administrator GK issued command:
REPLICATE NODE vsbck  (SESSION: 26128)
05/13/2019 15:23:16    ANR1626I The previous message (message number
2017) was repeated 1 times.
05/13/2019 15:23:16    ANR0984I Process 494 for Replicate Node started
in the BACKGROUND at 15:23:16. (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR2110I REPLICATE NODE started as process 494.
(SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR0408I Session 26184 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR0408I Session 26185 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR0408I Session 26186 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26187 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26188 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26189 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26190 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26191 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)

05/13/2019 15:24:57    ANR0483W Session 26187 for node SM283
(Linux/x86_64) terminated - forced by administrator. (SESSION: 26128,
PROCESS: 494)
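
A side calculation on the excerpt above: it already shows how long session 26187 lived before it was terminated. The following is only a sketch. The regular expression and the helper name `session_lifetimes` are made up for illustration, and it assumes the wrapped activity-log lines have been joined back into one line per message.

```python
import re
from datetime import datetime

# Two messages from the excerpt above, re-joined to one line each.
LOG = """\
05/13/2019 15:23:17    ANR0408I Session 26187 started for server SM283 (Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:24:57    ANR0483W Session 26187 for node SM283 (Linux/x86_64) terminated - forced by administrator. (SESSION: 26128, PROCESS: 494)
"""

# Timestamp (MM/DD/YYYY hh:mm:ss), message number, session number.
LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(ANR\d{4}[IWE])\s+Session (\d+) "
)

def session_lifetimes(log_text):
    """Return {session: seconds between ANR0408I start and ANR0483W end}."""
    started, lifetimes = {}, {}
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # not a per-session message
        stamp = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
        msg, session = m.group(2), int(m.group(3))
        if msg == "ANR0408I":
            started[session] = stamp
        elif msg == "ANR0483W" and session in started:
            lifetimes[session] = (stamp - started[session]).total_seconds()
    return lifetimes

print(session_lifetimes(LOG))  # {26187: 100.0}
```

So in this particular excerpt the session was terminated 100 seconds after it was started.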

on the target server we observe at that time:

13.05.2019 15:25:51 ANR8213E Socket 34 aborted due to send error; error 104.
13.05.2019 15:25:51 ANR3178E A communication error occurred during
session 65294 with replication server TSM.
13.05.2019 15:25:51 ANR0479W Session 65294 for server TSM (Windows)
terminated - connection with server severed.
13.05.2019 15:25:51 ANR8213E Socket 34 aborted due to send error; error 32.

=> Any idea why this replication aborts?

=> why is there a "socket abortion error"?
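
On the two numbers in ANR8213E: the target SM283 is reported as Linux/x86_64 above, so they may simply be Linux errno values. That the server passes the operating system's errno through unchanged is an assumption. Under that assumption they decode as follows (the table is hard-coded with the Linux values so the sketch gives the same answer on any platform):

```python
# Linux x86_64 errno values that appear in the ANR8213E lines above.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

def decode(code):
    """Translate a numeric socket error into its Linux errno name and text."""
    name, text = LINUX_ERRNO.get(code, ("?", "unknown"))
    return f"error {code}: {name} ({text})"

for code in (104, 32):
    print(decode(code))
```

If that reading is right, both errors say that the other side reset or closed the TCP connection while this server was still sending, which would fit the ANR0483W termination logged on the source server.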


well, we already opened an SR case and sent lots of logs and traces. As IBM
suspects a network problem, both servers now use a cross-link connection
with nothing but NICs/GBICs, plugs and wires in between.

thanks & best

Bjørn

--
--------------------------------------------------------------------------------------------------
Bjørn Nachtwey

Working group "IT Infrastructure"
Phone: +49 551 201-2181, e-mail: bjoern.nachtwey@gwdg.de
--------------------------------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: http://www.gwdg.de
Phone: +49 551 201-1510, Fax: +49 551 201-2150, e-mail: gwdg@gwdg.de
Service hotline: Phone: +49 551 201-1523, e-mail: support@gwdg.de
Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Christian Griesinger
Registered office: Göttingen
Court of registration: Göttingen, commercial register no. B 598
--------------------------------------------------------------------------------------------------
Certified according to ISO 9001
--------------------------------------------------------------------------------------------------
This message was imported via the External PhorumMail Module
Did you use something like iperf with a long and heavy load? A bad NIC or
driver might cause this, so it might still be the network.

On Mon, May 13, 2019 at 4:15 PM Bjørn Nachtwey <bjoern.nachtwey@gwdg.de>
wrote:

> [...]
This message was imported via the External PhorumMail Module
Hi,

ok, i need to share some additional information :-)

first i observed a huge number of dropped packets on one server. so we
did some tests including iperf, which showed nearly the physical
bandwidth. we also took a tcpdump and handed it to our network guys, who
didn't see anything giving a hint except that there was broadcast traffic
which might cause the dropped packets. Starting with a LACP trunk, we
broke it up and tried both NICs separately, swapped the SFP modules
with each other and also tested completely different ones.
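
If the server showing the drops is the Linux one, the per-interface drop counters can be read from `/proc/net/dev` and compared before and after a replication run. A minimal sketch: the sample text and the interface name `eth1` are invented for illustration, only the column layout is the kernel's.

```python
# Made-up sample in the /proc/net/dev layout (two header lines, then
# one line per interface with 8 receive and 8 transmit counters).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth1: 987654321 1234567 0 4321 0 0 0 89 123456789 765432 0 7 0 0 0 0
"""

def drop_counters(proc_net_dev_text):
    """Return {interface: (rx_drop, tx_drop)} from /proc/net/dev content."""
    result = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        # Receive block: bytes packets errs drop ... (drop is index 3);
        # the transmit block starts at index 8, so its drop is index 11.
        result[name.strip()] = (int(fields[3]), int(fields[11]))
    return result

print(drop_counters(SAMPLE))
```

On a live system the input would be the content of `/proc/net/dev` itself, read twice some minutes apart so that only the increase is compared.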

then we switched to a dedicated private network for the ISP/TSM
server-to-server traffic only, i.e. both servers and one dedicated
switch -- on this connection no packets got lost or were dropped. for
this setup we also took traces, but got no helpful answer.

The last approach was to connect both servers directly by crossing the
fiber cables, but the problem still remains.

by now the ordinary client traffic is handled on each server using one
nic and for the server2server connection we have this crosslink
connection on a second nic.

i do wonder because the exports of the nodes ran as expected: they got
suspended when not enough drives were available or the staging was full
on the destination server, but finished without any problems, for
small nodes as well as for large ones (> 10 TB primary data and/or
> 10 million files). even stranger: some replications do work, including
those of larger nodes. If we had a problem with the driver, the exports
shouldn't finish and all replications should run into an error,
shouldn't they?


my problem is that i have no idea why the replication fails. The error
messages are not clear to me.


APAR IC92088
(https://www-01.ibm.com/support/docview.wss?uid=swg1IC92088) says it's
caused by network timeouts, but this note is about TSM 6.3 and states
"This problem was fixed". Unfortunately the corresponding technote
(http://www-01.ibm.com/support/docview.wss?uid=swg1642715) isn't
available any more.

well, we also increased several timeout settings on the target server:

CommTimeOut 600
AdminIdleTimeOut 180
AdminCommTimeOut 180

now i will increase IdleTimeOut to 600, too.

but with the options "KeepAliveTime 300" and "KeepAliveInterval 30" set
on both servers, i expect idle connections to be refreshed every
5 minutes, i.e. well within the idle timeouts.

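One thing worth double-checking here is the unit of each option. The unit table in the sketch below is an assumption taken from memory of the server options reference (CommTimeOut, AdminCommTimeOut, KeepAliveTime and KeepAliveInterval in seconds, IdleTimeOut and AdminIdleTimeOut in minutes); please verify it for your server level. Under that assumption the values from this post compare like this:

```python
# Seconds per unit for each option. ASSUMPTION: units as remembered from
# the server options reference, not confirmed anywhere in this thread.
UNIT_SECONDS = {
    "CommTimeOut": 1,
    "AdminCommTimeOut": 1,
    "KeepAliveTime": 1,
    "KeepAliveInterval": 1,
    "IdleTimeOut": 60,
    "AdminIdleTimeOut": 60,
}

# Values quoted in this post (IdleTimeOut 600 is the planned value).
SETTINGS = {
    "CommTimeOut": 600,
    "AdminIdleTimeOut": 180,
    "AdminCommTimeOut": 180,
    "IdleTimeOut": 600,
    "KeepAliveTime": 300,
    "KeepAliveInterval": 30,
}

def in_seconds(settings):
    """Convert every option value to seconds using UNIT_SECONDS."""
    return {name: value * UNIT_SECONDS[name] for name, value in settings.items()}

for name, secs in in_seconds(SETTINGS).items():
    print(f"{name:18} {secs:6d} s")
```

If the units are as assumed, IdleTimeOut 600 would mean 10 hours, and a keepalive after 300 seconds of idle time stays below every idle timeout listed.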

thanks & best,

Bjørn



Stefan Folkerts wrote:
> Did you use something like iperf with a long and heavy load? a bad nic or
> driver might cause this, so it might still be the network.
This message was imported via the External PhorumMail Module