Discussion:
[OMPI users] IpV6 Openmpi mpirun failed
Mukkie
2017-10-18 18:18:27 UTC
Permalink
Hi,

I have two IPv6-only machines. I configured and built OMPI version 3.0 with --enable-ipv6.

I want to verify a simple MPI communication call over TCP/IP between
these two machines, using the ring_c and connectivity_c examples.



Issuing from one of the host machines:


[***@ipv-rhel73 examples]$ mpirun -hostfile host --mca btl tcp,self
--mca oob_base_verbose 100 ring_c

.
.

[ipv-rhel71a.locallab.local:10822] [[5331,0],1] tcp_peer_send_blocking:
send() to socket 20 failed: Broken pipe (32)


where “host” contains the IPv6 address of the remote machine (namely
‘ipv-rhel71a’). I also have passwordless SSH set up to the remote machine.
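For reference, a hostfile is a plain text file with one host per line. The actual file is not shown in the thread, so the contents below are hypothetical, as is the interface name `eth0`; restricting the OOB and BTL TCP traffic to a known-good interface with the `oob_tcp_if_include` and `btl_tcp_if_include` MCA parameters is a common way to rule out interface-selection problems:

```shell
# Hypothetical hostfile contents (a resolvable hostname or a raw IPv6 address)
cat > host <<'EOF'
ipv-rhel71a slots=1
EOF

# Sketch: pin OOB and BTL traffic to one interface (eth0 is an assumption)
mpirun -hostfile host --mca btl tcp,self \
       --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ring_c
```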



I will attach a verbose output in the follow-up post.

Thanks.



Cordially,



*Mukundhan Selvam*

Development Engineer, HPC

[image: MSC Software] <http://www.mscsoftware.com/>

4675 MacArthur Court, Newport Beach, CA 92660

714-540-8900 ext. 4166
Mukkie
2017-10-18 20:29:14 UTC
Permalink
Adding the verbose output. Please look at the failures and advise. Thank you.

[***@ipv-rhel73 examples]$ mpirun -hostfile host --mca oob_base_verbose
100 --mca btl tcp,self ring_c
[ipv-rhel73:10575] mca_base_component_repository_open: unable to open
mca_plm_tm: libtorque.so.2: cannot open shared object file: No such file or
directory (ignored)
[ipv-rhel73:10575] mca: base: components_register: registering framework
oob components
[ipv-rhel73:10575] mca: base: components_register: found loaded component
tcp
[ipv-rhel73:10575] mca: base: components_register: component tcp register
function successful
[ipv-rhel73:10575] mca: base: components_open: opening oob components
[ipv-rhel73:10575] mca: base: components_open: found loaded component tcp
[ipv-rhel73:10575] mca: base: components_open: component tcp open function
successful
[ipv-rhel73:10575] mca:oob:select: checking available component tcp
[ipv-rhel73:10575] mca:oob:select: Querying component [tcp]
[ipv-rhel73:10575] oob:tcp: component_available called
[ipv-rhel73:10575] WORKING INTERFACE 1 KERNEL INDEX 2 FAMILY: V6
[ipv-rhel73:10575] [[20058,0],0] oob:tcp:init adding
fe80::b9b:ac5d:9cf0:b858 to our list of V6 connections
[ipv-rhel73:10575] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[ipv-rhel73:10575] [[20058,0],0] oob:tcp:init rejecting loopback interface
lo
[ipv-rhel73:10575] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[ipv-rhel73:10575] [[20058,0],0] TCP STARTUP
[ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv4 port 0
[ipv-rhel73:10575] [[20058,0],0] assigned IPv4 port 53438
[ipv-rhel73:10575] [[20058,0],0] attempting to bind to IPv6 port 0
[ipv-rhel73:10575] [[20058,0],0] assigned IPv6 port 43370
[ipv-rhel73:10575] mca:oob:select: Adding component to end
[ipv-rhel73:10575] mca:oob:select: Found 1 active transports
[ipv-rhel73:10575] [[20058,0],0]: get transports
[ipv-rhel73:10575] [[20058,0],0]:get transports for component tcp
[ipv-rhel73:10575] mca_base_component_repository_open: unable to open
mca_ras_tm: libtorque.so.2: cannot open shared object file: No such file or
directory (ignored)
[ipv-rhel71a.locallab.local:12299] mca: base: components_register:
registering framework oob components
[ipv-rhel71a.locallab.local:12299] mca: base: components_register: found
loaded component tcp
[ipv-rhel71a.locallab.local:12299] mca: base: components_register:
component tcp register function successful
[ipv-rhel71a.locallab.local:12299] mca: base: components_open: opening oob
components
[ipv-rhel71a.locallab.local:12299] mca: base: components_open: found loaded
component tcp
[ipv-rhel71a.locallab.local:12299] mca: base: components_open: component
tcp open function successful
[ipv-rhel71a.locallab.local:12299] mca:oob:select: checking available
component tcp
[ipv-rhel71a.locallab.local:12299] mca:oob:select: Querying component [tcp]
[ipv-rhel71a.locallab.local:12299] oob:tcp: component_available called
[ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 1 KERNEL INDEX 2
FAMILY: V6
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init adding
fe80::226:b9ff:fe85:6a28 to our list of V6 connections
[ipv-rhel71a.locallab.local:12299] WORKING INTERFACE 2 KERNEL INDEX 1
FAMILY: V4
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:init rejecting
loopback interface lo
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] TCP STARTUP
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to IPv4
port 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv4 port 50782
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] attempting to bind to IPv6
port 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] assigned IPv6 port 59268
[ipv-rhel71a.locallab.local:12299] mca:oob:select: Adding component to end
[ipv-rhel71a.locallab.local:12299] mca:oob:select: Found 1 active transports
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]: get transports
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:get transports for
component tcp
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]: set_addr to uri
1314521088.0;tcp6://[fe80::b9b:ac5d:9cf0:b858]:43370
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:set_addr checking if peer
[[20058,0],0] is reachable via component tcp
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp: working peer
[[20058,0],0] address tcp6://[fe80::b9b:ac5d:9cf0:b858]:43370
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] SET_PEER ADDING PEER
[[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] set_peer: peer
[[20058,0],0] is listening on net fe80::b9b:ac5d:9cf0:b858 port 43370
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]: peer [[20058,0],0] is
reachable via component tcp
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] OOB_SEND:
rml_oob_send.c:265
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:base:send to target
[[20058,0],0] - attempt 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:send_nb to peer
[[20058,0],0]:10 seq = -1
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:204] processing
send to peer [[20058,0],0]:10 seq_num = -1 via [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:225] queue
pending to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp:send_nb: initiating
connection to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:239] connect to
[[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
attempting to connect to proc [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
attempting to connect to proc [[20058,0],0] on socket 20
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
attempting to connect to proc [[20058,0],0] on (null):-1 - 0 retries
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
Connection to proc [[20058,0],0] succeeded
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] SEND CONNECT ACK
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] send blocking of 72 bytes
to socket 20
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp_peer_send_blocking:
send() to socket 20 failed: Broken pipe (32)
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp_peer_close for
[[20058,0],0] sd 20 state FAILED
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp_connection.c:356]
connect to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp:lost connection called
for peer [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
attempting to connect to proc [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
attempting to connect to proc [[20058,0],0] on socket 20
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
attempting to connect to proc [[20058,0],0] on (null):-1 - 0 retries
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] orte_tcp_peer_try_connect:
Connection to proc [[20058,0],0] succeeded
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] SEND CONNECT ACK
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] send blocking of 72 bytes
to socket 20
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
[ipv-rhel73:10575] [[20058,0],0] TCP SHUTDOWN
[ipv-rhel73:10575] [[20058,0],0] TCP SHUTDOWN done
[ipv-rhel73:10575] mca: base: close: component tcp closed
[ipv-rhel73:10575] mca: base: close: unloading component tcp

Cordially,
Muku.
r***@open-mpi.org
2017-10-18 21:38:50 UTC
Permalink
Looks like there is a firewall or something blocking communication between those nodes?
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Mukkie
2017-10-18 22:52:04 UTC
Permalink
Thanks for your suggestion. However, my firewalls are already disabled on
both machines.
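Independent of Open MPI, raw IPv6 TCP connectivity can be sanity-checked with a short socket round-trip. The sketch below does this over the IPv6 loopback address on a single machine; to test between the two nodes, the listener half would run on one host and the client half on the other, with `::1` replaced by the peer's address. This is just a minimal sketch, not an Open MPI tool:

```python
import socket
import threading

def echo_once(server_sock):
    # Accept one connection and echo its payload back.
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(64))

# Bind an IPv6 TCP listener on the loopback address.
server = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
server.bind(("::1", 0))            # port 0: let the kernel pick a free port
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=echo_once, args=(server,))
t.start()

# Connect as a client over IPv6 and round-trip a message.
client = socket.create_connection(("::1", port), timeout=5)
client.sendall(b"ping")
reply = client.recv(64)
client.close()
t.join()
server.close()

print(reply)  # b'ping' if IPv6 TCP works end to end
```

If the inter-node version of this hangs or fails while firewalls are off, the problem is in routing or address scope rather than in Open MPI itself.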

Cordially,
Muku.
Post by r***@open-mpi.org
Looks like there is a firewall or something blocking communication between those nodes?
Mukkie
2017-10-19 22:57:35 UTC
Permalink
FWIW, my issue is related to this one.
https://github.com/open-mpi/ompi/issues/1585

I have version 3.0.0, and the above issue was closed saying the fixes went
into 3.1.0. However, I don't see the code changes addressing this issue.
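Worth noting: both addresses in the verbose output above (fe80::b9b:ac5d:9cf0:b858 and fe80::226:b9ff:fe85:6a28) are in the fe80::/10 link-local range, which is the address class that issue 1585 concerns; link-local peers generally need a zone index (e.g. fe80::1%eth0) to be reachable. A quick check with Python's standard library confirms the classification:

```python
import ipaddress

# The two interface addresses reported in the oob:tcp:init log lines above
for a in ("fe80::b9b:ac5d:9cf0:b858", "fe80::226:b9ff:fe85:6a28"):
    addr = ipaddress.IPv6Address(a)
    print(a, "link-local:", addr.is_link_local)  # prints True for both
```

If the machines also have global (or unique-local) IPv6 addresses, putting those in the hostfile instead of relying on link-local interfaces may sidestep the problem.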

Cordially,
Muku.
rml_oob_send.c:265
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:base:send to target
[[20058,0],0] - attempt 0
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] oob:tcp:send_nb to peer
[[20058,0],0]:10 seq = -1
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:204]
processing send to peer [[20058,0],0]:10 seq_num = -1 via [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:225] queue
pending to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp:send_nb: initiating
connection to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp.c:239] connect to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: attempting to connect to proc [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: attempting to connect to proc [[20058,0],0] on
socket 20
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: attempting to connect to proc [[20058,0],0] on
(null):-1 - 0 retries
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: Connection to proc [[20058,0],0] succeeded
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] SEND CONNECT ACK
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] send blocking of 72 bytes to socket 20
send() to socket 20 failed: Broken pipe (32)
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp_peer_close for
[[20058,0],0] sd 20 state FAILED
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]:[oob_tcp_connection.c:356]
connect to [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] tcp:lost connection
called for peer [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: attempting to connect to proc [[20058,0],0]
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: attempting to connect to proc [[20058,0],0] on
socket 20
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: attempting to connect to proc [[20058,0],0] on
(null):-1 - 0 retries
[ipv-rhel71a.locallab.local:12299] [[20058,0],1]
orte_tcp_peer_try_connect: Connection to proc [[20058,0],0] succeeded
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] SEND CONNECT ACK
[ipv-rhel71a.locallab.local:12299] [[20058,0],1] send blocking of 72 bytes to socket 20
------------------------------------------------------------
--------------
ORTE was unable to reliably start one or more daemons.
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
------------------------------------------------------------
--------------
[ipv-rhel73:10575] [[20058,0],0] TCP SHUTDOWN
[ipv-rhel73:10575] [[20058,0],0] TCP SHUTDOWN done
[ipv-rhel73:10575] mca: base: close: component tcp closed
[ipv-rhel73:10575] mca: base: close: unloading component tcp
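[Editor's note: one detail visible in the log above is that both daemons only advertised fe80:: addresses, which fall in the IPv6 link-local range (fe80::/10); such addresses generally need a zone index (e.g. fe80::1%eth0) to be usable for connections from another host. A quick sketch to classify the addresses from the log, assuming Python 3 is on hand; this was not part of the original run:]

```python
import ipaddress

# Addresses taken verbatim from the verbose output above.
addrs = ["fe80::b9b:ac5d:9cf0:b858", "fe80::226:b9ff:fe85:6a28"]

for a in addrs:
    ip = ipaddress.ip_address(a)
    # is_link_local is True for anything in fe80::/10; connecting to such
    # an address from another host requires a zone/scope index.
    print(a, "link-local:", ip.is_link_local)
```

Running this prints `link-local: True` for both addresses, which suggests the hosts may need routable (e.g. global or ULA) IPv6 addresses, or scope-aware handling, for the OOB connection to succeed.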
Cordially,
Muku.
Post by Mukkie
Hi,
I have two IPv6-only machines. I configured/built OMPI version 3.0 with --enable-ipv6.
I want to verify a simple MPI communication call over TCP/IP between these two machines. I am using the ring_c and connectivity_c examples.
Issuing from one of the host machines:
[***@ipv-rhel73 examples]$ mpirun -hostfile host --mca btl tcp,self --mca oob_base_verbose 100 ring_c
.
.
[ipv-rhel71a.locallab.local:10822] [[5331,0],1] tcp_peer_send_blocking: send() to socket 20 failed: Broken pipe (32)
where “host” contains the IPv6 address of the remote machine (namely ‘ipv-rhel71a’). Also, I have passwordless ssh set up to the remote machine.
I will attach a verbose output in the follow-up post.
Thanks.
Cordially,
*Mukundhan Selvam*
Development Engineer, HPC
[image: MSC Software] <http://www.mscsoftware.com/>
4675 MacArthur Court, Newport Beach, CA 92660
714-540-8900 ext. 4166
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
r***@open-mpi.org
2017-10-19 23:37:29 UTC
Permalink
Actually, I don’t see any related changes in OMPI master, let alone the branches. So far as I can tell, the author never actually submitted the work.
Post by Mukkie
FWIW, my issue is related to this one.
https://github.com/open-mpi/ompi/issues/1585
I have version 3.0.0, and the above issue was closed saying the fixes went into 3.1.0.
However, I don't see the code changes for this issue?
Cordially,
Muku.
Thanks for your suggestion. However, my firewalls are already disabled on both machines.
Cordially,
Muku.
Looks like there is a firewall or something blocking communication between those nodes?
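[Editor's note: with the firewalls reportedly disabled, a plain TCP-over-IPv6 round trip outside of Open MPI can help isolate whether the network path itself works. A minimal sketch, assuming Python 3 is available; "::1" (loopback) is a placeholder here, and in a real two-host test the server half would run on one machine with the client pointed at the other machine's IPv6 address:]

```python
import socket
import threading

def serve(listener):
    # Accept one connection and echo a single message back.
    conn, _ = listener.accept()
    conn.sendall(conn.recv(64))
    conn.close()

# Bind an IPv6 TCP listener on an ephemeral port.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.bind(("::1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=serve, args=(srv,), daemon=True).start()

# Connect over IPv6 and check the echo comes back intact.
cli = socket.create_connection(("::1", port), timeout=5)
cli.sendall(b"ping")
reply = cli.recv(64)
cli.close()
print(reply)  # b'ping' when IPv6 TCP works end to end
```

If this round trip fails between the two hosts with the real addresses substituted in, the problem is in the network path rather than in Open MPI.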
Post by Mukkie
Adding a verbose output. Please check for failed and advise. Thank you.
Cordially,
Muku.
Mukkie
2017-10-20 00:03:25 UTC
Permalink
OK, thanks. In that case, can we reopen this issue to get an update from the participants?
Cordially,
Muku.
Post by r***@open-mpi.org
Actually, I don’t see any related changes in OMPI master, let alone the
branches. So far as I can tell, the author never actually submitted the
work.