Discussion:
[OMPI users] Problems with IPoIB and Openib
Allan Overstreet
2017-05-26 22:28:24 UTC
I have been having some issues using Open MPI with TCP over IPoIB and
with openib. The problems arise when I run a program that uses basic
collective communication. The two programs I have been using are
attached.

*** IPoIB ***

The mpirun command I am using to run MPI over IPoIB is:
mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_include
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes
-np 8 ./avg 8000

This program will appear to run on the nodes, but will sit at 100% CPU
and use no memory. On the host node the following error is printed:

[sm1][[58411,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.3 failed: No route to host (113)

Using another program,

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes
-np 8 ./congrad 800
Produces the following result. This program will also run on the nodes
sm1, sm2, sm3, and sm4 at 100% CPU and use no memory.
[sm3][[61383,1],4][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],6][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],3][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.2 failed: No route to host (113)
[sm3][[61383,1],5][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.2 failed: No route to host (113)
[sm1][[61383,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.3 failed: No route to host (113)
[sm1][[61383,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.1.0.3 failed: No route to host (113)

*** openib ***

Running the avg program over openib will produce the following result,
mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include
10.1.0.0/24 -hostfile nodes -np 8 ./avg 800
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: sm2.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_bml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send()
to socket 29 failed: Broken pipe (32)
[sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in
file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],1] usock_peer_accept:
usock_peer_send_connect_ack failed
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send()
to socket 27 failed: Broken pipe (32)
[sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in
file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],0] usock_peer_accept:
usock_peer_send_connect_ack failed
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[smd:31760] 4 more processes have sent help message help-mca-base.txt /
find-available:not-valid
[smd:31760] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
[smd:31760] 4 more processes have sent help message help-mpi-runtime.txt
/ mpi_init:startup:internal-failure
=== Later errors printed out on the host node ===
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: sm3
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: sm1
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: sm2
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: sm4
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
The ./avg process was not created on any of the nodes.
Running the ./congrad program,
mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include
10.1.0.0/24 -hostfile nodes -np 8 ./congrad 800
will result in the following errors,

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: sm3.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_bml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send()
to socket 29 failed: Broken pipe (32)
[sm1.overst.local:32271] [[57834,0],1] ORTE_ERROR_LOG: Unreachable in
file oob_usock_connection.c at line 316
[sm1.overst.local:32271] [[57834,0],1]-[[57834,1],0] usock_peer_accept:
usock_peer_send_connect_ack failed
[sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send()
to socket 27 failed: Broken pipe (32)
[sm1.overst.local:32271] [[57834,0],1] ORTE_ERROR_LOG: Unreachable in
file oob_usock_connection.c at line 316
[sm1.overst.local:32271] [[57834,0],1]-[[57834,1],1] usock_peer_accept:
usock_peer_send_connect_ack failed
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[smd:32088] 5 more processes have sent help message help-mca-base.txt /
find-available:not-valid
[smd:32088] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
[smd:32088] 5 more processes have sent help message help-mpi-runtime.txt
/ mpi_init:startup:internal-failure

These mpirun commands run successfully with a test program that uses
only point-to-point communication.

The nodes are interconnected in the following way. Every node has a dual 1Gb
bonded Ethernet interface (bond0) connected to a Gb Ethernet switch and an
InfiniBand interface (ib0) connected to a Voltaire 4036 QDR switch:

HOST: smd     Bond0 IP: 192.168.1.200   Infiniband Card: MHQH29B-XTR      Ib0 IP: 10.1.0.1   OS: Ubuntu Mate
HOST: sm1     Bond0 IP: 192.168.1.196   Infiniband Card: QLOGIC QLE7340   Ib0 IP: 10.1.0.2   OS: Centos 7 Minimal
HOST: sm2     Bond0 IP: 192.168.1.199   Infiniband Card: QLOGIC QLE7340   Ib0 IP: 10.1.0.3   OS: Centos 7 Minimal
HOST: sm3     Bond0 IP: 192.168.1.203   Infiniband Card: QLOGIC QLE7340   Ib0 IP: 10.1.0.4   OS: Centos 7 Minimal
HOST: sm4     Bond0 IP: 192.168.1.204   Infiniband Card: QLOGIC QLE7340   Ib0 IP: 10.1.0.5   OS: Centos 7 Minimal
HOST: dl580   Bond0 IP: 192.168.1.201   Infiniband Card: QLOGIC QLE7340   Ib0 IP: 10.1.0.6   OS: Centos 7 Minimal

Thanks for the help again.

Sincerely,

Allan Overstreet
g***@rist.or.jp
2017-05-27 15:25:27 UTC
Allan,

About IPoIB, the error message (no route to host) is very puzzling.
Did you double-check that IPoIB is OK between all nodes?
The error message suggests IPoIB is not working between sm3 and sm4;
this could be caused by the subnet manager, or a firewall.
ping is the first tool you should use to test that, then you can use nc
(netcat).
for example, on sm4
nc -l 1234
on sm3
echo hello | nc 10.1.0.5 1234
(expected result: "hello" should be displayed on sm4)
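
If ping and nc work but MPI still cannot connect, it can also help to pin
the TCP BTL to a known port range and test one of those ports by hand; this
is only a sketch, assuming an Open MPI version that has the
btl_tcp_port_min_v4 / btl_tcp_port_range_v4 parameters:
restrict the TCP BTL to ports 10000-10099
mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 ...
then, on sm4
nc -l 10000
and on sm3
echo hello | nc 10.1.0.5 10000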

About openib, you first need to double-check that the btl/openib component
was built. Assuming you did not configure with --disable-dlopen, you should
have a mca_btl_openib.so file in /.../lib/openmpi. It should be accessible
by the user, and
ldd /.../lib/openmpi/mca_btl_openib.so
should not have any unresolved dependencies on *all* your nodes.
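
A quick way to check that on all nodes at once (just a sketch, assuming
passwordless ssh and the same install prefix everywhere; keep the /.../
placeholder for your actual prefix):

for h in smd sm1 sm2 sm3 sm4 dl580; do
    echo "== $h =="
    ssh $h 'ldd /.../lib/openmpi/mca_btl_openib.so | grep "not found"'
done

an empty result for a node means no missing dependencies there.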

Cheers,

Gilles

Gilles Gouaillardet
2017-05-29 04:53:35 UTC
Allan,


the "No route to host" error indicates there is something going wrong
with IPoIB on your cluster

(and Open MPI is not involved whatsoever in that)

on sm3 and sm4, you can run

/sbin/ifconfig

brctl show

iptables -L

iptables -t nat -L

we might be able to figure out what is going wrong from that.
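
if it is easier, you can grab all of that from the node you run mpirun on
in one shot (just a sketch, assuming root ssh access to the nodes):

for h in sm3 sm4; do
    echo "===== $h ====="
    ssh root@$h '/sbin/ifconfig; brctl show; iptables -L; iptables -t nat -L'
done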


If there is no mca_btl_openib.so component, it is likely the InfiniBand
headers were not available on the node where you compiled Open MPI.

i guess if you configure Open MPI with

--with-verbs

it will abort if the headers are not found.

in this case, simply install them and rebuild Open MPI.
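
On CentOS 7 the headers typically come from the libibverbs-devel (or
rdma-core-devel) package, on Ubuntu from libibverbs-dev. A rebuild along
these lines is only a sketch (the prefix below matches the install path you
mentioned; the source directory is whatever you built from):

yum install libibverbs-devel
cd /path/to/openmpi-source
./configure --prefix=/home/allan/software/openmpi/install --with-verbs
make -j 4
make install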

if you are unsure about that part, please compress and post your
config.log so we can have a look at it


Cheers,


gilles
Gilles,
I was able to ping sm4 from sm3 and sm3 from sm4. However, running
netcat between sm4 and sm5 with the commands you suggested fails with:
Ncat: No route to host.
Testing this on other nodes gives the same:
Ncat: No route to host.
These nodes do not have firewalls installed, so I am confused why this
traffic isn't getting through.
I am compiling Open MPI from source, and the shared library
/home/allan/software/openmpi/install/lib/openmpi/mca_btl_openib.so
doesn't exist.
Gilles Gouaillardet
2017-05-30 01:23:13 UTC
Allan,


Note that you do not have to use the *-ib hostnames in your host_file;
these are only used for SSH. Since oob/tcp is running on your Ethernet
network, I guess you really want to use the sm3 and sm4 host names.
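i.e. a host_file like this (the same layout as the one you posted, just
with the plain hostnames) should do:

sm3 slots=2
sm4 slots=2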


Did you also run the same netcat test in the other direction?
Do you run 'mpirun' on sm3?

here are a few tests you can perform
- run tcp over ethernet
mpirun --mca btl_tcp_if_include 192.168.1.0/24 ...
- run all 4 tasks on sm3 (host_file contains one line "sm3 slots=4")
with tcp (e.g. --mca btl tcp,self)
- run with verbose oob and tcp
mpirun --mca btl_base_verbose 100 --mca oob_base_verbose 100 ...
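for example, combining the verbose options with your earlier command and
host_file (using the btl_tcp_if_include spelling of the parameter):

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 10.1.0.0/24 \
       --mca pml ob1 --mca btl tcp,sm,vader,self \
       --mca btl_base_verbose 100 --mca oob_base_verbose 100 \
       -hostfile host_file -np 4 ./congrad 400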

when your app hangs, you can manually run
pstack <pid>
on the 4 MPI tasks
so we can get an idea of where they are stuck
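for example, on each node something like this (a sketch, assuming pstack is
installed and the binary is ./congrad) grabs all local ranks at once:

for pid in $(pgrep -f ./congrad); do
    echo "--- pid $pid ---"
    pstack $pid
done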

Cheers,

Gilles
Gilles,
Open MPI is now working using openib on nodes sm3 and sm4! However, I am
still having some trouble getting Open MPI to work over IPoIB. Using
the command,
mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_include
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile
host_file -np 4 ./congrad 400
with the hostfile,
sm3-ib slots=2
sm4-ib slots=2
will cause the command to hang.
I ran your netcat test again on sm3 and sm4; this time it printed:
hello
Thanks,
Allan
Post by g***@rist.or.jp
Allan,
a firewall is running on your nodes as evidenced by the iptables
outputs.
if you do not need it, then you can simply disable it.
otherwise, you can run
iptables -I INPUT -i ib0 -j ACCEPT
iptables -I OUTPUT -o ib0 -j ACCEPT
on all your nodes and that might help
- note this allows *all* traffic on IPoIB
- some other rules in the 'nat' table might block some traffic
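since CentOS 7 drives those rules through firewalld (the IN_public /
FWDI_public chains in your output are firewalld's), a persistent
alternative to the raw iptables commands above would be to put ib0 in the
trusted zone, for example:
firewall-cmd --permanent --zone=trusted --change-interface=ib0
firewall-cmd --reload
like the iptables rules, this allows *all* traffic on ib0.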
Cheers,
Gilles
************** ifconfig **************
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet 192.168.1.203 netmask 255.255.255.0 broadcast
192.168.1.255
inet6 fe80::225:90ff:fe51:aaad prefixlen 64 scopeid
0x20<link>
ether 00:25:90:51:aa:ad txqueuelen 1000 (Ethernet)
RX packets 7987 bytes 7158426 (6.8 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 4310 bytes 368291 (359.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp4s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 00:25:90:51:aa:ad txqueuelen 1000 (Ethernet)
RX packets 3970 bytes 3576526 (3.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2154 bytes 183276 (178.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfbde0000-fbdfffff
enp4s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 00:25:90:51:aa:ad txqueuelen 1000 (Ethernet)
RX packets 4017 bytes 3581900 (3.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2159 bytes 185665 (181.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfbd60000-fbd7ffff
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 10.1.0.4 netmask 255.255.255.0 broadcast 10.1.0.255
inet6 fe80::211:7500:79:90f6 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS
section in ifconfig(8).
infiniband
80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
txqueuelen 256 (InfiniBand)
RX packets 923 bytes 73596 (71.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 842 bytes 72724 (71.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1 (Local Loopback)
RX packets 80 bytes 7082 (6.9 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 80 bytes 7082 (6.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet 192.168.1.204 netmask 255.255.255.0 broadcast
192.168.1.255
inet6 fe80::225:90ff:fe27:9fe3 prefixlen 64 scopeid
0x20<link>
ether 00:25:90:27:9f:e3 txqueuelen 1000 (Ethernet)
RX packets 20815 bytes 8291279 (7.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 17168 bytes 2261794 (2.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp4s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 00:25:90:27:9f:e3 txqueuelen 1000 (Ethernet)
RX packets 10365 bytes 4157996 (3.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8584 bytes 1122518 (1.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfbde0000-fbdfffff
enp4s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether 00:25:90:27:9f:e3 txqueuelen 1000 (Ethernet)
RX packets 10450 bytes 4133283 (3.9 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8586 bytes 1139860 (1.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfbd60000-fbd7ffff
ib0: flags=4099<UP,BROADCAST,MULTICAST> mtu 2044
inet 10.1.0.5 netmask 255.255.255.0 broadcast 10.1.0.255
inet6 fe80::211:7500:79:86e4 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS
section in ifconfig(8).
infiniband
80:00:00:03:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
txqueuelen 256 (InfiniBand)
RX packets 902 bytes 72448 (70.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 832 bytes 71932 (70.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1 (Local Loopback)
RX packets 76 bytes 6430 (6.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 76 bytes 6430 (6.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
******************** brctl show ************************
bridge name bridge id STP enabled interfaces
bridge name bridge id STP enabled interfaces
******************* iptables -L ***********************
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate
RELATED,ESTABLISHED
ACCEPT all -- anywhere anywhere
INPUT_direct all -- anywhere anywhere
INPUT_ZONES_SOURCE all -- anywhere anywhere
INPUT_ZONES all -- anywhere anywhere
DROP all -- anywhere anywhere ctstate INVALID
REJECT all -- anywhere anywhere reject-with
icmp-host-prohibited
Chain FORWARD (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate
RELATED,ESTABLISHED
ACCEPT all -- anywhere anywhere
FORWARD_direct all -- anywhere anywhere
FORWARD_IN_ZONES_SOURCE all -- anywhere anywhere
FORWARD_IN_ZONES all -- anywhere anywhere
FORWARD_OUT_ZONES_SOURCE all -- anywhere anywhere
FORWARD_OUT_ZONES all -- anywhere anywhere
DROP all -- anywhere anywhere ctstate INVALID
REJECT all -- anywhere anywhere reject-with
icmp-host-prohibited
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
OUTPUT_direct all -- anywhere anywhere
Chain FORWARD_IN_ZONES (1 references)
target prot opt source destination
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
Chain FORWARD_IN_ZONES_SOURCE (1 references)
target prot opt source destination
Chain FORWARD_OUT_ZONES (1 references)
target prot opt source destination
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
Chain FORWARD_OUT_ZONES_SOURCE (1 references)
target prot opt source destination
Chain FORWARD_direct (1 references)
target prot opt source destination
Chain FWDI_public (5 references)
target prot opt source destination
FWDI_public_log all -- anywhere anywhere
FWDI_public_deny all -- anywhere anywhere
FWDI_public_allow all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere
Chain FWDI_public_allow (1 references)
target prot opt source destination
Chain FWDI_public_deny (1 references)
target prot opt source destination
Chain FWDI_public_log (1 references)
target prot opt source destination
Chain FWDO_public (5 references)
target prot opt source destination
FWDO_public_log all -- anywhere anywhere
FWDO_public_deny all -- anywhere anywhere
FWDO_public_allow all -- anywhere anywhere
Chain FWDO_public_allow (1 references)
target prot opt source destination
Chain FWDO_public_deny (1 references)
target prot opt source destination
Chain FWDO_public_log (1 references)
target prot opt source destination
Chain INPUT_ZONES (1 references)
target prot opt source destination
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
Chain INPUT_ZONES_SOURCE (1 references)
target prot opt source destination
Chain INPUT_direct (1 references)
target prot opt source destination
Chain IN_public (5 references)
target prot opt source destination
IN_public_log all -- anywhere anywhere
IN_public_deny all -- anywhere anywhere
IN_public_allow all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere
Chain IN_public_allow (1 references)
target prot opt source destination
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh
ctstate NEW
Chain IN_public_deny (1 references)
target prot opt source destination
Chain IN_public_log (1 references)
target prot opt source destination
Chain OUTPUT_direct (1 references)
target prot opt source destination
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate
RELATED,ESTABLISHED
ACCEPT all -- anywhere anywhere
INPUT_direct all -- anywhere anywhere
INPUT_ZONES_SOURCE all -- anywhere anywhere
INPUT_ZONES all -- anywhere anywhere
DROP all -- anywhere anywhere ctstate INVALID
REJECT all -- anywhere anywhere reject-with
icmp-host-prohibited
Chain FORWARD (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate
RELATED,ESTABLISHED
ACCEPT all -- anywhere anywhere
FORWARD_direct all -- anywhere anywhere
FORWARD_IN_ZONES_SOURCE all -- anywhere anywhere
FORWARD_IN_ZONES all -- anywhere anywhere
FORWARD_OUT_ZONES_SOURCE all -- anywhere anywhere
FORWARD_OUT_ZONES all -- anywhere anywhere
DROP all -- anywhere anywhere ctstate INVALID
REJECT all -- anywhere anywhere reject-with
icmp-host-prohibited
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
OUTPUT_direct all -- anywhere anywhere
Chain FORWARD_IN_ZONES (1 references)
target prot opt source destination
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
FWDI_public all -- anywhere anywhere [goto]
Chain FORWARD_IN_ZONES_SOURCE (1 references)
target prot opt source destination
Chain FORWARD_OUT_ZONES (1 references)
target prot opt source destination
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
FWDO_public all -- anywhere anywhere [goto]
Chain FORWARD_OUT_ZONES_SOURCE (1 references)
target prot opt source destination
Chain FORWARD_direct (1 references)
target prot opt source destination
Chain FWDI_public (5 references)
target prot opt source destination
FWDI_public_log all -- anywhere anywhere
FWDI_public_deny all -- anywhere anywhere
FWDI_public_allow all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere
Chain FWDI_public_allow (1 references)
target prot opt source destination
Chain FWDI_public_deny (1 references)
target prot opt source destination
Chain FWDI_public_log (1 references)
target prot opt source destination
Chain FWDO_public (5 references)
target prot opt source destination
FWDO_public_log all -- anywhere anywhere
FWDO_public_deny all -- anywhere anywhere
FWDO_public_allow all -- anywhere anywhere
Chain FWDO_public_allow (1 references)
target prot opt source destination
Chain FWDO_public_deny (1 references)
target prot opt source destination
Chain FWDO_public_log (1 references)
target prot opt source destination
Chain INPUT_ZONES (1 references)
target prot opt source destination
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
IN_public all -- anywhere anywhere [goto]
Chain INPUT_ZONES_SOURCE (1 references)
target prot opt source destination
Chain INPUT_direct (1 references)
target prot opt source destination
Chain IN_public (5 references)
target prot opt source destination
IN_public_log all -- anywhere anywhere
IN_public_deny all -- anywhere anywhere
IN_public_allow all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere
Chain IN_public_allow (1 references)
target prot opt source destination
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh
ctstate NEW
Chain IN_public_deny (1 references)
target prot opt source destination
Chain IN_public_log (1 references)
target prot opt source destination
Chain OUTPUT_direct (1 references)
target prot opt source destination
********************** iptables -t nat -L *******************
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
PREROUTING_direct all -- anywhere anywhere
PREROUTING_ZONES_SOURCE all -- anywhere anywhere
PREROUTING_ZONES all -- anywhere anywhere
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
OUTPUT_direct all -- anywhere anywhere
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
POSTROUTING_direct all -- anywhere anywhere
POSTROUTING_ZONES_SOURCE all -- anywhere anywhere
POSTROUTING_ZONES all -- anywhere anywhere
Chain OUTPUT_direct (1 references)
target prot opt source destination
Chain POSTROUTING_ZONES (1 references)
target prot opt source destination
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
Chain POSTROUTING_ZONES_SOURCE (1 references)
target prot opt source destination
Chain POSTROUTING_direct (1 references)
target prot opt source destination
Chain POST_public (5 references)
target prot opt source destination
POST_public_log all -- anywhere anywhere
POST_public_deny all -- anywhere anywhere
POST_public_allow all -- anywhere anywhere
Chain POST_public_allow (1 references)
target prot opt source destination
Chain POST_public_deny (1 references)
target prot opt source destination
Chain POST_public_log (1 references)
target prot opt source destination
Chain PREROUTING_ZONES (1 references)
target prot opt source destination
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
Chain PREROUTING_ZONES_SOURCE (1 references)
target prot opt source destination
Chain PREROUTING_direct (1 references)
target prot opt source destination
Chain PRE_public (5 references)
target prot opt source destination
PRE_public_log all -- anywhere anywhere
PRE_public_deny all -- anywhere anywhere
PRE_public_allow all -- anywhere anywhere
Chain PRE_public_allow (1 references)
target prot opt source destination
Chain PRE_public_deny (1 references)
target prot opt source destination
Chain PRE_public_log (1 references)
target prot opt source destination
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
PREROUTING_direct all -- anywhere anywhere
PREROUTING_ZONES_SOURCE all -- anywhere anywhere
PREROUTING_ZONES all -- anywhere anywhere
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
OUTPUT_direct all -- anywhere anywhere
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
POSTROUTING_direct all -- anywhere anywhere
POSTROUTING_ZONES_SOURCE all -- anywhere anywhere
POSTROUTING_ZONES all -- anywhere anywhere
Chain OUTPUT_direct (1 references)
target prot opt source destination
Chain POSTROUTING_ZONES (1 references)
target prot opt source destination
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
POST_public all -- anywhere anywhere [goto]
Chain POSTROUTING_ZONES_SOURCE (1 references)
target prot opt source destination
Chain POSTROUTING_direct (1 references)
target prot opt source destination
Chain POST_public (5 references)
target prot opt source destination
POST_public_log all -- anywhere anywhere
POST_public_deny all -- anywhere anywhere
POST_public_allow all -- anywhere anywhere
Chain POST_public_allow (1 references)
target prot opt source destination
Chain POST_public_deny (1 references)
target prot opt source destination
Chain POST_public_log (1 references)
target prot opt source destination
Chain PREROUTING_ZONES (1 references)
target prot opt source destination
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
PRE_public all -- anywhere anywhere [goto]
Chain PREROUTING_ZONES_SOURCE (1 references)
target prot opt source destination
Chain PREROUTING_direct (1 references)
target prot opt source destination
Chain PRE_public (5 references)
target prot opt source destination
PRE_public_log all -- anywhere anywhere
PRE_public_deny all -- anywhere anywhere
PRE_public_allow all -- anywhere anywhere
Chain PRE_public_allow (1 references)
target prot opt source destination
Chain PRE_public_deny (1 references)
target prot opt source destination
Chain PRE_public_log (1 references)
target prot opt source destination
I installed the libverbs dependencies and rebuilt Open MPI. The shared
library ../install/lib/openmpi/mca_btl_openib.so now exists.
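A quick sanity check after such a rebuild (a sketch only; the install prefix is the one given later in this thread and may differ on other systems) is to confirm the component is listed by ompi_info and that ldd resolves all of its dependencies on every node:
/home/allan/software/openmpi/install/bin/ompi_info | grep openib
ldd /home/allan/software/openmpi/install/lib/openmpi/mca_btl_openib.so | grep "not found"
If ompi_info prints an "MCA btl: openib" line and the ldd grep prints nothing, the openib BTL was built and links cleanly on that node.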
Post by g***@rist.or.jp
Allan,
the "No route to host" error indicates there is something going
wrong with IPoIB on your cluster
(and Open MPI is not involved whatsoever in that)
on sm3 and sm4, you can run
/sbin/ifconfig
brctl show
iptables -L
iptables -t nat -L
we might be able to figure out what is going wrong from that.
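For example, assuming passwordless ssh from the head node and passwordless sudo on the compute nodes (both assumptions), the output could be collected in one pass:
for h in sm3 sm4; do
    ssh $h 'hostname; /sbin/ifconfig; brctl show; sudo iptables -L -n; sudo iptables -t nat -L -n' > net-report-$h.txt
done
The resulting net-report-<host>.txt files can then be attached to the thread.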
if there is no mca_btl_openib.so component, it is likely the
infiniband headers are not available on the node where you compiled
Open MPI.
i guess if you configure Open MPI with
--with-verbs
it will abort if the headers are not found.
in this case, simply install them and rebuild Open MPI.
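For example, a rebuild along these lines (the version directory is illustrative; the prefix is the one used elsewhere in this thread) should either produce the openib component or stop at configure time with a clear error:
cd openmpi-<version>
./configure --prefix=/home/allan/software/openmpi/install --with-verbs
make -j4 && make install
On CentOS 7 the headers come from the libibverbs-devel package (libibverbs-dev on Ubuntu); with --with-verbs set, configure should abort rather than silently build without openib when they are missing.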
if you are unsure about that part, please compress and post your
config.log so we can have a look at it
Cheers,
gilles
Gilles,
I was able to ping sm4 from sm3 and sm3 from sm4. However, running
netcat between sm4 and sm5 using the commands
and
Ncat: No route to host.
Testing this on other nodes,
and
Ncat: No route to host.
These nodes do not have firewalls installed, so I am confused why
this traffic isn't getting through.
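One thing that may be worth double-checking here: the iptables -L output above shows firewalld-style chains (IN_public_allow only accepting ssh), and a CentOS 7 minimal install enables firewalld by default even when no firewall was ever set up by hand. Assuming firewalld is what generated those rules, a quick check on each node would be:
sudo firewall-cmd --state
sudo firewall-cmd --list-all
sudo systemctl stop firewalld        (temporarily, for testing only)
If the nc test and the mpirun jobs start working with firewalld stopped, the longer-term fix is to re-enable it and open the ports or interfaces Open MPI needs rather than leaving it disabled.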
I am compiling openmpi from source and the shared library
/home/allan/software/openmpi/install/lib/openmpi/mca_btl_openib.so
doesn't exist.
Post by g***@rist.or.jp
Allan,
about IPoIB, the error message (no route to host) is very puzzling.
did you double check IPoIB is ok between all nodes ?
this error message suggests IPoIB is not working between sm3 and sm4,
this could be caused by the subnet manager, or a firewall.
ping is the first tool you should use to test that, then you can use nc
(netcat).
for example, on sm4
nc -l 1234
on sm3
echo hello | nc 10.1.0.5 1234
(expected result: "hello" should be displayed on sm4)
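Since the TCP connections Open MPI opens use ports chosen at run time, it may also be worth repeating the test in the opposite direction and on another unprivileged port (1234 above and 45000 below are arbitrary choices):
on sm3
nc -l 45000
on sm4
echo hello | nc 10.1.0.4 45000
If any of these hang or print "No route to host", plain TCP over IPoIB is blocked independently of Open MPI.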
about openib, you first need to double check the btl/openib was built.
assuming you did not configure with --disable-dlopen, you should have a
mca_btl_openib.so
file in /.../lib/openmpi. it should be accessible by the user, and
ldd /.../lib/openmpi/mca_btl_openib.so
should not have any unresolved dependencies on *all* your nodes
Cheers,
Gilles
----- Original Message -----
Post by Allan Overstreet
I have been having some issues with using openmpi with tcp over IPoIB
and openib. The problems arise when I run a program that uses basic
collective communication. The two programs that I have been using are
attached.
*** IPoIB ***
The mpirun command I am using to run mpi over IPoIB is,
mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca
btl_tcp_include
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes
-np 8 ./avg 8000
This program will appear to run on the nodes, but will sit at 100% CPU
and use no memory. On the host node an error will be printed,
[sm1][[58411,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.3 failed: No route to host (113)
Using another program,
mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_
include
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes
-np 8 ./congrad 800
Produces the following result. This program will also run on the nodes
sm1, sm2, sm3, and sm4 at 100% and use no memory.
[sm3][[61383,1],4][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],6][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],3][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.2 failed: No route to host (113)
[sm3][[61383,1],5][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.2 failed: No route to host (113)
[sm1][[61383,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.3 failed: No route to host (113)
[sm1][[61383,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_
complete_connect]
connect() to 10.1.0.3 failed: No route to host (113)
*** openib ***
Running the congrad program over openib will produce the result,
mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_
include
10.1.0.0/24 -hostfile nodes -np 8 ./avg 800
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
abort,
*** and potentially your MPI job)
----------------------------------------------------------------------
----
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: sm2.overst.local
Framework: btl
Component: openib
----------------------------------------------------------------------
----
----------------------------------------------------------------------
----
It looks like MPI_INIT failed for some reason; your parallel process
is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
mca_bml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() to socket 29 failed: Broken pipe (32)
Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],1] usock_peer_send_connect_ack failed
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() to socket 27 failed: Broken pipe (32)
Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],0] usock_peer_send_connect_ack failed
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[smd:31760] 4 more processes have sent help message help-mca-base.txt / find-available:not-valid
[smd:31760] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[smd:31760] 4 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
=== Later errors printed out on the host node ===
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
Local host: sm3
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
Local host: sm1
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
Local host: sm2
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
Local host: sm4
Remote host: 10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
The ./avg process was not created on any of the nodes.
Running the ./congrad program,
mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include 10.1.0.0/24 -hostfile nodes -np 8 ./congrad 800
will result in the following errors,
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: sm3.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
mca_bml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send() to socket 29 failed: Broken pipe (32)
Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32271] [[57834,0],1]-[[57834,1],0] usock_peer_send_connect_ack failed
[sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send() to socket 27 failed: Broken pipe (32)
Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32271] [[57834,0],1]-[[57834,1],1] usock_peer_send_connect_ack failed
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[smd:32088] 5 more processes have sent help message help-mca-base.txt / find-available:not-valid
[smd:32088] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[smd:32088] 5 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
These mpirun commands do run successfully with a test program that uses
only point-to-point communication.
The nodes are interconnected in the following way.
HOST: smd
    Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.200
    Infiniband Card: MHQH29B-XTR, Ib0 IP: 10.1.0.1
    OS: Ubuntu Mate
HOST: sm1
    Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.196
    Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.2
    OS: Centos 7 Minimal
HOST: sm2
    Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.199
    Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.3
    OS: Centos 7 Minimal
HOST: sm3
    Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.203
    Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.4
    OS: Centos 7 Minimal
HOST: sm4
    Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.204
    Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.5
    OS: Centos 7 Minimal
HOST: dl580
    Dual 1Gb Ethernet Bonded, Bond0 IP: 192.168.1.201
    Infiniband Card: QLOGIC QLE7340, Ib0 IP: 10.1.0.6
    OS: Centos 7 Minimal
All Bond0 interfaces connect to a Gb Ethernet switch; all Ib0 interfaces connect to a Voltaire 4036 QDR switch.
Thanks for the help again.
Sincerely,
Allan Overstreet
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users