Discussion:
[OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
William Mitchell
2018-02-09 20:08:31 UTC
When I try to run an MPI program on a network with a shared file system and
connected by ethernet, I get the error message "tcp_peer_send_blocking:
send() to socket 9 failed: Broken pipe (32)" followed by some suggestions
of what could cause it, none of which are my problem. I have searched the
FAQ, mailing list archives, and googled the error message, with only a few
hits touching on it, none of which solved the problem.

This is on a Linux CentOS 7 system with Open MPI 1.10.6 and Intel Fortran
(more detailed system information below).

Here are details on how I encounter the problem:

***@host1> cat hellompi.f90
program hello
include 'mpif.h'
integer rank, size, ierror, nl
character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
call MPI_GET_PROCESSOR_NAME(hostname, nl, ierror)
print*, 'node', rank, ' of', size, ' on ', hostname(1:nl), ': Hello world'
call MPI_FINALIZE(ierror)
end

***@host1> mpifort --showme
ifort -I/usr/include/openmpi-x86_64 -pthread -m64 -I/usr/lib64/openmpi/lib
-Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags
-L/usr/lib64/openmpi/lib -lmpi_usempi -lmpi_mpifh -lmpi

***@host1> ifort --version
ifort (IFORT) 18.0.0 20170811
Copyright (C) 1985-2017 Intel Corporation. All rights reserved.

***@host1> mpifort -o hellompi hellompi.f90

[Note: it runs on 1 machine, but not on two]

***@host1> mpirun -np 2 hellompi
node 0 of 2 on host1.domain: Hello world
node 1 of 2 on host1.domain: Hello world

***@host1> cat hosts
host2.domain
host1.domain

***@host1> mpirun -np 2 --hostfile hosts hellompi
[host2.domain:250313] [[46562,0],1] tcp_peer_send_blocking: send() to
socket 9 failed: Broken pipe (32)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
[suggested causes deleted]

Here is system information:

***@host2> cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

***@host1> uname -a
Linux host1.domain 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58
UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

***@host1> rpm -qa | grep openmpi
mpitests-openmpi-4.1-1.el7.x86_64
openmpi-1.10.6-2.el7.x86_64
openmpi-devel-1.10.6-2.el7.x86_64

***@host1> ompi_info --all
[Results of this command for each host are in the attached files.]

***@host1> ompi_info -v ompi full --parsable
ompi_info: Error: unknown option "-v"
[Is the request to run that command given on the Open MPI "Getting Help"
web page an error?]

***@host1> printenv | grep OMPI
MPI_COMPILER=openmpi-x86_64
OMPI_F77=ifort
OMPI_FC=ifort
OMPI_MCA_mpi_yield_when_idle=1
OMPI_MCA_btl=tcp,self

I am using ssh-agent, and I can ssh between the two hosts. In fact, from
host1 I can use ssh to request that host2 ssh back to host1:

***@host1> ssh -A host2 "ssh host1 hostname"
host1.domain

Any suggestions on how to solve this problem are appreciated.

Bill
George Bosilca
2018-02-09 21:58:22 UTC
What are the firewall settings on your two nodes?

George.
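
A firewall that admits ssh (port 22) can still drop the other TCP connections that Open MPI's daemons open back to mpirun, which would be consistent with the broken-pipe error above. The following is a minimal sketch, not part of the original thread, for probing whether a given TCP port on a peer is reachable. It assumes bash (for its /dev/tcp redirection) and the coreutils timeout command; the function name port_reachable and port 46000 are hypothetical. A probe fails both when the port is filtered and when nothing is listening on it, so it is most informative while a listener is running on the far side.

```shell
# port_reachable HOST PORT
# Returns 0 if a TCP connection to HOST:PORT can be opened within 3 seconds,
# non-zero otherwise (filtered, refused, unresolvable host, or missing tools).
port_reachable() {
    # bash -c receives HOST as $0 and PORT as $1; /dev/tcp is a bash feature.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (hypothetical port; run on host1 while something listens on host2):
#   port_reachable host2.domain 22    && echo "ssh port open"
#   port_reachable host2.domain 46000 || echo "port 46000 blocked or not listening"
```

Running the same probe in both directions (host1 to host2 and host2 to host1) matters here, because the remote daemon has to connect back to the node where mpirun was started.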
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
William Mitchell
2018-02-12 12:23:13 UTC
Thanks, George. My sysadmin now says he is pretty sure it is the firewall,
but that "isn't going to change" so we need to find a solution.
Gilles Gouaillardet
2018-02-12 12:32:02 UTC
William,

On a typical HPC cluster, the internal interface is not protected by
the firewall.
If that interface is eth0, then you can restrict Open MPI to it:

mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ...

If only a small range of ports is available, then you will also need to use
the oob_tcp_dynamic_ipv4_ports, btl_tcp_port_min_v4 and
btl_tcp_port_range_v4 MCA params in order to tell Open MPI which range of
ports is open.
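
As a rough illustration of the advice above, the sketch below assembles an mpirun command line that pins both the out-of-band and TCP BTL traffic to one interface and to a fixed port window. Everything numeric is an assumption: the window 46000 to 46031 is hypothetical and must be replaced by ports the firewall really allows, and splitting it into two disjoint halves is just one cautious choice. The value syntax for oob_tcp_dynamic_ipv4_ports (a first-last range) is assumed here, so check it with ompi_info on your installation. The script only prints the command; it does not run it.

```shell
# Hypothetical values: adjust to the interface and ports open on your cluster.
IFACE=eth0
PORT_MIN=46000      # first open port (assumption)
PORT_COUNT=32       # number of consecutive open ports (assumption)

# Give the first half of the window to the out-of-band (OOB) channel
# and the second half to the TCP BTL, so the two never compete for a port.
OOB_COUNT=$((PORT_COUNT / 2))
BTL_MIN=$((PORT_MIN + OOB_COUNT))
BTL_COUNT=$((PORT_COUNT - OOB_COUNT))
OOB_LAST=$((BTL_MIN - 1))

MCA_ARGS="--mca oob_tcp_if_include $IFACE --mca btl_tcp_if_include $IFACE"
MCA_ARGS="$MCA_ARGS --mca oob_tcp_dynamic_ipv4_ports $PORT_MIN-$OOB_LAST"
MCA_ARGS="$MCA_ARGS --mca btl_tcp_port_min_v4 $BTL_MIN"
MCA_ARGS="$MCA_ARGS --mca btl_tcp_port_range_v4 $BTL_COUNT"

# Print the resulting command for inspection instead of executing it.
echo "mpirun -np 2 --hostfile hosts $MCA_ARGS hellompi"
```

The same parameters can also be set through the environment with the OMPI_MCA_ prefix, in the style of the OMPI_MCA_btl=tcp,self setting shown in the printenv output earlier in the thread, for example OMPI_MCA_btl_tcp_if_include=eth0.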

Cheers,

Gilles