Discussion: [OMPI users] Fwd: srun works, mpirun does not
Bennet Fauber
2018-06-17 16:07:56 UTC
I have a compiled binary that will run with srun but not with mpirun.
The attempts to run with mpirun all result in failures to initialize.
I have tried this on one node and on two nodes, with the firewall turned
on and with it off.

Am I missing some command line option for mpirun?

OMPI was built from this configure command

$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
FC=gfortran

All tests from `make check` passed; see below.

[***@cavium-hpc ~]$ mpicc --show
gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
-L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
-Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
-Wl,--enable-new-dtags
-L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi

The test_mpi program was compiled with

$ gcc -o test_mpi test_mpi.c -lm

This is the runtime library path

[***@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib


These commands are given in the exact sequence in which they were entered
at a console.

[***@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 156
salloc: job 156 queued and waiting for resources
salloc: job 156 has been allocated resources
salloc: Granted job allocation 156

[***@cavium-hpc ~]$ mpirun ./test_mpi
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

[***@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.425439
The sum = 0.866386
Elapsed time is: 5.427427
The sum = 0.866386
Elapsed time is: 5.422579
The sum = 0.866386
Elapsed time is: 5.424168
The sum = 0.866386
Elapsed time is: 5.423951
The sum = 0.866386
Elapsed time is: 5.422414
The sum = 0.866386
Elapsed time is: 5.427156
The sum = 0.866386
Elapsed time is: 5.424834
The sum = 0.866386
Elapsed time is: 5.425103
The sum = 0.866386
Elapsed time is: 5.422415
The sum = 0.866386
Elapsed time is: 5.422948
Total time is: 59.668622

Thanks, -- bennet


make check results
----------------------------------------------

make check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
PASS: predefined_gap_test
PASS: predefined_pad_test
SKIP: dlopen_test
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 3
# PASS: 2
# SKIP: 1
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
[ elided ]
PASS: atomic_cmpset_noinline
- 5 threads: Passed
PASS: atomic_cmpset_noinline
- 8 threads: Passed
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 8
# PASS: 8
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
[ elided ]
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo
PASS: opal_fifo
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 10
# PASS: 10
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
[ elided ]
make opal_thread opal_condition
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
CC opal_thread.o
CCLD opal_thread
CC opal_condition.o
CCLD opal_condition
make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
make check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 0
# PASS: 0
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
[ elided ]
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
PASS: opal_datatype_test
PASS: unpack_hetero
PASS: checksum
PASS: position
PASS: position_noncontig
PASS: ddt_test
PASS: ddt_raw
PASS: unpack_ooo
PASS: ddt_pack
PASS: external32
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 10
# PASS: 10
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
[ elided ]
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
PASS: opal_bit_ops
PASS: opal_path_nfs
PASS: bipartite_graph
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 3
# PASS: 3
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
[ elided ]
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
PASS: dss_buffer
PASS: dss_cmp
PASS: dss_payload
PASS: dss_print
============================================================================
Testsuite summary for Open MPI 3.1.0
============================================================================
# TOTAL: 4
# PASS: 4
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
r***@open-mpi.org
2018-06-17 17:05:56 UTC
Add --enable-debug to your OMPI configure cmd line, and then add --mca plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon isn’t starting - this will give you some info as to why.
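
Concretely, that amounts to something like the following (the log file
name here is just an example; the other configure flags are the ones
already shown above):

$ ./configure --enable-debug [ ...same flags as above... ]
$ make && make install
$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee plm_debug.log
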
Post by Bennet Fauber
[ quoted text elided ]
Bennet Fauber
2018-06-17 21:51:46 UTC
I rebuilt with --enable-debug, then ran with

[***@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 158
salloc: job 158 queued and waiting for resources
salloc: job 158 has been allocated resources
salloc: Granted job allocation 158

[***@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.426759
The sum = 0.866386
Elapsed time is: 5.424068
The sum = 0.866386
Elapsed time is: 5.426195
The sum = 0.866386
Elapsed time is: 5.426059
The sum = 0.866386
Elapsed time is: 5.423192
The sum = 0.866386
Elapsed time is: 5.426252
The sum = 0.866386
Elapsed time is: 5.425444
The sum = 0.866386
Elapsed time is: 5.423647
The sum = 0.866386
Elapsed time is: 5.426082
The sum = 0.866386
Elapsed time is: 5.425936
The sum = 0.866386
Elapsed time is: 5.423964
Total time is: 59.677830

[***@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
2>&1 | tee debug2.log

The zipped debug log should be attached.

I did that after using systemctl to turn off the firewall on the login
node from which mpirun is executed, as well as on the host on which
the job runs.

[***@cavium-hpc ~]$ mpirun hostname
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

[***@cavium-hpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
158 standard bash bennet R 14:30 1 cav01
[***@cavium-hpc ~]$ srun hostname
cav01.arc-ts.umich.edu
[ repeated 23 more times ]

As always, your help is much appreciated,

-- bennet
Post by r***@open-mpi.org
[ quoted text elided ]
r***@open-mpi.org
2018-06-18 14:55:13 UTC
Hmmm...well, the error has changed from your initial report. Turning off the firewall was the solution to that problem.

This problem is different - it isn’t the orted that failed in the log you sent, but the application proc that couldn’t initialize. It looks like that app was compiled against some earlier version of OMPI? It is looking for something that no longer exists. I saw that you compiled it with a simple “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My guess is that your compile picked up some older version of OMPI on the system.
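
A quick, generic way to check which libmpi the binary actually resolves
at run time, and to rebuild it against this install's wrapper compiler
(assuming ldd is available and test_mpi.c is the source mentioned earlier):

$ ldd ./test_mpi | grep -i mpi       # shows which libmpi (if any) the loader picks up
$ mpicc -o test_mpi test_mpi.c -lm   # rebuild with the Open MPI wrapper compiler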

Ralph
Post by Bennet Fauber
[ quoted text elided ]
Bennet Fauber
2018-06-18 19:57:05 UTC
Permalink
To eliminate possibilities, I removed all other versions of OpenMPI
from the system, and rebuilt using the same build script as was used
to generate the prior report.

[***@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
Checking compilers and things
OMPI is ompi
COMP_NAME is gcc_7_1_0
SRC_ROOT is /sw/arcts/centos7/src
PREFIX_ROOT is /sw/arcts/centos7
PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
CONFIGURE_FLAGS are
COMPILERS are CC=gcc CXX=g++ FC=gfortran

Currently Loaded Modules:
1) gcc/7.1.0

gcc (ARM-build-14) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the following configure command

./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen
--enable-debug CC=gcc CXX=g++ FC=gfortran

The tarball is

2e783873f6b206aa71f745762fa15da5
/sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz

I still get

[***@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 165
salloc: job 165 queued and waiting for resources
salloc: job 165 has been allocated resources
salloc: Granted job allocation 165
[***@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.425549
The sum = 0.866386
Elapsed time is: 5.422826
The sum = 0.866386
Elapsed time is: 5.427676
The sum = 0.866386
Elapsed time is: 5.424928
The sum = 0.866386
Elapsed time is: 5.422060
The sum = 0.866386
Elapsed time is: 5.425431
The sum = 0.866386
Elapsed time is: 5.424350
The sum = 0.866386
Elapsed time is: 5.423037
The sum = 0.866386
Elapsed time is: 5.427727
The sum = 0.866386
Elapsed time is: 5.424922
The sum = 0.866386
Elapsed time is: 5.424279
Total time is: 59.672992

[***@cavium-hpc ~]$ mpirun ./test_mpi
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I reran with

[***@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
2>&1 | tee debug3.log

and the gzipped log is attached.

I then tried it with a different test program, which emits the error
[cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
[cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

store DAEMON URI failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS


At one point, I am almost certain that OMPI mpirun did work, and I am
at a loss to explain why it no longer does.
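
For reference, a quick way to confirm which mpirun is on the PATH, what
PMIx support that build reports, and which MPI plugin types the installed
Slurm offers (generic checks, nothing specific to this install):

$ which mpirun
$ ompi_info | grep -i pmix
$ srun --mpi=list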

I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0,
and we'll try downgrading SLURM to a prior version.

-- bennet
Post by r***@open-mpi.org
[ quoted text elided ]
r***@open-mpi.org
2018-06-18 20:04:38 UTC
I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to your mpirun cmd line and see if that works.
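
That is, something like

$ mpirun --mca plm rsh ./test_mpi
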
Post by Bennet Fauber
[ quoted text elided ]
Bennet Fauber
2018-06-18 20:21:42 UTC
Permalink
No such luck. If it matters, mpirun does seem to work for processes
on the local node that contain no MPI code. That is,

[***@cavium-hpc ~]$ mpirun -np 4 hello
Hello, ARM
Hello, ARM
Hello, ARM
Hello, ARM

but it fails with a similar error if run while a SLURM job is active; i.e.,

[***@cavium-hpc ~]$ mpirun hello
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[26589,0],0] on node cavium-hpc
Remote daemon: [[26589,0],1] on node cav01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

That makes sense, I guess.

I'll keep you posted as to what happens with 3.0.0 and with downgrading SLURM.

Thanks, -- bennet
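
Since hello contains no MPI calls, the same distinction can be probed inside an allocation with any plain executable: if even a non-MPI launch onto the remote node fails, the problem is in starting the remote orted rather than in MPI initialization. A minimal sketch, reusing the allocation shown earlier (exact command form illustrative):

$ salloc -N 1 --ntasks-per-node=24
$ mpirun -np 2 hostname    # exercises only the remote orted launch and wire-up, no MPI_Init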
Post by r***@open-mpi.org
I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to your mpirun cmd line and see if that works.
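For reference, the suggested invocation would look roughly like this, run inside the existing salloc session (binary name taken from the earlier tests; option placement is flexible):

$ mpirun --mca plm rsh ./test_mpi

Selecting the rsh/ssh launcher bypasses the Slurm-based launch path, which helps isolate whether the Slurm/PMIx side is at fault; note it does assume passwordless ssh to the compute node.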
Post by Bennet Fauber
To eliminate possibilities, I removed all other versions of OpenMPI
from the system, and rebuilt using the same build script as was used
to generate the prior report.
Checking compilers and things
OMPI is ompi
COMP_NAME is gcc_7_1_0
SRC_ROOT is /sw/arcts/centos7/src
PREFIX_ROOT is /sw/arcts/centos7
PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
CONFIGURE_FLAGS are
COMPILERS are CC=gcc CXX=g++ FC=gfortran
1) gcc/7.1.0
gcc (ARM-build-14) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Using the following configure command
./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen
--enable-debug CC=gcc CXX=g++ FC=gfortran
The tarball's MD5 checksum and path are
2e783873f6b206aa71f745762fa15da5
/sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
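A quick way to confirm that checksum against the tarball actually used for the build (md5sum is standard coreutils; path and expected value are the ones given above):

$ md5sum /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
2e783873f6b206aa71f745762fa15da5  /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz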
I still get
salloc: Pending job allocation 165
salloc: job 165 queued and waiting for resources
salloc: job 165 has been allocated resources
salloc: Granted job allocation 165
The sum = 0.866386
Elapsed time is: 5.425549
The sum = 0.866386
Elapsed time is: 5.422826
The sum = 0.866386
Elapsed time is: 5.427676
The sum = 0.866386
Elapsed time is: 5.424928
The sum = 0.866386
Elapsed time is: 5.422060
The sum = 0.866386
Elapsed time is: 5.425431
The sum = 0.866386
Elapsed time is: 5.424350
The sum = 0.866386
Elapsed time is: 5.423037
The sum = 0.866386
Elapsed time is: 5.427727
The sum = 0.866386
Elapsed time is: 5.424922
The sum = 0.866386
Elapsed time is: 5.424279
Total time is: 59.672992
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
I reran with
2>&1 | tee debug3.log
and the gzipped log is attached.
I thought to try it with a different test program, which emits the error
[cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
[cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
store DAEMON URI failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
At one point, I am almost certain that OMPI mpirun did work, and I am
at a loss to explain why it no longer does.
I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0,
and we'll try downgrading SLURM to a prior version.
-- bennet
Post by r***@open-mpi.org
Hmmm...well, the error has changed from your initial report. Turning off the firewall was the solution to that problem.
This problem is different - it isn’t the orted that failed in the log you sent, but the application proc that couldn’t initialize. It looks like that app was compiled against some earlier version of OMPI? It is looking for something that no longer exists. I saw that you compiled it with a simple “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My guess is that your compile picked up some older version of OMPI on the system.
Ralph
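A minimal sketch of the rebuild Ralph is suggesting, using the wrapper compiler and then checking which libmpi the binary actually resolves (the grep pattern and expected prefix are illustrative):

$ mpicc -o test_mpi test_mpi.c -lm
$ ldd ./test_mpi | grep libmpi    # should resolve to the intended install prefix, e.g. .../openmpi/3.1.0-bd/lib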
Post by Bennet Fauber
I rebuilt with --enable-debug, then ran with
salloc: Pending job allocation 158
salloc: job 158 queued and waiting for resources
salloc: job 158 has been allocated resources
salloc: Granted job allocation 158
The sum = 0.866386
Elapsed time is: 5.426759
The sum = 0.866386
Elapsed time is: 5.424068
The sum = 0.866386
Elapsed time is: 5.426195
The sum = 0.866386
Elapsed time is: 5.426059
The sum = 0.866386
Elapsed time is: 5.423192
The sum = 0.866386
Elapsed time is: 5.426252
The sum = 0.866386
Elapsed time is: 5.425444
The sum = 0.866386
Elapsed time is: 5.423647
The sum = 0.866386
Elapsed time is: 5.426082
The sum = 0.866386
Elapsed time is: 5.425936
The sum = 0.866386
Elapsed time is: 5.423964
Total time is: 59.677830
2>&1 | tee debug2.log
The zipped debug log should be attached.
I did that after using systemctl to turn off the firewall on the login
node from which the mpirun is executed, as well as on the host on
which it runs.
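On CentOS 7 that is typically done via firewalld; a sketch, assuming firewalld is the active firewall service on both hosts:

$ sudo systemctl stop firewalld     # on the login node and on cav01
$ systemctl status firewalld        # should report inactive (dead)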
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  158  standard     bash   bennet  R      14:30      1 cav01
cav01.arc-ts.umich.edu
[ repeated 23 more times ]
As always, your help is much appreciated,
-- bennet
Post by r***@open-mpi.org
Add --enable-debug to your OMPI configure cmd line, and then add --mca plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon isn’t starting - this will give you some info as to why.
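Combined with the debug build, the full invocation Ralph describes would look something like this (the log file name is illustrative):

$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee plm_debug.log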
Post by Bennet Fauber
[ original message and make check output elided ]
Bennet Fauber
2018-06-18 20:27:51 UTC
Permalink
If it's of any use, 3.0.0 seems to hang at

Making check in class
make[2]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make ompi_rb_tree opal_bitmap opal_hash_table opal_proc_table
opal_tree opal_list opal_value_array opal_pointer_array opal_lifo
opal_fifo
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[3]: `ompi_rb_tree' is up to date.
make[3]: `opal_bitmap' is up to date.
make[3]: `opal_hash_table' is up to date.
make[3]: `opal_proc_table' is up to date.
make[3]: `opal_tree' is up to date.
make[3]: `opal_list' is up to date.
make[3]: `opal_value_array' is up to date.
make[3]: `opal_pointer_array' is up to date.
make[3]: `opal_lifo' is up to date.
make[3]: `opal_fifo' is up to date.
make[3]: Leaving directory `/tmp/build/openmpi-3.0.0/test/class'
make check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[4]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'

I have to interrupt it; it has been running for many minutes, and these
tests do not usually behave this way.

-- bennet
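
One way to narrow down which class test is wedging would be to run the individual test binaries by hand with a time limit (a sketch only; timeout is GNU coreutils, the build-tree path is the one shown above, and the lifo/fifo tests are singled out here simply because they exercise the atomics):

$ cd /tmp/build/openmpi-3.0.0/test/class
$ for t in opal_lifo opal_fifo; do timeout 60 ./$t; echo "$t exit: $?"; done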
Post by Bennet Fauber
[ quoted text from earlier in the thread and mailing list footers elided ]
r***@open-mpi.org
2018-06-18 20:31:42 UTC
Permalink
This is on an ARM processor? I suspect that is the root of the problem, as we aren’t seeing anything like this elsewhere.
Post by Bennet Fauber
[ quoted text from earlier in the thread and mailing list footers elided ]
Ryan Novosielski
2018-06-18 20:42:37 UTC
Permalink
What MPI is SLURM set to use, and how was it compiled? Out of the box, the SLURM MPI default is set to “none” (or it was, last I checked), and so srun isn't necessarily doing an MPI-aware launch. Now, I did try this with OpenMPI 2.1.1 and it looked right either way (OpenMPI built with “--with-pmi"), but for MVAPICH2 this definitely made a difference:

[***@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
[slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
srun: error: slepner032: task 10: Bus error

[***@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
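
For completeness, the configured default and the MPI plugins Slurm actually offers can be checked directly; both commands are standard Slurm, and the output will vary by site:

$ srun --mpi=list
$ scontrol show config | grep -i MpiDefault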
Post by Bennet Fauber
[ quoted text from earlier in the thread and mailing list footers elided ]
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - ***@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
Bennet Fauber
2018-06-19 01:02:18 UTC
Permalink
Ryan,

With srun it's fine. The problem shows up only with mpirun, both on a
single node and across multiple nodes. SLURM was built against pmix
2.0.2, and slurm.conf confirms that pmix is SLURM's default MPI plugin
(see below). We are running a recent SLURM patch release, 17.11.7.
SLURM and OMPI are both built against the same installation of pmix.

[***@cavium-hpc etc]$ srun --version
slurm 17.11.7

[***@cavium-hpc etc]$ grep pmi slurm.conf
MpiDefault=pmix

[***@cavium-hpc pmix]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix
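
A related cross-check is to confirm that SLURM's pmix plugin and this
OMPI install both resolve to the same libpmix at runtime; a sketch
(the SLURM plugin path is a guess for this install and may differ):

[***@cavium-hpc ~]$ ldd /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/bin/orted | grep -i pmix
[***@cavium-hpc ~]$ ldd /opt/slurm/lib64/slurm/mpi_pmix_v2.so | grep -i pmix
[***@cavium-hpc ~]$ ompi_info | grep -i pmix

If the two stacks really do share one pmix, both ldd lines should
resolve to /opt/pmix/2.0.2/lib.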

As I mentioned, I am fairly sure I had this working with both mpirun
and srun at one point, but I have not been able to reproduce that a
second time.
Post by Ryan Novosielski
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
[slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
srun: error: slepner032: task 10: Bus error
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
[ earlier messages in the thread, quoted in full, elided ]
Bennet Fauber
2018-06-19 04:51:41 UTC
Permalink
Well, this is kind of interesting. I can strip the configure line
back and get mpirun to work on one node, but then neither srun nor
mpirun will run from within a SLURM job. I can add configure options
back until I get to

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-slurm

and the situation does not seem to change. Then I add libevent,

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-libevent=external \
--with-slurm

and it works again with srun but fails to run the binary with mpirun.

It is late, and I am baffled.
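
One way to see what actually changes between those two builds at
runtime; a sketch, using the ${PREFIX} from the configure lines above
(which installed library carries the dependency can vary):

[***@cavium-hpc ~]$ ldd ${PREFIX}/lib/libopen-pal.so | grep -i -e event -e pmix

With --with-libevent=external a system libevent should show up there;
with the internal copy it will not, while the external libpmix is
itself built against a libevent, so it is worth checking that
everything resolves to the same libevent.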
[ earlier messages in the thread, quoted in full, elided ]
Nathan Hjelm
2018-06-18 20:43:43 UTC
Permalink
I have mpirun on master working on an aarch64 system with slurm. Will take a look and see if I had to do anything special to get it working.

-Nathan

On Jun 18, 2018, at 02:40 PM, "***@open-mpi.org" <***@open-mpi.org> wrote:

This is on an ARM processor? I suspect that is the root of the problems as we aren’t seeing anything like this elsewhere.


On Jun 18, 2018, at 1:27 PM, Bennet Fauber <***@umich.edu> wrote:

If it's of any use, 3.0.0 seems to hang at

Making check in class
make[2]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make ompi_rb_tree opal_bitmap opal_hash_table opal_proc_table
opal_tree opal_list opal_value_array opal_pointer_array opal_lifo
opal_fifo
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[3]: `ompi_rb_tree' is up to date.
make[3]: `opal_bitmap' is up to date.
make[3]: `opal_hash_table' is up to date.
make[3]: `opal_proc_table' is up to date.
make[3]: `opal_tree' is up to date.
make[3]: `opal_list' is up to date.
make[3]: `opal_value_array' is up to date.
make[3]: `opal_pointer_array' is up to date.
make[3]: `opal_lifo' is up to date.
make[3]: `opal_fifo' is up to date.
make[3]: Leaving directory `/tmp/build/openmpi-3.0.0/test/class'
make check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'
make[4]: Entering directory `/tmp/build/openmpi-3.0.0/test/class'

I have to interrupt it; it has been running for many minutes, and
these tests have not usually behaved this way.

-- bennet
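
A sketch of one way to see where such a hang sits (it assumes gdb is
available on the build host; <PID> is whatever process ps reports):

[***@cavium-hpc ~]$ ps -ef | grep -e opal_ -e ompi_rb_tree     # find the hung test binary, if any
[***@cavium-hpc ~]$ gdb -p <PID>     # then: thread apply all bt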

On Mon, Jun 18, 2018 at 4:21 PM Bennet Fauber <***@umich.edu> wrote:

No such luck. If it matters, mpirun does seem to work on the local
node for programs that contain no MPI code. That is,

[***@cavium-hpc ~]$ mpirun -np 4 hello
Hello, ARM
Hello, ARM
Hello, ARM
Hello, ARM

but it fails with a similar error if run while a SLURM job is active; i.e.,

[***@cavium-hpc ~]$ mpirun hello
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[26589,0],0] on node cavium-hpc
Remote daemon: [[26589,0],1] on node cav01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

That makes sense, I guess.

I'll keep you posted as to what happens with 3.0.0 and downgrading SLURM.


Thanks, -- bennet
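
If this is an interface-selection problem between the login node and
cav01, one way to narrow it down is to pin the interfaces mpirun may
use; a sketch (eth0 is a placeholder; substitute an interface that
exists on both nodes):

[***@cavium-hpc ~]$ ip addr     # compare with the same command run on cav01
[***@cavium-hpc ~]$ mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ./test_mpi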


On Mon, Jun 18, 2018 at 4:05 PM ***@open-mpi.org <***@open-mpi.org> wrote:

I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to your mpirun cmd line and see if that works.
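
That is, something like the following, which makes mpirun start its
orted daemons over ssh instead of through the SLURM launcher (it
assumes passwordless ssh from the login node to cav01):

[***@cavium-hpc ~]$ mpirun --mca plm rsh ./test_mpi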


On Jun 18, 2018, at 12:57 PM, Bennet Fauber <***@umich.edu> wrote:

To eliminate possibilities, I removed all other versions of OpenMPI
from the system, and rebuilt using the same build script as was used
to generate the prior report.

[***@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
Checking compilers and things
OMPI is ompi
COMP_NAME is gcc_7_1_0
SRC_ROOT is /sw/arcts/centos7/src
PREFIX_ROOT is /sw/arcts/centos7
PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
CONFIGURE_FLAGS are
COMPILERS are CC=gcc CXX=g++ FC=gfortran

Currently Loaded Modules:
1) gcc/7.1.0

gcc (ARM-build-14) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the following configure command

./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen
--enable-debug CC=gcc CXX=g++ FC=gfortran

The tar ball is

2e783873f6b206aa71f745762fa15da5
/sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz

I still get

[***@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 165
salloc: job 165 queued and waiting for resources
salloc: job 165 has been allocated resources
salloc: Granted job allocation 165
[***@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.425549
The sum = 0.866386
Elapsed time is: 5.422826
The sum = 0.866386
Elapsed time is: 5.427676
The sum = 0.866386
Elapsed time is: 5.424928
The sum = 0.866386
Elapsed time is: 5.422060
The sum = 0.866386
Elapsed time is: 5.425431
The sum = 0.866386
Elapsed time is: 5.424350
The sum = 0.866386
Elapsed time is: 5.423037
The sum = 0.866386
Elapsed time is: 5.427727
The sum = 0.866386
Elapsed time is: 5.424922
The sum = 0.866386
Elapsed time is: 5.424279
Total time is: 59.672992

[***@cavium-hpc ~]$ mpirun ./test_mpi
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I reran with

[***@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
2>&1 | tee debug3.log

and the gzipped log is attached.

I also tried it with a different test program, which produces the error
[cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
[cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

store DAEMON URI failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS


At one point, I am almost certain that OMPI mpirun did work, and I am
at a loss to explain why it no longer does.

I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0,
and we'll try downgrading SLURM to a prior version.

-- bennet
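
One more sanity check for leftovers from older installs that could
still be picked up at run time (a sketch; adjust the patterns as
needed):

[***@cavium-hpc ~]$ which -a mpirun mpicc
[***@cavium-hpc ~]$ for d in ${LD_LIBRARY_PATH//:/ }; do ls "$d"/libmpi.so* 2>/dev/null; done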


On Mon, Jun 18, 2018 at 10:56 AM ***@open-mpi.org
<***@open-mpi.org> wrote:

Hmmm...well, the error has changed from your initial report. Turning off the firewall was the solution to that problem.

This problem is different - it isn’t the orted that failed in the log you sent, but the application proc that couldn’t initialize. It looks like that app was compiled against some earlier version of OMPI? It is looking for something that no longer exists. I saw that you compiled it with a simple “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My guess is that your compile picked up some older version of OMPI on the system.

Ralph
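
For reference, acting on that suggestion would look roughly like this
(a sketch, assuming test_mpi.c is the same source used earlier in the
thread): rebuild with the wrapper and check which libmpi the binary
actually resolves to.

[***@cavium-hpc ~]$ mpicc -o test_mpi test_mpi.c -lm
[***@cavium-hpc ~]$ ldd ./test_mpi | grep -i -e libmpi -e pmix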


On Jun 17, 2018, at 2:51 PM, Bennet Fauber <***@umich.edu> wrote:

I rebuilt with --enable-debug, then ran with

[***@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 158
salloc: job 158 queued and waiting for resources
salloc: job 158 has been allocated resources
salloc: Granted job allocation 158

[***@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.426759
The sum = 0.866386
Elapsed time is: 5.424068
The sum = 0.866386
Elapsed time is: 5.426195
The sum = 0.866386
Elapsed time is: 5.426059
The sum = 0.866386
Elapsed time is: 5.423192
The sum = 0.866386
Elapsed time is: 5.426252
The sum = 0.866386
Elapsed time is: 5.425444
The sum = 0.866386
Elapsed time is: 5.423647
The sum = 0.866386
Elapsed time is: 5.426082
The sum = 0.866386
Elapsed time is: 5.425936
The sum = 0.866386
Elapsed time is: 5.423964
Total time is: 59.677830

[***@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
2>&1 | tee debug2.log

The zipped debug log should be attached.

I did that after using systemctl to turn off the firewall on the login
node from which the mpirun is executed, as well as on the host on
which it runs.

[***@cavium-hpc ~]$ mpirun hostname
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

[***@cavium-hpc ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
158 standard bash bennet R 14:30 1 cav01
[***@cavium-hpc ~]$ srun hostname
cav01.arc-ts.umich.edu
[ repeated 23 more times ]

As always, your help is much appreciated,

-- bennet

On Sun, Jun 17, 2018 at 1:06 PM ***@open-mpi.org <***@open-mpi.org> wrote:

Add --enable-debug to your OMPI configure cmd line, and then add --mca plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon isn’t starting - this will give you some info as to why.


On Jun 17, 2018, at 9:07 AM, Bennet Fauber <***@umich.edu> wrote:
[ original message and `make check` output quoted in full; elided ]