Discussion:
[OMPI users] Problem with MPI jobs terminating when using OMPI 3.0.x
Andy Riebs
2017-10-27 20:24:21 UTC
We have built a version of Open MPI 3.0.x that works with Slurm (our
primary use case), but it fails when executed without Slurm.

If I srun an MPI "hello world" program, it works just fine. Likewise, if
I salloc a couple of nodes and use mpirun from there, life is good. But
if I just try to mpirun the program without Slurm support, the program
appears to run to completion, and then segv's. A bit of good news is
that this can be reproduced with a single process.
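(The mpi_hello source is not included in the thread; a minimal equivalent against the standard MPI C API, matching the "Hello world! I'm 0 of 1 on node04" output below, would be something like:)

```c
/* mpi_hello.c -- assumed reconstruction of the reproducer; the actual
 * source is not shown anywhere in this thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world! I'm %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
```

Note the segfault below occurs after the inferior exits normally, which is consistent with the crash being outside the application itself.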

Sample output and configuration information below:

[tests]$ cat gdb.cmd
set follow-fork-mode child
r
[tests]$ mpirun -host node04 -np 1 gdb -x gdb.cmd ./mpi_hello
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/riebs/tests/mpi_hello...(no debugging symbols
found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff4be8700 (LWP 21386)]
[New Thread 0x7ffff3f70700 (LWP 21387)]
[New Thread 0x7fffeacac700 (LWP 21393)]
[Thread 0x7fffeacac700 (LWP 21393) exited]
[New Thread 0x7fffeacac700 (LWP 21394)]
Hello world! I'm 0 of 1 on node04
[Thread 0x7fffeacac700 (LWP 21394) exited]
[Thread 0x7ffff3f70700 (LWP 21387) exited]
[Thread 0x7ffff4be8700 (LWP 21386) exited]
[Inferior 1 (process 21382) exited normally]
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-157.el7.x86_64 libevent-2.0.21-4.el7.x86_64
libgcc-4.8.5-11.el7.x86_64 libibcm-1.0.5mlnx2-OFED.3.4.0.0.4.34100.x86_64
libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.x86_64
libibverbs-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64
libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34218.x86_64
libmlx5-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64 libnl-1.1.4-3.el7.x86_64
librdmacm-1.1.0mlnx-OFED.3.4.0.0.4.34218.x86_64
libtool-ltdl-2.4.2-21.el7_2.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64
opensm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34218.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) q
[node04:21373] *** Process received signal ***
[node04:21373] Signal: Segmentation fault (11)
[node04:21373] Signal code:  (128)
[node04:21373] Failing at address: (nil)
[node04:21373] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7ffff60c4370]
[node04:21373] [ 1]
/opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x3a04b)[0x7ffff365104b]
[node04:21373] [ 2]
/lib64/libevent-2.0.so.5(event_base_loop+0x774)[0x7ffff64e4a14]
[node04:21373] [ 3]
/opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x285cd)[0x7ffff363f5cd]
[node04:21373] [ 4] /lib64/libpthread.so.0(+0x7dc5)[0x7ffff60bcdc5]
[node04:21373] [ 5] /lib64/libc.so.6(clone+0x6d)[0x7ffff5deb73d]
[node04:21373] *** End of error message ***
bash: line 1: 21373 Segmentation fault
/opt/local/shmem/3.0.x.4ca1c4d/bin/orted -mca ess "env"
-mca ess_base_jobid "399966208" -mca ess_base_vpid 1
-mca ess_base_num_procs "2"
-mca orte_node_regex "node[2:73],node[4:0]***@0(2)"
-mca orte_hnp_uri "399966208.0;tcp://16.95.253.128,10.4.0.6:52307"
-mca plm "rsh" -mca coll_tuned_use_dynamic_rules "1" -mca scoll "^mpi"
-mca pml "ucx" -mca coll_tuned_allgatherv_algorithm "2" -mca atomic "ucx"
-mca sshmem "mmap" -mca spml_ucx_heap_reg_nb "1"
-mca coll_tuned_allgather_algorithm "2" -mca spml "ucx" -mca coll "^hcoll"
-mca pmix "^s1,s2,cray,isolated"

[tests]$ env | grep -E -e MPI -e UCX -e SLURM | sort
OMPI_MCA_atomic=ucx
OMPI_MCA_coll=^hcoll
OMPI_MCA_coll_tuned_allgather_algorithm=2
OMPI_MCA_coll_tuned_allgatherv_algorithm=2
OMPI_MCA_coll_tuned_use_dynamic_rules=1
OMPI_MCA_pml=ucx
OMPI_MCA_scoll=^mpi
OMPI_MCA_spml=ucx
OMPI_MCA_spml_ucx_heap_reg_nb=1
OMPI_MCA_sshmem=mmap
OPENMPI_PATH=/opt/local/shmem/3.0.x.4ca1c4d
OPENMPI_VER=3.0.x.4ca1c4d
SLURM_DISTRIBUTION=block:block
SLURM_HINT=nomultithread
SLURM_SRUN_REDUCE_TASK_EXIT=1
SLURM_TEST_EXEC=1
SLURM_UNBUFFEREDIO=1
SLURM_VER=17.11.0-0pre2
UCX_TLS=dc_x
UCX_ZCOPY_THRESH=131072
[tests]$

OS: CentOS 7.3
HW: x86_64 (KNL)
OMPI version: 3.0.x.4ca1c4d
Configuration options:
        --prefix=/opt/local/shmem/3.0.x.4ca1c4d
        --with-hcoll=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/hcoll
        --with-hwloc=/opt/local/hwloc/1.11.4
        --with-knem=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/knem
        --with-libevent=/usr
        --with-mxm=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/mxm
        --with-platform=contrib/platform/mellanox/optimized
        --with-pmi=/opt/local/slurm/default
        --with-pmix=/opt/local/pmix/1.2.1
        --with-slurm=/opt/local/slurm/default
        --with-ucx=/opt/local/ucx/1.3.0

Thoughts?
Andy
--
Andy Riebs
***@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
r***@open-mpi.org
2017-10-27 20:31:49 UTC
Two questions:

1. are you running this on node04? Or do you have ssh access to node04?

2. I note you are building this against an old version of PMIx for some reason. Does it work okay if you build it with the embedded PMIx (which is 2.0)? Does it work okay if you use PMIx v1.2.4, the latest release in that series?
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Gilles Gouaillardet
2017-10-30 05:53:05 UTC
Andy,


The crash occurs in the orted daemon, not in the mpi_hello MPI app, so you
will not see anything useful in gdb.

You can use the attached launch agent script to get a stack trace of orted.
Your mpirun command line should be updated like this:

mpirun --mca orte_launch_agent /.../launch_agent.sh -host node04 -np 1 ./mpi_hello
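(The attached script is not preserved in this archive. A typical launch agent for this purpose, reconstructed here as an assumption, re-execs the real orted under gdb in batch mode so that a backtrace is printed if the daemon segfaults:)

```shell
#!/bin/sh
# launch_agent.sh -- hypothetical reconstruction of the attached script.
# mpirun invokes this in place of orted (via the orte_launch_agent MCA
# parameter); gdb runs the real orted and dumps a backtrace on crash.
exec gdb -batch -ex run -ex backtrace --args orted "$@"
```

This relies on orted being in PATH on the remote node; an absolute path (e.g. the /opt/local/shmem/3.0.x.4ca1c4d/bin/orted shown in the crash output) can be substituted.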



Cheers,

Gilles
Post by r***@open-mpi.org
1. are you running this on node04? Or do you have ssh access to node04?
2. I note you are building this against an old version of PMIx for some reason. Does it work okay if you build it with the embedded PMIx (which is 2.0)? Does it work okay if you use PMIx v1.2.4, the latest release in that series?
Andy Riebs
2017-10-31 19:39:37 UTC
As always, thanks for your help, Ralph!

Cutting over to PMIx 1.2.4 solved the problem for me. (Slurm wasn't
happy building with PMIx v2.)
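(For the record, the fix amounts to pointing the configure line at the newer PMIx install; the 1.2.4 path here is an assumption, mirroring the 1.2.1 layout shown earlier:)

```shell
# Reconfigure Open MPI against PMIx 1.2.4 instead of 1.2.1; the
# remaining --with-* options from the original build are unchanged.
./configure --prefix=/opt/local/shmem/3.0.x.4ca1c4d \
            --with-pmix=/opt/local/pmix/1.2.4 \
            ...   # other options as listed in the first message
```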

And yes, I had ssh access to node04.

(And Gilles, thanks for your note, as well.)

Andy
Post by r***@open-mpi.org
1. are you running this on node04? Or do you have ssh access to node04?
2. I note you are building this against an old version of PMIx for some reason. Does it work okay if you build it with the embedded PMIx (which is 2.0)? Does it work okay if you use PMIx v1.2.4, the latest release in that series?