Discussion:
[OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Siegmar Gross
2017-01-07 09:30:24 UTC
Hi,

I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
I still get the same error that I reported for rc2.

I would be grateful if somebody could fix the problem before
the final version is released. Thank you very much in advance
for any help.


Kind regards

Siegmar
Howard Pritchard
2017-01-08 16:02:56 UTC
Hi Siegmar,

Could you post the configure options you used when building 2.0.2rc3?
Maybe that will help in trying to reproduce the segfault you are observing.

Howard


_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Siegmar Gross
2017-01-09 07:59:00 UTC
Hi Howard,

I use the following commands to build and install the package.
${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
Linux machine.

mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-2.0.2rc3/configure \
  --prefix=/usr/local/openmpi-2.0.2_64_cc \
  --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
  JAVA_HOME=/usr/local/jdk1.8.0_66 \
  LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
  CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
  CPP="cpp" CXXCPP="cpp" \
  --enable-mpi-cxx \
  --enable-mpi-cxx-bindings \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --enable-heterogeneous \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-m64 -mt" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --with-wrapper-ldflags="-mt" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.0.2_64_cc.old
mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


I get a different error if I run the program with gdb.

loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
(gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff3b97700 (LWP 13582)]
[New Thread 0x7ffff18a4700 (LWP 13583)]
[New Thread 0x7ffff10a3700 (LWP 13584)]
[New Thread 0x7fffebbba700 (LWP 13585)]
Detaching after fork from child process 13586.

Parent process 0 running on loki
I create 4 slave processes

Detaching after fork from child process 13589.
Detaching after fork from child process 13590.
Detaching after fork from child process 13591.
[loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
[loki:13586] *** An error occurred in MPI_Comm_spawn
[loki:13586] *** reported by process [2873294849,0]
[loki:13586] *** on communicator MPI_COMM_WORLD
[loki:13586] *** MPI_ERR_UNKNOWN: unknown error
[loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[loki:13586] *** and potentially your MPI job)
[Thread 0x7fffebbba700 (LWP 13585) exited]
[Thread 0x7ffff10a3700 (LWP 13584) exited]
[Thread 0x7ffff18a4700 (LWP 13583) exited]
[Thread 0x7ffff3b97700 (LWP 13582) exited]
[Inferior 1 (process 13567) exited with code 016]
Missing separate debuginfos, use: zypper install libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
(gdb) bt
No stack.
(gdb)

Do you need anything else?


Kind regards

Siegmar
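[For readers following along: Siegmar's spawn_master source was not posted to the thread. A minimal sketch of a spawn test along these lines, matching the output seen in the gdb session above, might look like the following. The slave program name "spawn_slave" and the exact arguments are assumptions, not Siegmar's actual code; the OPAL timeout reported above is raised from inside the MPI_Comm_spawn call.]

```c
/* Hypothetical minimal reconstruction of a spawn_master-style test.
 * Build with: mpicc spawn_master.c -o spawn_master */
#include <stdio.h>
#include <mpi.h>

#define NUM_SLAVES 4            /* matches "I create 4 slave processes" */

int main(int argc, char *argv[])
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;
    int errcodes[NUM_SLAVES];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("Parent process %d running on %s\n", rank, name);
    printf("I create %d slave processes\n", NUM_SLAVES);

    /* The reported OPAL timeout in pmix_base_fns.c surfaces here,
     * and with MPI_ERRORS_ARE_FATAL the job then aborts. */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```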
Howard Pritchard
2017-01-09 16:23:03 UTC
Hi Siegmar,

You have some config parameters I wasn't using that may have an impact.
I'll give it a try with these parameters.

This should be enough info for now.

Thanks,

Howard


r***@open-mpi.org
2017-01-10 16:20:17 UTC
I think there is some relevant discussion here: https://github.com/open-mpi/ompi/issues/1569

It looks like Gilles had (at least at one point) a fix for master with --enable-heterogeneous, but I don’t know if that was committed.
Gilles Gouaillardet
2017-01-11 09:04:28 UTC
Siegmar,

I was able to reproduce the issue on my vm
(no need for a real heterogeneous cluster here).

I will keep digging tomorrow.
Note that if you specify an incorrect slot list, MPI_Comm_spawn fails with a very unfriendly error message.
Right now the 4th spawned task crashes, so this is a different issue.

Cheers,

Gilles
Siegmar Gross
2017-01-11 09:52:50 UTC
Hi Gilles,

thank you very much for your help. What does an incorrect slot
list mean? My machine has two 6-core processors, so I specified
"--slot-list 0:0-5,1:0-5". Does "incorrect" mean that it isn't
allowed to specify more slots than available, to specify fewer
slots than available, or to specify more slots than needed for
the processes?


Kind regards

Siegmar
Gilles Gouaillardet
2017-01-11 11:39:02 UTC
Siegmar,

Your slot list is correct.
An invalid slot list for your node would be 0:1-7,1:0-7.

/* and since the test requires only 5 tasks, that could even work with
such an invalid list. My vm is single socket with 4 cores, so a 0:0-4
slot list results in an unfriendly pmix error */

Bottom line: your test is correct, and there is a bug in v2.0.x that I
will investigate starting tomorrow.

Cheers,

Gilles
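[Editor's illustration of Gilles' point: a slot list like "0:0-5,1:0-5" names sockets and core ranges, and it is invalid when it references a core that does not exist on that socket. The following standalone Python sketch (not Open MPI's actual implementation, just the conceptual check) shows why Siegmar's list is valid on a two-socket, six-core-per-socket node, while 0:1-7,1:0-7 is not, and why 0:0-4 fails on Gilles' single-socket four-core vm.]

```python
def parse_slot_list(slot_list):
    """Parse an mpiexec-style slot list such as "0:0-5,1:0-5"
    into a dict mapping socket id -> set of core ids."""
    slots = {}
    for group in slot_list.split(","):
        socket_str, cores_str = group.split(":")
        lo, _, hi = cores_str.partition("-")
        cores = range(int(lo), int(hi or lo) + 1)
        slots.setdefault(int(socket_str), set()).update(cores)
    return slots

def slot_list_is_valid(slot_list, topology):
    """topology maps socket id -> number of cores on that socket."""
    for socket, cores in parse_slot_list(slot_list).items():
        if socket not in topology:
            return False                      # socket does not exist
        if any(c < 0 or c >= topology[socket] for c in cores):
            return False                      # core id out of range
    return True

two_socket_six_core = {0: 6, 1: 6}            # Siegmar's node "loki"
print(slot_list_is_valid("0:0-5,1:0-5", two_socket_six_core))  # valid
print(slot_list_is_valid("0:1-7,1:0-7", two_socket_six_core))  # cores 6-7 missing
print(slot_list_is_valid("0:0-4", {0: 4}))                     # core 4 missing on the vm
```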

On Wednesday, January 11, 2017, Siegmar Gross <
Post by Siegmar Gross
Hi Gilles,
thank you very much for your help. What does incorrect slot list
mean? My machine has two 6-core processors so that I specified
"--slot-list 0:0-5,1:0-5". Does incorrect mean that it isn't
allowed to specify more slots than available, to specify fewer
slots than available, or to specify more slots than needed for
the processes?
Kind regards
Siegmar
Post by Howard Pritchard
Siegmar,
I was able to reproduce the issue on my vm
(No need for a real heterogeneous cluster here)
I will keep digging tomorrow.
Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
with a very unfriendly error message.
Right now, the 4th spawn'ed task crashes, so this is a different issue
Cheers,
Gilles
https://github.com/open-mpi/ompi/issues/1569
It looks like Gilles had (at least at one point) a fix for master when
enable-heterogeneous, but I don’t know if that was committed.
Post by Howard Pritchard
HI Siegmar,
You have some config parameters I wasn't trying that may have some impact.
I'll give a try with these parameters.
This should be enough info for now,
Thanks,
Howard
Hi Howard,
I use the following commands to build and install the package.
${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
Linux machine.
mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
../openmpi-2.0.2rc3/configure \
--prefix=/usr/local/openmpi-2.0.2_64_cc \
--libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
CPP="cpp" CXXCPP="cpp" \
--enable-mpi-cxx \
--enable-mpi-cxx-bindings \
--enable-cxx-exceptions \
--enable-mpi-java \
--enable-heterogeneous \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.0.2_64_cc.old
mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
I get a different error if I run the program with gdb.
loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: <http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
(gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1
--host loki --slot-list 0:0-5,1:0-5 spawn_master
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.24-2.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff3b97700 (LWP 13582)]
[New Thread 0x7ffff18a4700 (LWP 13583)]
[New Thread 0x7ffff10a3700 (LWP 13584)]
[New Thread 0x7fffebbba700 (LWP 13585)]
Detaching after fork from child process 13586.
Parent process 0 running on loki
I create 4 slave processes
Detaching after fork from child process 13589.
Detaching after fork from child process 13590.
Detaching after fork from child process 13591.
[loki:13586] OPAL ERROR: Timeout in file
../../../../openmpi-2.0.2rc3/opal/mca/pmix/base/pmix_base_fns.c at line 193
[loki:13586] *** An error occurred in MPI_Comm_spawn
[loki:13586] *** reported by process [2873294849,0]
[loki:13586] *** on communicator MPI_COMM_WORLD
[loki:13586] *** MPI_ERR_UNKNOWN: unknown error
[loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[loki:13586] *** and potentially your MPI job)
[Thread 0x7fffebbba700 (LWP 13585) exited]
[Thread 0x7ffff10a3700 (LWP 13584) exited]
[Thread 0x7ffff18a4700 (LWP 13583) exited]
[Thread 0x7ffff3b97700 (LWP 13582) exited]
[Inferior 1 (process 13567) exited with code 016]
Missing separate debuginfos, use: zypper install
libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3.x86_64
(gdb) bt
No stack.
(gdb)
Do you need anything else?
Kind regards
Siegmar
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users