Discussion: [OMPI users] OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2
Hammond, Simon David via users
2018-06-16 23:45:07 UTC
Hi OpenMPI Team,

We have recently updated an install of OpenMPI on a POWER9 system (configuration details below), migrating from OpenMPI 2.1 to OpenMPI 3.1. Code that ran correctly before now locks up and makes no progress, getting stuck in wait-all operations. While I think it's prudent for us to root-cause this a little more, I have gone back, rebuilt MPI, and re-run the "make check" tests. The opal_fifo test appears to hang forever. I am not sure whether this is the cause of our issue, but I wanted to report that we are seeing it on our system.

OpenMPI 3.1.0 Configuration:

./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java --with-lsf=/opt/lsf/10.1 --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs

GCC is version 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for POWER9 (standard download from their website). We use IBM's JDK 8.0.0 for the Java bindings.
OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)

Output:

make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo
<runs forever>

Output from Top:

20 0 73280 4224 2560 S 800.0 0.0 17:22.94 lt-opal_fifo
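
For reference, the hanging test can be re-run on its own from the build tree along these lines (a rough sketch, assuming the standard Automake test harness that "make check" drives):

# Re-run only the opal_fifo unit test
cd /home/sdhammo/openmpi/openmpi-3.1.0/test/class
make check TESTS=opal_fifo
# or invoke the libtool wrapper script directly
./opal_fifo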

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]
Hammond, Simon David via users
2018-06-16 23:48:47 UTC
The output from the test in question is:

Single thread test. Time: 0 s 10182 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
<then runs forever>

S.

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]


Nathan Hjelm
2018-06-17 04:09:56 UTC
Try the latest nightly tarball for v3.1.x. Should be fixed.
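
Roughly (a sketch; this assumes the v3.1.x snapshots are published under https://www.open-mpi.org/nightly/v3.1.x/, and the exact tarball name carries a date and git hash, so check the listing first):

# List the available v3.1.x nightly snapshots, then rebuild and re-run the class tests
# (<snapshot> below is a placeholder for the actual tarball name shown in the listing)
curl -s https://www.open-mpi.org/nightly/v3.1.x/
tar xjf openmpi-v3.1.x-<snapshot>.tar.bz2
cd openmpi-v3.1.x-<snapshot>
./configure <same options as the 3.1.0 build> && make -j && make check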
Hammond, Simon David via users
2018-06-30 19:18:13 UTC
Nathan,

Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA 9.2.

S.

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]


Jeff Squyres (jsquyres) via users
2018-07-02 14:48:42 UTC
Simon --

You don't currently have another Open MPI installation in your PATH / LD_LIBRARY_PATH, do you?

I have seen dependency library loading cause "make check" to get confused: instead of loading the libraries from the build tree, it actually loads some -- but not all -- of the required OMPI/ORTE/OPAL/etc. libraries from an installation tree. Hilarity ensues (including symptoms such as running forever).

Can you double check that you have no Open MPI libraries in your LD_LIBRARY_PATH before running "make check" on the build tree?
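
Something like this should show whether anything is being picked up from an install tree (a sketch; the lt-opal_fifo path below assumes the usual libtool layout under the build tree):

# Any installed Open MPI entries on the library path?
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -iE 'openmpi|ompi'
# Does the hung test binary resolve libopen-pal from the build tree or from an install tree?
ldd test/class/.libs/lt-opal_fifo | grep -E 'libopen-pal|libopen-rte|libmpi'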
--
Jeff Squyres
***@cisco.com
Howard Pritchard
2018-07-02 19:27:16 UTC
Hi Si,

Could you add --disable-builtin-atomics to the configure options and see if the hang goes away?
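
Concretely, that would be the same configure line as before with the extra flag appended (the flag switches OPAL from the compiler's __atomic/__sync builtins back to its own hand-written atomics), along the lines of:

# Reconfigure with built-in atomics disabled, then rebuild and re-run the tests
./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88 \
  --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java \
  --with-lsf=/opt/lsf/10.1 --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib \
  --with-verbs --disable-builtin-atomics
make -j && make check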

Howard


Hammond, Simon David via users
2018-07-02 20:19:25 UTC
Howard,

This fixed the issue with OpenMPI 3.1.0. Do you want me to try the same with 3.1.1 as well?

S.
--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA


Nathan Hjelm
2018-07-02 20:32:11 UTC
The result should be the same with v3.1.1. I will investigate on our Coral test systems.

-Nathan

Nathan Hjelm via users
2018-07-03 21:47:09 UTC
Found the issue. PR #5374 fixes it and will make its way into the v3.0.x and v3.1.x release series.

-Nathan
