Hammond, Simon David via users
2018-06-16 23:45:07 UTC
Hi OpenMPI Team,
We have recently updated an install of OpenMPI on a POWER9 system (configuration details below), migrating from OpenMPI 2.1 to OpenMPI 3.1. Code that ran before now locks up and makes no progress, getting stuck in wait-all operations. While I think it's prudent for us to root-cause this a little more, I have gone back, rebuilt MPI, and re-run the "make check" tests. The opal_fifo test appears to hang forever. I am not sure if this is the cause of our issue, but wanted to report that we are seeing this on our system.
OpenMPI 3.1.0 Configuration:
./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java --with-lsf=/opt/lsf/10.1 --with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs
GCC is version 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.
RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Output:
make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo
<runs forever>
Output from Top:
20 0 73280 4224 2560 S 800.0 0.0 17:22.94 lt-opal_fifo
--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]