Discussion:
[OMPI users] Fwd: OpenMPI 3.1.0 on aarch64
r***@open-mpi.org
2018-06-07 12:56:10 UTC
Permalink
You didn’t show your srun direct launch cmd line or what version of Slurm is being used (and how it was configured), so I can only provide some advice. If you want to use PMIx, then you have to do two things:

1. Slurm must be configured to use PMIx - depending on the version, that might be there by default in the rpm

2. you have to tell srun to use the pmix plugin (IIRC you add --mpi=pmix to the cmd line - you should check that; a quick example is sketched below)

If your intent was to use Slurm’s PMI-1 or PMI-2, then you need to configure OMPI --with-pmi=<path-to-those-libraries>
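For example - just a sketch, since the exact plugin names depend on how your Slurm was built - you can ask srun which plugins it knows about and then select one explicitly:

$ srun --mpi=list
$ srun --mpi=pmix ./test_mpi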

Ralph
We are trying out MPI on an aarch64 cluster.
Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
I compiled OpenMPI with the ARM-distributed gcc/7.1.0, using the
configure flags shown in this snippet from the top of config.log:
It was created by Open MPI configure 3.1.0, which was
generated by GNU Autoconf 2.69. Invocation command line was
$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
## --------- ##
## Platform. ##
## --------- ##
hostname = cavium-hpc.arc-ts.umich.edu
uname -m = aarch64
uname -r = 4.11.0-45.4.1.el7a.aarch64
uname -s = Linux
uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
/usr/bin/uname -p = aarch64
It checks for PMIx and reports it found,
configure:12680: checking if user requested external PMIx
support(/opt/pmix/2.0.2)
configure:12690: result: yes
configure:12701: checking --with-external-pmix value
configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
configure:12794: checking PMIx version
configure:12804: result: version file found
It fails on the test for PMIx 3, which is expected, but then reports
configure:12843: checking version 2x
configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
configure:12861: $? = 0
configure:12862: result: found
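As an extra sanity check (not something configure needs, and assuming the usual PMIx header layout), the version the external install advertises can be read straight out of pmix_version.h:

$ grep -E 'PMIX_VERSION_(MAJOR|MINOR)' /opt/pmix/2.0.2/include/pmix_version.h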
I have a small test MPI program that I run, and it runs when launched
with mpirun. The processes running on the first node of a two-node
job are
bennet 20340 20282 0 08:04 ? 00:00:00 mpirun ./test_mpi
bennet 20346 20340 0 08:04 ? 00:00:00 srun
--ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
--nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
"3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
"3609657344.0;tcp://10.242.15.36:58681"
bennet 20347 20346 0 08:04 ? 00:00:00 srun
--ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
--nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
"3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
"3609657344.0;tcp://10.242.15.36:58681"
bennet 20352 20340 98 08:04 ? 00:01:50 ./test_mpi
bennet 20353 20340 98 08:04 ? 00:01:50 ./test_mpi
However, when I run it using srun directly, I get the following
srun: Step created for job 87
[cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file
pmix2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT completed
completed successfully, but am not able to aggregate error messages,
and not able to guarantee that all other processes were killed!
Using the same scheme to set this up on x86_64 worked, and I am taking
installation parameters, test files, and job parameters from the
working x86_64 installation.
Other than the architecture, the main difference between the two
clusters is that the aarch64 cluster has only Ethernet networking,
whereas there is InfiniBand on the x86_64 cluster. I removed
--with-verbs from the configure line, though, and I thought that
would be sufficient.
Does anyone have suggestions about what might be wrong, how to fix it,
or what further diagnostics to try?
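(One quick diagnostic, offered only as a suggestion: check whether PMIx support actually made it into this Open MPI build at all, for example

$ ompi_info | grep -i pmix

which should list whatever pmix MCA component(s) the build produced.)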
Thank you, -- bennet
Bennet Fauber
2018-06-07 14:41:30 UTC
Permalink
Thanks, Ralph,

I just tried it with

srun --mpi=pmix_v2 ./test_mpi

and got these messages


srun: Step created for job 89
[cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:92286] OPAL ERROR: Error in file
pmix2x_client.c at line 109
[cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:92287] OPAL ERROR: Error in file
pmix2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------


Just to be complete, I checked the library path,


$ ldconfig -p | egrep 'slurm|pmix'
libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so


and libpmi* does appear there.


I also tried explicitly listing the slurm plugin directory from the
Slurm installation in LD_LIBRARY_PATH, just in case it wasn't being
traversed correctly; that is, both

$ echo $LD_LIBRARY_PATH
/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib

and

$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib


I don't have a saved build log, but I can rebuild this and save the
build logs, in case any information in those logs would help.

I will also mention that we have, in the past, used the
--disable-dlopen and --enable-shared flags, which we did not use here.
Just in case that makes any difference.

-- bennet
Post by r***@open-mpi.org
I think you need to set your MPIDefault to pmix_v2 since you are using a PMIx v2 library
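As a sketch of what that suggestion would look like on this setup (the slurm.conf path is the one shown below in this thread; whether scontrol reconfigure is enough or the daemons need a restart depends on the site):

# in /opt/slurm/etc/slurm.conf
MpiDefault=pmix_v2
$ scontrol reconfigure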
Post by Bennet Fauber
Hi, Ralph,
Thanks for the reply, and sorry for the missing information. I hope
this fills in the picture better.
$ srun --version
slurm 17.11.7
$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix
We have pmix configured as the default in /opt/slurm/etc/slurm.conf
MpiDefault=pmix
and on the x86_64 system configured the same way, a bare 'srun
./test_mpi' is sufficient and runs.
I have tried all of the following srun variations with no joy
srun ./test_mpi
srun --mpi=pmix ./test_mpi
srun --mpi=pmi2 ./test_mpi
srun --mpi=openmpi ./test_mpi
I believe we are using the spec files that come with pmix and with
slurm, and the following commands to build the .rpm files used for
installation:
$ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
-ba pmix-2.0.2.spec
$ rpmbuild --define '_prefix /opt/slurm' \
--define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
-ta slurm-17.11.7.tar.bz2
I did use the '--with-pmix=/opt/pmix/2.0.2' option when building OpenMPI.
In case it helps, we have these libraries on the aarch64 in
/opt/slurm/lib64/slurm/mpi*
-rwxr-xr-x 1 root root 257288 May 30 15:27 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 257240 May 30 15:27 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 668808 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root 16 Jun 1 08:38 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 841312 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
and on the x86_64, where it runs, we have a comparable list,
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 622848 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root 16 Jun 1 08:32 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 828232 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
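One further diagnostic that might be worth running on both machines - offered only as a suggestion - is to confirm that the Slurm pmix plugin resolves against the same PMIx installation Open MPI was configured with:

$ ldd /opt/slurm/lib64/slurm/mpi_pmix_v2.so | grep -i pmix

If that does not point at /opt/pmix/2.0.2/lib/libpmix.so.2, the plugin and Open MPI would be talking to different PMIx libraries.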
Let me know if anything else would be helpful.
Thanks, -- bennet
r***@open-mpi.org
2018-06-07 15:05:30 UTC
Permalink
Odd - Artem, do you have any suggestions?
Bennet Fauber
2018-06-07 19:45:13 UTC
Permalink
I rebuilt and examined the logs more closely. There was a warning
about a failure with the external hwloc, and that led me to discover
that the CentOS hwloc-devel package was not installed.
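In case it is useful to anyone else, the repair is basically just installing the devel package and re-running configure with the same flags as before; roughly (the package name is the stock CentOS one, and the tee is only there so a configure log gets kept this time):

$ sudo yum install hwloc-devel
$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0 \
    --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man \
    --with-pmix=/opt/pmix/2.0.2 --with-libevent=external \
    --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran \
    2>&1 | tee configure.log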

I also added the options that we have been using for a while,
--disable-dlopen and --enable-shared, to the configure line, and now
it runs with srun, both in its bare form

srun ./test_mpi

and specifying the pmix type

srun --mpi=pmix_v2 ./test_mpi

I will probably try to figure out whether either of the two configure
options was the culprit. I have a suspicion that there might have
been a cascade of errors from the missing hwloc, however.

-- bennet
Bennet Fauber
2018-06-08 15:10:02 UTC
Permalink
Further testing shows that the failure to find the hwloc-devel files
was the cause of the problem. I compiled and ran without the
additional configure flags, and it still seems to work.

I think configure issued only a two-line warning about this. Should that
result in an error if --with-hwloc=external is specified but hwloc is
not found? Just a thought.

My immediate problem is solved. Thanks very much Ralph and Artem for your
help!

-- bennet
r***@open-mpi.org
2018-06-08 15:16:00 UTC
Permalink
Post by Bennet Fauber
Further testing shows that it was the failure to find the hwloc-devel files that seems to be the cause of the failure. I compiled and ran without the additional configure flags, and it still seems to work.
I think it issued a two-line warning about this. Is that something that should result in an error if --with-hwloc=external is specified but not found? Just a thought.
Yes - that is a bug in our configury. It should have immediately error’d out.
Jeff Squyres (jsquyres) via users
2018-06-08 15:30:17 UTC
Permalink
Hmm. I'm confused -- can we clarify?

I just tried configuring Open MPI v3.1.0 on a RHEL 7.4 system with the RHEL hwloc RPM installed, but *not* the hwloc-devel RPM. Hence, no hwloc.h (for example).

When specifying an external hwloc, configure did fail, as expected:

-----
$ ./configure --with-hwloc=external ...
...

+++ Configuring MCA framework hwloc
checking for no configure components in framework hwloc...
checking for m4 configure components in framework hwloc... external, hwloc1117

--- MCA component hwloc:external (m4 configuration macro, priority 90)
checking for MCA component hwloc:external compile mode... static
checking --with-hwloc-libdir value... simple ok (unspecified value)
checking looking for external hwloc in... (default search paths)
checking hwloc.h usability... no
checking hwloc.h presence... no
checking for hwloc.h... no
checking if MCA component hwloc:external can compile... no
configure: WARNING: MCA component "external" failed to configure properly
configure: WARNING: This component was selected as the default
configure: error: Cannot continue
$
---

Are you seeing something different?
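(For completeness, a quick way to check whether the devel package is present on a given node, assuming the stock RHEL/CentOS packaging, is simply

$ rpm -q hwloc hwloc-devel

which reports "package hwloc-devel is not installed" when it is missing.)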
Further testing shows that it was the failure to find the hwloc-devel files that seems to be the cause of the failure. I compiled and ran without the additional configure flags, and it still seems to work.
I think it issued a two-line warning about this. Is that something that should result in an error if --with-hwloc=external is specified but not found? Just a thought.
Yes - that is a bug in our configury. It should have immediately error’d out.
My immediate problem is solved. Thanks very much Ralph and Artem for your help!
-- bennet
Odd - Artem, do you have any suggestions?
Post by Bennet Fauber
Thanks, Ralph,
I just tried it with
srun --mpi=pmix_v2 ./test_mpi
and got these messages
srun: Step created for job 89
[cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:92286] OPAL ERROR: Error in file
pmix2x_client.c at line 109
[cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:92287] OPAL ERROR: Error in file
pmix2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
Just to be complete, I checked the library path,
$ ldconfig -p | egrep 'slurm|pmix'
libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so
and libpmi* does appear there.
I also tried explicitly listing the slurm directory from the slurm
library installation in LD_LIBRARY_PATH, just in case it wasn't
traversing correctly. that is, both
$ echo $LD_LIBRARY_PATH
/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
and
$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
I don't have a saved build log, but I can rebuild this and save the
build logs, in case any information in those logs would help.
I will also mention that we have, in the past, used the
--disable-dlopen and --enable-shared flags, which we did not use here.
Just in case that makes any difference.
-- bennet
I think you need to set your MPIDefault to pmix_v2 since you are using a PMIx v2 library
Hi, Ralph,
Thanks for the reply, and sorry for the missing information. I hope
this fills in the picture better.
$ srun --version
slurm 17.11.7
$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix
We have pmix configured as the default in /opt/slurm/etc/slurm.conf
MpiDefault=pmix
and on the x86_64 system configured the same way, a bare 'srun
./test_mpi' is sufficient and runs.
I have tried all of the following srun variations with no joy
srun ./test_mpi
srun --mpi=pmix ./test_mpi
srun --mpi=pmi2 ./test_mpi
srun --mpi=openmpi ./test_mpi
I believe we are using the spec files that come with both pmix and
with slurm, and the following to build the .rpm files used at
installation
$ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
-ba pmix-2.0.2.spec
$ rpmbuild --define '_prefix /opt/slurm' \
--define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
-ta slurm-17.11.7.tar.bz2
I did use the '--with-pmix=/opt/pmix/2.0.2' option when building OpenMPI.
In case it helps, we have these libraries on the aarch64 in
/opt/slurm/lib64/slurm/mpi*
-rwxr-xr-x 1 root root 257288 May 30 15:27 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 257240 May 30 15:27 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 668808 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root 16 Jun 1 08:38
/opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 841312 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
and on the x86_64, where it runs, we have a comparable list,
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 622848 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root 16 Jun 1 08:32
/opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 828232 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
Let me know if anything else would be helpful.
Thanks, -- bennet
--
Jeff Squyres
***@cisco.com
Bennet Fauber
2018-06-08 15:38:09 UTC
Permalink
Jeff,

Hmm. Maybe I had insufficient error checking in our installation process.

Can you make and make install after the configure fails? I somehow got an
installation, despite the configure status, perhaps?

-- bennet




On Fri, Jun 8, 2018 at 11:32 AM Jeff Squyres (jsquyres) via users <
Post by Jeff Squyres (jsquyres) via users
Hmm. I'm confused -- can we clarify?
I just tried configuring Open MPI v3.1.0 on a RHEL 7.4 system with the
RHEL hwloc RPM installed, but *not* the hwloc-devel RPM. Hence, no hwloc.h
(for example).
-----
$ ./configure --with-hwloc=external ...
...
+++ Configuring MCA framework hwloc
checking for no configure components in framework hwloc...
checking for m4 configure components in framework hwloc... external, hwloc1117
--- MCA component hwloc:external (m4 configuration macro, priority 90)
checking for MCA component hwloc:external compile mode... static
checking --with-hwloc-libdir value... simple ok (unspecified value)
checking looking for external hwloc in... (default search paths)
checking hwloc.h usability... no
checking hwloc.h presence... no
checking for hwloc.h... no
checking if MCA component hwloc:external can compile... no
configure: WARNING: MCA component "external" failed to configure properly
configure: WARNING: This component was selected as the default
configure: error: Cannot continue
$
---
Are you seeing something different?
Post by r***@open-mpi.org
Post by Bennet Fauber
Further testing shows that it was the failure to find the hwloc-devel
files that seems to be the cause of the failure. I compiled and ran
without the additional configure flags, and it still seems to work.
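(A sketch of that fix, assuming the stock RHEL 7 package name:)
# install the hwloc headers so the hwloc:external component can find hwloc.h
$ sudo yum install hwloc-devel
# then re-run the same ./configure line and rebuild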
Post by r***@open-mpi.org
Post by Bennet Fauber
I think it issued a two-line warning about this. Is that something
that should result in an error if --with-hwloc=external is specified but
not found? Just a thought.
Post by r***@open-mpi.org
Yes - that is a bug in our configury. It should have immediately error’d
out.
Post by r***@open-mpi.org
Post by Bennet Fauber
My immediate problem is solved. Thanks very much Ralph and Artem for
your help!
Post by r***@open-mpi.org
Post by Bennet Fauber
-- bennet
Odd - Artem, do you have any suggestions?
Post by Bennet Fauber
Thanks, Ralph,
I just tried it with
srun --mpi=pmix_v2 ./test_mpi
and got these messages
srun: Step created for job 89
[cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:92286] OPAL ERROR: Error in file
pmix2x_client.c at line 109
[cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:92287] OPAL ERROR: Error in file
pmix2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
Just to be complete, I checked the library path,
$ ldconfig -p | egrep 'slurm|pmix'
libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so
and libpmi* does appear there.
I also tried explicitly listing the slurm directory from the slurm
library installation in LD_LIBRARY_PATH, just in case it wasn't
traversing correctly. that is, both
$ echo $LD_LIBRARY_PATH
/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
and
$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
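One more check that might be worth doing (a sketch, assuming the install prefix above and that the component that got built is the internal pmix2x one, as the error suggests):
$ ldd /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib/openmpi/mca_pmix_pmix2x.so | grep -i pmix
# shows which libpmix (if any) the Open MPI pmix component resolves at runtime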
--
Jeff Squyres
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Jeff Squyres (jsquyres) via users
2018-06-08 15:44:01 UTC
Permalink
Post by Bennet Fauber
Hmm. Maybe I had insufficient error checking in our installation process.
Can you make and make install after the configure fails? I somehow got an installation, despite the configure status, perhaps?
If it's a fresh tarball expansion that you've never built before, no (because there will be no Makefiles, etc.).

If you've previously built that tree, then configure may fail, but you can still run "make clean all install" because the stale Makefiles (etc.) are still around.
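A sketch of the two cases:
# fresh tree: configure failed, so no Makefiles exist and make has nothing to build
$ tar xf openmpi-3.1.0.tar.bz2 && cd openmpi-3.1.0
$ ./configure --with-hwloc=external ...    # fails as shown earlier
$ make                                     # fails: no makefile found
# previously built tree: stale Makefiles from the earlier configure are still there
$ make clean all install                   # builds and installs the *old* configuration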
--
Jeff Squyres
***@cisco.com
Artem Polyakov
2018-06-08 04:46:36 UTC
Permalink
Hello, Bennet.

One odd thing that I see in the error output that you have provided is that pmix2x_client.c is active.
Looking into the v3.1.x branch (https://github.com/open-mpi/ompi/tree/v3.1.x/opal/mca/pmix) I see the following components:
* ext1x
* ext2x
...
* pmix2x

pmix2x_client.c is in the internal pmix2x component, which shouldn't be built if the external ext2x component was configured. At least that was the case before.
According to the output it fails on PMIx_Init().
Can you please do "$ ls mca_pmix_*" in the <ompi-prefix>/lib/openmpi directory?

Another thing that caught my eye: you say that OMPI searches for PMIx 3.x:
...
Post by Bennet Fauber
Post by r***@open-mpi.org
It fails on the test for PMIx 3, which is expected, but then
reports
configure:12843: checking version 2x
configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
configure:12861: $? = 0
configure:12862: result: found
But OMPI v3.1.x doesn't have such a component. Can you provide the related lines from config.log?

Now, about debugging what is happening:
1. I'd like to see results with PMIx debug on:
$ env PMIX_DEBUG=100 srun --mpi=pmix_v2 ...

2. Can you set SlurmdDebug option in slurm.conf to 10, run the test and provide the content of slurmd.log?
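Concretely, something along these lines (sketch only):
# 1) client-side PMIx verbosity on the failing step
$ env PMIX_DEBUG=100 srun --mpi=pmix_v2 -N 2 -n 2 ./test_mpi
# 2) after raising SlurmdDebug in slurm.conf and reconfiguring the daemons,
#    locate the node's slurmd log to send along
$ scontrol show config | grep -i SlurmdLogFile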


Bennet Fauber
2018-06-08 11:53:48 UTC
Permalink
Hi, Artem,

Thanks for the reply. I'll answer a couple of questions inline below.

One odd thing that I see in the error output that you have provided is that
pmix2x_client.c is active.
Post by Artem Polyakov
Looking into the v3.1.x branch (
https://github.com/open-mpi/ompi/tree/v3.1.x/opal/mca/pmix) I see the
* ext1x
* ext2x
...
*pmix2x
pmix2x_client.c is in the internal pmix2x component, which shouldn't be built if the
external ext2x component was configured. At least that was the case before.
According to the output it fails on PMIx_Init().
Can you please do "$ ls mca_pmix_*" in the <ompi-prefix>/lib/openmpi directory?
$ ls mca_pmix*
mca_pmix_flux.la mca_pmix_isolated.la mca_pmix_pmix2x.la
mca_pmix_flux.so mca_pmix_isolated.so mca_pmix_pmix2x.so
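(The same thing is visible via ompi_info, assuming the ompi_info from this build is first in the PATH:)
$ ompi_info | grep -i "MCA pmix"
# only flux, isolated and the internal pmix2x are listed; no ext2x component was built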
Post by Artem Polyakov
...
Post by Bennet Fauber
It fails on the test for PMIx 3, which is expected, but then
reports
configure:12843: checking version 2x
configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
configure:12861: $? = 0
configure:12862: result: found
But OMPI v3.1.x doesn't have such a component. Can you provide the related
lines from config.log?
Here are the relevant lines.

configure:12680: checking if user requested external PMIx
support(/opt/pmix/2.0.2)
configure:12690: result: yes
configure:12701: checking --with-external-pmix value
configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
configure:12794: checking PMIx version
configure:12804: result: version file found
configure:12812: checking version 3x
configure:12830: gcc -E -I/opt/pmix/2.0.2/include conftest.c
conftest.c:95:56: error: #error "not version 3"

I believe that is a red herring. Some time in the past, I was told that
there is an anticipatory test for PMIx 3, and since no such version exists
yet, that check is expected to fail.
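(What that check keys on can be inspected directly; a sketch, assuming this PMIx install's header layout:)
$ grep -E 'PMIX_VERSION_(MAJOR|MINOR)' /opt/pmix/2.0.2/include/pmix_version.h
# a 2.x major version here is why the "version 3x" test fails and the "version 2x" test passes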
Post by Artem Polyakov
$ env PMIX_DEBUG=100 srun --mpi=pmix_v2 ...
Here is that output, which seems little changed from what was before. I
include only that from the first communicator, as it repeats almost
verbatim for the others.

srun: Step created for job 99
[cav02.arc-ts.umich.edu:41373] psec: native init
[cav02.arc-ts.umich.edu:41373] psec: none init
[cav02.arc-ts.umich.edu:41374] psec: native init
[cav02.arc-ts.umich.edu:41374] psec: none init
[cav02.arc-ts.umich.edu:41373] pmix: init called
[cav02.arc-ts.umich.edu:41373] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:41373] OPAL ERROR: Error in file pmix2x_client.c at
line 109
[cav02.arc-ts.umich.edu:41374] pmix: init called
[cav02.arc-ts.umich.edu:41374] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:41374] OPAL ERROR: Error in file pmix2x_client.c at
line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
--------------------------------------------------------------------------


The second through fourth also have a line about

[cav02.arc-ts.umich.edu:41373] Local abort before MPI_INIT completed
completed successfully, but am not able to aggregate error messages, and
not able to guarantee that all other processes were killed!
Post by Artem Polyakov
2. Can you set SlurmdDebug option in slurm.conf to 10, run the test and
provide the content of slurmd.log?
I will reply separately with this, as I have to coordinate with the cluster
administrator, who is not in yet.

Please note, also, that I was able to build this successfully after installing
the hwloc-devel package and adding the --disable-dlopen and
--enable-shared options to configure.
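For the record, a sketch of the configure line that produced the working build, reusing the prefixes from earlier in the thread:
$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0 \
    --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man \
    --with-pmix=/opt/pmix/2.0.2 --with-libevent=external \
    --with-hwloc=external --with-slurm \
    --disable-dlopen --enable-shared \
    CC=gcc CXX=g++ FC=gfortran
$ make -j $(nproc) && make install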

Thanks, -- bennet
Bennet Fauber
2018-06-08 14:56:28 UTC
Permalink
Artem,

Please find attached the gzipped slurmd.log with the entries from the
failed job's run.

-- bennet
Post by Bennet Fauber
Hi, Artem,
Thanks for the reply. I'll answer a couple of questions inline below.
One odd thing that I see in the error output that you have provided is
that pmix2x_client.c is active.
Post by Artem Polyakov
Looking into the v3.1.x branch (
https://github.com/open-mpi/ompi/tree/v3.1.x/opal/mca/pmix) I see the
* ext1x
* ext2x
...
*pmix2x
Pmix2x_client is in internal pmix2x component that shouldn't be built if
external ext2x component was configured. At least it was the case before.
According to the output it fails on PMIx_Init().
Can you please do "$ ls mca_pmix_*" in the <ompi-prefix>/lib/openmpi directory?
$ ls mca_pmix*
mca_pmix_flux.la mca_pmix_isolated.la mca_pmix_pmix2x.la
mca_pmix_flux.so mca_pmix_isolated.so mca_pmix_pmix2x.so
Post by Artem Polyakov
...
Post by Bennet Fauber
It fails on the test for PMIx 3, which is expected, but then
reports
configure:12843: checking version 2x
configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
configure:12861: $? = 0
configure:12862: result: found
But OMPI v3.1.x doesn't have such a component. Can you provide the
related lines from config.log?
Here are the relevant lines.
configure:12680: checking if user requested external PMIx
support(/opt/pmix/2.0.2)
configure:12690: result: yes
configure:12701: checking --with-external-pmix value
configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
configure:12794: checking PMIx version
configure:12804: result: version file found
configure:12812: checking version 3x
configure:12830: gcc -E -I/opt/pmix/2.0.2/include conftest.c
conftest.c:95:56: error: #error "not version 3"
I believe that is a red herring. Some time in the past, I was told that
there is an anticipatory test for pmix3, and since there isn't such a
thing, this is expected to fail.
Post by Artem Polyakov
$ env PMIX_DEBUG=100 srun --mpi=pmix_v2 ...
Here is that output, which seems little changed from what was before. I
include only that from the first communicator, as it repeats almost
verbatim for the others.
srun: Step created for job 99
[cav02.arc-ts.umich.edu:41373] psec: native init
[cav02.arc-ts.umich.edu:41373] psec: none init
[cav02.arc-ts.umich.edu:41374] psec: native init
[cav02.arc-ts.umich.edu:41374] psec: none init
[cav02.arc-ts.umich.edu:41373] pmix: init called
[cav02.arc-ts.umich.edu:41373] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:41373] OPAL ERROR: Error in file pmix2x_client.c
at line 109
[cav02.arc-ts.umich.edu:41374] pmix: init called
[cav02.arc-ts.umich.edu:41374] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234
[cav02.arc-ts.umich.edu:41374] OPAL ERROR: Error in file pmix2x_client.c
at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
--------------------------------------------------------------------------
The second through fourth also have a line about
[cav02.arc-ts.umich.edu:41373] Local abort before MPI_INIT completed
completed successfully, but am not able to aggregate error messages, and
not able to guarantee that all other processes were killed!
Post by Artem Polyakov
2. Can you set SlurmdDebug option in slurm.conf to 10, run the test and
provide the content of slurmd.log?
I will reply separately with this, as I have to coordinate with the
cluster administrator, who is not in yet.
Please note, also, that I was able to build this successfully after
install the hwlock-devel package and adding the --disable-dlopen and
--enable-shared options to configure.
Thanks, -- bennet
Post by Artem Polyakov
----------------------------------------------------------------------
Message: 1
Date: Thu, 7 Jun 2018 08:05:30 -0700
Subject: Re: [OMPI users] Fwd: OpenMPI 3.1.0 on aarch64
Content-Type: text/plain; charset=utf-8
Odd - Artem, do you have any suggestions?
Post by Bennet Fauber
Thanks, Ralph,
I just tried it with
srun --mpi=pmix_v2 ./test_mpi
and got these messages
srun: Step created for job 89
[cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234 [cav02.arc-ts.umich.edu:92286] OPAL
ERROR: Error in file pmix2x_client.c at line 109
[cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
client/pmix_client.c at line 234 [cav02.arc-ts.umich.edu:92287] OPAL
ERROR: Error in file pmix2x_client.c at line 109
----------------------------------------------------------------------
---- The application appears to have been direct launched using
"srun", but OMPI was not built with SLURM's PMI support and therefore
cannot execute. There are several options for building PMI support
under SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
Just to be complete, I checked the library path,
$ ldconfig -p | egrep 'slurm|pmix'
libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so
and libpmi* does appear there.
I also tried explicitly adding the slurm subdirectory of the Slurm
library installation to LD_LIBRARY_PATH, in case it wasn't being
traversed correctly. That is, both
$ echo $LD_LIBRARY_PATH
/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
and
$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
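(The second value would presumably have been produced by a prepend
along these lines, using the paths shown above:)

$ export LD_LIBRARY_PATH=/opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:$LD_LIBRARY_PATH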
I don't have a saved build log, but I can rebuild this and save the
build logs, in case any information in those logs would help.
I will also mention that, in the past, we have used the
--disable-dlopen and --enable-shared flags, which we did not use
here, just in case that makes any difference.
-- bennet
I think you need to set your MpiDefault to pmix_v2, since you are
using a PMIx v2 library.
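In /opt/slurm/etc/slurm.conf that would presumably be the one-line
change below; for a one-off test, the same plugin can also be forced on
the command line:

MpiDefault=pmix_v2
$ srun --mpi=pmix_v2 ./test_mpi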
Hi, Ralph,
Thanks for the reply, and sorry for the missing information. I hope
this fills in the picture better.
$ srun --version
slurm 17.11.7
$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix
We have pmix configured as the default in /opt/slurm/etc/slurm.conf
MpiDefault=pmix
and on the x86_64 system configured the same way, a bare 'srun
./test_mpi' is sufficient and runs.
I have tried all of the following srun variations with no joy
srun ./test_mpi
srun --mpi=pmix ./test_mpi
srun --mpi=pmi2 ./test_mpi
srun --mpi=openmpi ./test_mpi
I believe we are using the spec files that come with PMIx and with
Slurm, and the following commands to build the .rpm files used for
installation:
$ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
-ba pmix-2.0.2.spec
$ rpmbuild --define '_prefix /opt/slurm' \
--define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
-ta slurm-17.11.7.tar.bz2
I did use the '--with-pmix=/opt/pmix/2.0.2' option when building
OpenMPI.
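It might also be worth confirming that the Slurm PMIx plugin resolves
against the PMIx install that Open MPI was configured with, for example:

$ ldd /opt/slurm/lib64/slurm/mpi_pmix_v2.so | grep -i pmix

If that pointed at some libpmix other than the one in
/opt/pmix/2.0.2/lib, srun and the Open MPI client could be speaking
mismatched PMIx versions, which would be consistent with the init
failures above.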
Post by Bennet Fauber
In case it helps, we have these libraries on the aarch64 in
/opt/slurm/lib64/slurm/mpi*
-rwxr-xr-x 1 root root 257288 May 30 15:27 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 257240 May 30 15:27 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 668808 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root     16 Jun  1 08:38 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 841312 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
and on the x86_64, where it runs, we have a comparable list,
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 622848 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root     16 Jun  1 08:32 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 828232 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
Let me know if anything else would be helpful.
Thanks, -- bennet
You didn't show your srun direct launch cmd line or what version of
Slurm is being used (and how it was configured), so I can only provide
some advice:
1. Slurm must be configured to use PMIx - depending on the version,
that might be there by default in the rpm
2. you have to tell srun to use the pmix plugin (IIRC you add --mpi
pmix to the cmd line - you should check that)
If your intent was to use Slurm's PMI-1 or PMI-2, then you need to
configure OMPI --with-pmi=<path-to-those-libraries>
Ralph
We are trying out MPI on an aarch64 cluster.
Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
I compiled OpenMPI using the ARM distributed gcc/7.1.0 using the
configure flags shown in this snippet from the top of config.log
It was created by Open MPI configure 3.1.0, which was generated by
GNU Autoconf 2.69. Invocation command line was
$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
## --------- ##
## Platform. ##
## --------- ##
hostname = cavium-hpc.arc-ts.umich.edu
uname -m = aarch64
uname -r = 4.11.0-45.4.1.el7a.aarch64
uname -s = Linux
uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
/usr/bin/uname -p = aarch64
It checks for pmi and reports it found,
configure:12680: checking if user requested external PMIx
support(/opt/pmix/2.0.2)
configure:12690: result: yes
configure:12701: checking --with-external-pmix value
configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
configure:12794: checking PMIx version
configure:12804: result: version file found
It fails on the test for PMIx 3, which is expected, but then
reports
configure:12843: checking version 2x
configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
configure:12861: $? = 0
configure:12862: result: found
I have a small test MPI program, and it runs when launched with
mpirun. The processes running on the first node of a two-node job are
bennet 20340 20282 0 08:04 ? 00:00:00 mpirun ./test_mpi
bennet 20346 20340 0 08:04 ? 00:00:00 srun
--ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
--nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca
ess_base_jobid "3609657344" -mca ess_base_vpid "1" -mca
orte_hnp_uri "3609657344.0;tcp://10.242.15.36:58681"
bennet 20347 20346 0 08:04 ? 00:00:00 srun
--ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
--nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca
ess_base_jobid "3609657344" -mca ess_base_vpid "1" -mca
orte_hnp_uri "3609657344.0;tcp://10.242.15.36:58681"
bennet 20352 20340 98 08:04 ? 00:01:50 ./test_mpi
bennet 20353 20340 98 08:04 ? 00:01:50 ./test_mpi
However, when I run it using srun directly, I get the following
srun: Step created for job 87
[cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file
pmix2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched
using "srun", but OMPI was not built with SLURM's PMI support and
therefore cannot execute. There are several options for building
PMI support under SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi
pointing to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate
error messages, and not able to guarantee that all other processes
were killed!
Post by Bennet Fauber
Using the same scheme to set this up on x86_64 worked, and I am
taking installation parameters, test files, and job parameters
from the working x86_64 installation.
Other than the architecture, the main difference between the two
clusters is that the aarch64 cluster has only Ethernet networking,
whereas the x86_64 cluster has InfiniBand. I removed --with-verbs
from the configure line, though, and thought that would be
sufficient.
Anyone have suggestions what might be wrong, how to fix it, or for
further diagnostics?
Thank you, -- bennet
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users