Discussion: [OMPI users] Openmpi-3.1.0 + slurm?
Bill Broadley
2018-05-09 00:56:55 UTC
I have openmpi-3.0.1, pmix-1.2.4, and slurm-17.11.5 working well on a few
clusters. For things like:

***@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec
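
(relay.c itself isn't posted in this thread; a minimal sketch of a
similar ring-relay timing test might look like the following. The hop
count, the meaning of the argument, and the output format are
assumptions, not the actual relay source.)

/* relay-sketch.c: hypothetical stand-in for relay.c.  Passes a buffer
 * around all ranks in a ring and reports the average time per hop.
 * Build: mpicc -O3 relay-sketch.c -o relay */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    int msg_kb = (argc > 1) ? atoi(argv[1]) : 1;  /* message size in KB */
    int bytes  = msg_kb * 1024;
    int hops   = 16384;                           /* total ring hops (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("%s ", host);                          /* node names, unordered */

    char *buf = calloc(1, bytes);
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    /* each loop iteration moves the message one full lap = 'size' hops */
    for (int i = 0; i < hops / size; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, bytes, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("\nsize= %d, %d hops, %d nodes in %.2f sec (%5.2f us/hop)\n",
               msg_kb, hops, size, elapsed, elapsed / hops * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}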

I've been having a tougher time trying to get openmpi-3.1.0, (external)
pmix-2.1.1, and slurm-17.11.5 working. Does anyone have a similar
combination working?

I configured Open MPI and Slurm, respectively, with:

./configure --prefix=/share/apps/openmpi-3.1.0/gcc7 \
    --with-pmix=/share/apps/pmix-2.1.1/gcc7 --with-libevent=external \
    --disable-io-romio --disable-io-ompio

./configure --prefix=/share/apps/slurm-17.11.5/gcc7 \
    --with-pmix=/share/apps/pmix-2.1.1/gcc7
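
(A quick way to sanity-check both builds, assuming both installs are on
PATH, is:

srun --mpi=list
ompi_info | grep -i pmix

If Slurm was built against PMIx, the first should list a pmix/pmix_v2
plugin; if Open MPI picked up the external PMIx 2.x, the second should
show an ext2x pmix component.)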

Both config.logs look promising: no PMIx-related errors, and the
expected variables are being set, including the discovered PMIx flags.
I did notice that the working Open MPI configs had:
#define OPAL_PMIX_V1 1

But the nonworking openmpi config had:
#define OPAL_PMIX_V1 0

Although that's not too surprising, since I'm compiling and linking
against pmix-2.1.1, so the PMIx v1 compatibility component should be
disabled.

The other relevant variables set by configure:
OPAL_CONFIGURE_CLI=' \'\''--prefix=/share/apps/openmpi-3.1.0/gcc7\'\''
\'\''--with-pmix=/share/apps/pmix-2.1.1/gcc7\'\''
\'\''--with-libevent=external\'\'' \'\''--disable-io-romio\'\''
\'\''--disable-io-ompio\'\'''
opal_pmix_ext1x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
opal_pmix_ext1x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'
opal_pmix_ext1x_LIBS='-lpmix'
opal_pmix_ext2x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
opal_pmix_ext2x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'

Any hints on how to debug this?

When I try to run:
***@demon:~/relay$ mpicc -O3 relay.c -o relay
***@demon:~/relay$ srun -N 2 -n 2 ./relay 1
[c2-50:01318] OPAL ERROR: Not initialized in file ext2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[c2-50:01318] Local abort before MPI_INIT completed completed successfully, but
am not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
Bill Broadley
2018-05-09 02:02:25 UTC
Sorry all,

Chris S over on the slurm list spotted it right away: I didn't have
MpiDefault set to pmix_v2 in slurm.conf.
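
For anyone hitting the same "OPAL ERROR: Not initialized" message, the
fix was one line in slurm.conf (followed by a reconfigure/restart of
the Slurm daemons):

MpiDefault=pmix_v2

or, equivalently, per job without changing the default:

srun --mpi=pmix_v2 -N 2 -n 2 ./relay 1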

I can confirm that Ubuntu 18.04, gcc-7.3, openmpi-3.1.0, pmix-2.1.1, and
slurm-17.11.5 seem to work well together.

Sorry for the bother.
r***@open-mpi.org
2018-05-09 02:23:04 UTC
Good news - thanks!