Discussion:
[OMPI users] OpenMPI2 + slurm
Lothar Brendel
2018-11-23 08:17:00 UTC
Hi guys,

I've always been somewhat at a loss regarding Slurm's idea of tasks vs. jobs. That never caused any problems, though, until moving to Open MPI 2 (2.0.2, to be precise, with Slurm 16.05.9).

Running http://mpitutorial.com/tutorials/mpi-hello-world as an example with just

srun -n 2 MPI-hellow

yields

Hello world from processor node31, rank 0 out of 1 processors
Hello world from processor node31, rank 0 out of 1 processors

i.e. the two tasks don't see each other MPI-wise. Fair enough: srun by itself doesn't involve an mpirun.

But running

srun -n 2 mpirun MPI-hellow

produces

Hello world from processor node31, rank 1 out of 2 processors
Hello world from processor node31, rank 0 out of 2 processors
Hello world from processor node31, rank 1 out of 2 processors
Hello world from processor node31, rank 0 out of 2 processors

i.e. I get *two* independent MPI jobs with 2 processes each. (The same happens if I state "mpirun -np 2" explicitly.)
I could never make sense of this squaring, so I used to run my jobs like

srun -c 2 mpirun -np 2 MPI-hellow

which provided the desired job with *one* task using 2 processors. Since moving from Open MPI 1.6.5 to 2.0.2 (Debian Jessie -> Stretch), though, I'm now getting the error

"There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
MPI-hellow"

The environment on the node contains

SLURM_CPUS_ON_NODE=2
SLURM_CPUS_PER_TASK=2
SLURM_JOB_CPUS_PER_NODE=2
SLURM_NTASKS=1
SLURM_TASKS_PER_NODE=1

which looks fine to me, but mpirun infers slots=1 from that (confirmed with ras_base_verbose 5). Indeed, looking into orte/mca/ras/slurm/ras_slurm_module.c, I find that while orte_ras_slurm_allocate() reads the value of SLURM_CPUS_PER_TASK into its local variable cpus_per_task, it never uses it; instead, the number of slots is determined from SLURM_TASKS_PER_NODE.
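For reference, this is how I checked the inferred slot count (run inside the allocation, binary as above):

mpirun --mca ras_base_verbose 5 -np 2 MPI-hellow

The verbose output shows the Slurm RAS component registering slots=1 for the node.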

Is this intended behaviour?

What's wrong here? I know that I can use --oversubscribe, but that seems rather a workaround.
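(For completeness, the workaround would be something like

mpirun --oversubscribe -np 2 MPI-hellow

inside the allocation, but I'd rather have the slot count come out right in the first place.)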

Thanks in advance,
Lothar
Gilles Gouaillardet
2018-11-23 10:31:24 UTC
Lothar,

it seems you did not configure Open MPI with --with-pmi=<path to SLURM's PMI>

If SLURM was built with PMIx support, then another option is to use that.
First, srun --mpi=list will show you the list of available MPI
modules, and then you could

srun --mpi=pmix_v2 ... MPI-hellow

If you believe that should be the default, then contact your sysadmin,
who can make it so for you.

If you want to use PMIx, then I recommend you configure Open MPI with
the same external PMIx that was used to build SLURM (e.g. configure
--with-pmix=<path to PMIx>). Though PMIx has cross-version support,
using the same PMIx avoids running incompatible versions.
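For example, assuming SLURM's PMIx lives under /opt/pmix (the path is just an illustration):

./configure --with-pmix=/opt/pmix
make install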


Cheers,

Gilles
Ralph H Castain
2018-11-23 14:36:19 UTC
Post by Lothar Brendel
srun -n 2 mpirun MPI-hellow
tells srun to launch two copies of mpirun, each of which is to run as many processes as there are slots assigned to the allocation. srun will get an allocation of two slots, and so you’ll get two concurrent MPI jobs, each consisting of two procs.
Post by Lothar Brendel
srun -c 2 mpirun -np 2 MPI-hellow
tells srun to allocate two CPUs but run only one copy of mpirun (one task being the default for -n), while mpirun launches two procs. So you get one job consisting of two procs.
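If you simply want one two-proc job, the more common pattern is to create the allocation first and launch mpirun once inside it, e.g.

salloc -n 2 mpirun MPI-hellow

so mpirun sees the two slots and fills them itself.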

What you probably want to do is what Gilles advised. However, Slurm 16.05 only supports PMIx v1, so you'd want to download and build PMIx v1.2.5 and then build Slurm against it. OMPI v2.0.2 may have a slightly older copy of PMIx in it (I honestly don't remember) - to be safe, it would be best to configure OMPI to use the 1.2.5 you installed for Slurm. You'll also need to build OMPI against external copies of libevent and hwloc to ensure OMPI links against the same versions used by PMIx.
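Roughly, with all paths being examples only:

# build PMIx 1.2.5 first, then point OMPI at it and at external libevent/hwloc
./configure --with-pmix=/opt/pmix-1.2.5 \
    --with-libevent=/opt/libevent --with-hwloc=/opt/hwloc

using the same --with-pmix=/opt/pmix-1.2.5 when configuring Slurm.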

Or you can just build OMPI against the Slurm PMI library - up to you.
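That would look something like

./configure --with-slurm --with-pmi=/usr

where /usr stands in for wherever Slurm's pmi.h and libpmi are installed.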

Ralph