Lothar Brendel
2018-11-23 08:17:00 UTC
Hi guys,
I've always been somewhat at a loss regarding Slurm's notion of tasks vs. jobs. That didn't cause any problems, though, until I moved to Open MPI 2 (2.0.2, that is, with Slurm 16.05.9).
Running http://mpitutorial.com/tutorials/mpi-hello-world as an example with just
srun -n 2 MPI-hellow
yields
Hello world from processor node31, rank 0 out of 1 processors
Hello world from processor node31, rank 0 out of 1 processors
i.e. the two tasks don't see each other MPI-wise. Well, srun doesn't include an mpirun.
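(For reference, MPI-hellow is essentially the hello-world code from that tutorial, i.e. roughly the following; my local copy may differ in details:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Initialize MPI and query this rank's place in MPI_COMM_WORLD. */
    MPI_Init(&argc, &argv);

    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    /* This produces exactly the output lines quoted above. */
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

So "rank 0 out of 1 processors" twice means two separate MPI_COMM_WORLDs of size 1.)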
But running
srun -n 2 mpirun MPI-hellow
produces
Hello world from processor node31, rank 1 out of 2 processors
Hello world from processor node31, rank 0 out of 2 processors
Hello world from processor node31, rank 1 out of 2 processors
Hello world from processor node31, rank 0 out of 2 processors
i.e. I get *two* independent MPI runs with 2 ranks each. (The same happens if I state "mpirun -np 2" explicitly.)
I could never make sense of this squaring; instead I used to run my jobs like
srun -c 2 mpirun -np 2 MPI-hellow
which gave me the desired job with *one* task using 2 processors. Since moving from Open MPI 1.6.5 to 2.0.2 (Debian Jessie -> Stretch), though, I get the error
"There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
MPI-hellow"
The environment on the node contains
SLURM_CPUS_ON_NODE=2
SLURM_CPUS_PER_TASK=2
SLURM_JOB_CPUS_PER_NODE=2
SLURM_NTASKS=1
SLURM_TASKS_PER_NODE=1
which looks fine to me, but mpirun infers slots=1 from that (confirmed with ras_base_verbose 5). Indeed, looking into orte/mca/ras/slurm/ras_slurm_module.c, I find that while orte_ras_slurm_allocate() reads the value of SLURM_CPUS_PER_TASK into its local variable cpus_per_task, it never uses it afterwards. Instead, the number of slots is determined from SLURM_TASKS_PER_NODE alone.
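To illustrate, here is a minimal sketch of the logic as I understand it (my paraphrase, not the actual OMPI code; the real parser also handles the "N(xM)" repetition syntax of SLURM_TASKS_PER_NODE):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* e.g. "1" in my case, or "2(x3)" on multi-node allocations */
    char *tasks_per_node = getenv("SLURM_TASKS_PER_NODE");
    char *cpus_env       = getenv("SLURM_CPUS_PER_TASK");

    /* The value is read into a local variable ... */
    int cpus_per_task = cpus_env ? atoi(cpus_env) : 1;
    (void)cpus_per_task;   /* ... but, as far as I can see, never used again. */

    /* The per-node slot count comes from SLURM_TASKS_PER_NODE only: */
    int slots = tasks_per_node ? atoi(tasks_per_node) : 0;

    printf("slots=%d\n", slots);   /* with the environment above: slots=1 */
    return 0;
}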
Is this intended behaviour?
What's wrong here? I know that I can use --oversubscribe, but that seems more like a workaround than a fix.
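(For completeness, by that workaround I mean something like
srun -c 2 mpirun -np 2 --oversubscribe MPI-hellow
which does run, but sidesteps the slot count rather than explaining it.)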
Thanks in advance,
Lothar