Discussion:
[OMPI users] slurm configuration override mpirun command line process mapping
Nicolas Deladerriere
2018-05-16 06:50:27 UTC
Hi all,



I am trying to run an MPI application through the SLURM job scheduler. Here is my
running sequence:


sbatch --> my_env_script.sh --> my_run_script.sh --> mpirun


In order to minimize modification of my production environment, I had to set up
the following hostlist management in the different scripts:


*my_env_script.sh*


Build the host list from the SLURM resource manager information.

Example: node01 nslots=2 ; node02 nslots=2 ; node03 nslots=2
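
Roughly, that list is built from the SLURM environment along these lines (simplified
sketch, not the exact production script; the variable names are only illustrative):

# inside my_env_script.sh (simplified, illustrative sketch)
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")     # e.g. node01 node02 node03
nslots=$(echo "$SLURM_JOB_CPUS_PER_NODE" | sed 's/(.*//')  # e.g. "2(x3)" -> 2, homogeneous nodes assumed
hostlist=""
for n in $nodes; do
    hostlist="$hostlist$n nslots=$nslots ; "
done
echo "$hostlist"   # -> node01 nslots=2 ; node02 nslots=2 ; node03 nslots=2 ;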


*my_run_script.sh*


Build the host list according to the required job (the process mapping depends on
the job requirements).

Nodes are always fully dedicated to my job, but I have to manage different
master-slave situations with the corresponding mpirun commands:

- as many processes as there are slots:

mpirun -H node01 -np 1 process_master.x : -H node02,node02,node03,node03 -np 4 process_slave.x

- only one process per node (the slots are usually used through OpenMP
threading):

mpirun -H node01 -np 1 other_process_master.x : -H node02,node03 -np 2 other_process_slave.x
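
my_run_script.sh then derives both variants from that list, something like this
(again a simplified, untested sketch with illustrative names):

# inside my_run_script.sh (simplified, illustrative sketch)
# $nodes and $nslots as built above (passed between the scripts in practice)
master=$(echo $nodes | awk '{print $1}')   # first node runs the master
slaves=$(echo $nodes | cut -d' ' -f2-)     # remaining nodes run the slaves
nslaves=$(echo $slaves | wc -w)

# variant 1: as many slave processes as slots
slave_hosts=$(for n in $slaves; do for i in $(seq $nslots); do echo $n; done; done | paste -s -d, -)
mpirun -H $master -np 1 process_master.x : \
       -H $slave_hosts -np $((nslots * nslaves)) process_slave.x

# variant 2: one slave process per node (OpenMP threads fill the remaining slots)
slave_hosts=$(echo $slaves | tr ' ' ',')
mpirun -H $master -np 1 other_process_master.x : \
       -H $slave_hosts -np $nslaves other_process_slave.x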



However, I realized that whatever I specify through my mpirun command, the
process mapping is overridden at run time by SLURM according to the SLURM
settings (either the default settings or the sbatch command line). For example,
if I run with:


sbatch -N 3 --exclusive my_env_script.sh myjob


where the final mpirun command (depending on myjob) is:


mpirun -H node01 -np 1 other_process_master.x : -H node02,node03 -np 2 other_process_slave.x


It will actually run with a process mapping corresponding to:


mpirun -H node01 -np 1 other_process_master.x : -H node02,node02 -np 2 other_process_slave.x


So far I have not found a way to force mpirun to use the host mapping from the
command line instead of the SLURM one. Is there a way to do it (either by using
MCA parameters, SLURM configuration, or ...)?


openmpi version : 1.6.5

slurm version : 17.11.2



Regards,

Nicolas


Note 1: I know it would be better to let SLURM manage my process mapping by
using only SLURM parameters and not specifying the host mapping in my mpirun
command, but in order to minimize modification of my production environment
I had to use this solution.

Note 2: I know I am using an old openmpi version!
Gilles Gouaillardet
2018-05-16 06:58:34 UTC
You can try to disable SLURM:

mpirun --mca ras ^slurm --mca plm ^slurm --mca ess ^slurm,slurmd ...

That will require that you are able to SSH between compute nodes.
Keep in mind this is far from ideal, since it might leave some MPI
processes running on nodes if you cancel a job, and mess up SLURM accounting too.
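
Applied to the example from your first mail, that would be something like (untested):

mpirun --mca ras ^slurm --mca plm ^slurm --mca ess ^slurm,slurmd \
       -H node01 -np 1 other_process_master.x : \
       -H node02,node03 -np 2 other_process_slave.x

The same MCA parameters can also be exported as environment variables
(e.g. OMPI_MCA_ras=^slurm) before calling mpirun.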


Cheers,

Gilles

r***@open-mpi.org
2018-05-16 07:47:05 UTC
The problem here is that you have made an incorrect assumption. In the older OMPI versions, the -H option simply indicated that the specified hosts were available for use - it did not imply the number of slots on that host. Since you have specified 2 slots on each host, and you told mpirun to launch 2 procs of your second app_context (the “slave”), it filled the first node with the 2 procs.

I don’t recall the options for that old a version, but IIRC you should add --pernode to the cmd line to get exactly 1 proc/node

Or upgrade to a more recent OMPI version where -H can also be used to specify the #slots on a node :-)
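
In those newer versions the slot count goes directly on the host list, e.g.
something like (untested here):

mpirun -H node01:1 -np 1 other_process_master.x : \
       -H node02:1,node03:1 -np 2 other_process_slave.x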
r***@open-mpi.org
2018-05-17 12:33:53 UTC
mpirun takes the #slots for each node from the slurm allocation. Your hostfile (at least, what you provided) retained that information and shows 2 slots on each node. So the original allocation _and_ your constructed hostfile are both telling mpirun to assign 2 slots on each node.

Like I said before, on this old version, -H doesn’t say anything about #slots - that information is coming solely from the original allocation and your hostfile.
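
In other words, mpirun effectively sees an allocation equivalent to this hostfile
(using the example nodes):

node01 slots=2
node02 slots=2
node03 slots=2

so the two slave procs both fit within node02's slots and never reach node03.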
Post by Nicolas Deladerriere
In my case, I do not specify the number of slots per node to Open MPI (see the mpirun command just above). From what I see, the only place I define the number of slots in this case is actually through the SLURM configuration (SLURM_JOB_CPUS_PER_NODE=4(x3)), and I was not expecting this to be taken into account when running the MPI processes.
Using --bynode is probably the easiest solution in my case, even if I am afraid that it will not necessarily fit all my running configurations. A better solution would be to review my management scripts for better integration with the SLURM resource manager, but that is another story.
Nicolas Deladerriere
2018-05-17 13:17:56 UTC
"mpirun takes the #slots for each node from the slurm allocation."
Yes this is my issue and what I was not expected. But I will stick with
--bynode solution.
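
Concretely (untested sketch, reusing the nodes from my first mail), that gives:

mpirun --bynode -H node01 -np 1 other_process_master.x : \
       -H node02,node03 -np 2 other_process_slave.x

so the two slave processes should be spread round-robin over node02 and node03
instead of filling node02 first.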

Thanks a lot for your help.
Regards,
Nicolas
Gilles Gouaillardet
2018-05-17 12:23:13 UTC
Nicolas,

This looks odd at first glance, but as stated before, 1.6 is an obsolete
series.
A workaround could be to run
mpirun --mca ess ...
and replace ... with a comma-separated list of ess components that excludes
both slurm and slurmd.

Another workaround could be to remove the SLURM-related environment variables
before calling mpirun.
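
For the second workaround, something like this before calling mpirun should do it
(untested sketch):

# drop the SLURM_* variables so Open MPI does not detect the SLURM allocation
for v in $(env | grep '^SLURM_' | cut -d= -f1); do
    unset "$v"
done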


Cheers,

Gilles

On Thursday, May 17, 2018, Nicolas Deladerriere <
Post by Nicolas Deladerriere
Hi all,
Thanks for your feedback,
about using " mpirun --mca ras ^slurm --mca plm ^slurm --mca ess
^slurm,slurmd ...". I am a bit confused since syntax sounds good, but I
--------------------------------------------------------------------------
MCA framework parameters can only take a single negation operator ("^"),
and it must be at the beginning of the value. The following value
violates this rule: env,^slurm,slurmd
When used, the negation operator sets the "exclusive" behavior mode,
meaning that it will exclude all specified components (and implicitly
include all others).
...
You cannot mix inclusive and exclusive behavior.
Is there another MCA setting that could conflict with the command line setting? For reference, the full command was:
/.../openmpi/1.6.5/bin/mpirun -prefix /.../openmpi/1.6.5 -tag-output \
    -H r01n05 -x OMP_NUM_THREADS -np 1 \
    --mca ras ^slurm --mca plm ^slurm --mca ess ^slurm,slurmd master_exe.x : \
    -H r01n06,r01n07 -x OMP_NUM_THREADS -np 2 slave_exe.x
host% ompi_info --all | grep slurm
    MCA ras: slurm (MCA v2.0, API v2.0, Component v1.6.5)
    MCA plm: slurm (MCA v2.0, API v2.0, Component v1.6.5)
    MCA ess: slurm (MCA v2.0, API v2.0, Component v1.6.5)
    MCA ess: slurmd (MCA v2.0, API v2.0, Component v1.6.5)
    MCA ras: parameter "ras_slurm_priority" (current value: <75>, data source: default value)
             Priority of the slurm ras component
    MCA plm: parameter "plm_slurm_args" (current ...)
    MCA plm: parameter "plm_slurm_priority" (current value: <0>, data source: default value)
    MCA ess: parameter "ess_slurm_priority" (current value: <0>, data source: default value)
    MCA ess: parameter "ess_slurmd_priority" (current value: <0>, data source: default value)
In my case, I do not specify the number of slots per node to Open MPI (see the
mpirun command just above). From what I see, the only place I define the number
of slots in this case is actually through the SLURM configuration
(SLURM_JOB_CPUS_PER_NODE=4(x3)), and I was not expecting this to be taken into
account when running the MPI processes.
Using --bynode is probably the easiest solution in my case, even if I am afraid
that it will not necessarily fit all my running configurations. A better
solution would be to review my management scripts for better integration
with the SLURM resource manager, but that is another story.
Thanks for your help.
Regards,
Nicolas
Nicolas Deladerriere
2018-05-17 13:31:13 UTC
Gilles,

With an ess component list that excludes slurm and slurmd, I run into connection
issues. I guess I need slurm and slurmd in my runtime context! Anyway, as you
mentioned, that is not a good solution given the MPI processes that may be left
behind when using scancel, and I guess I will also lose some of the process
monitoring functionality from SLURM.

I will stick with updating the mpirun command line to use the --bynode option.

Thanks a lot for your help.
Regards,
Nicolas

