Discussion: [OMPI users] OpenMPI & Slurm: mpiexec/mpirun vs. srun
Prentice Bisbal
2017-12-18 22:18:53 UTC
Greetings, OpenMPI users and devs!

We use OpenMPI with Slurm as our scheduler, and a user has asked me
this: should they use mpiexec/mpirun or srun to start their MPI jobs
through Slurm?

My inclination is to use mpiexec, since that is the only method that's
(somewhat) defined in the MPI standard and therefore the most portable,
and the examples in the OpenMPI FAQ use mpirun. However, the Slurm
documentation on the SchedMD website says to use srun with the
--mpi=pmi2 option. (See links below)

What are the pros/cons of using these two methods, other than the
portability issue I already mentioned? Does srun+pmi use a different
method to wire up the connections? Some things I read online seem to
indicate that. If Slurm was built with PMI support, and OpenMPI was
built with Slurm support, does it really make any difference?

https://www.open-mpi.org/faq/?category=slurm
https://slurm.schedmd.com/mpi_guide.html#open_mpi
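
For concreteness, here is roughly what the two launch styles look like in a
batch script (just a sketch: the task count, module name, and binary are
placeholders, not our actual setup):

    #!/bin/bash
    #SBATCH --job-name=mpi_test
    #SBATCH --ntasks=64
    #SBATCH --time=01:00:00

    module load openmpi          # however OpenMPI is provided at your site

    # Option 1: OpenMPI's launcher reads the Slurm allocation itself
    mpiexec ./my_mpi_app

    # Option 2: Slurm launches the ranks directly via its PMI support
    srun --mpi=pmi2 ./my_mpi_app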
--
Prentice
r***@open-mpi.org
2017-12-19 01:12:10 UTC
We have had reports of applications running faster when executing under OMPI’s mpiexec versus when started by srun. Reasons aren’t entirely clear, but are likely related to differences in mapping/binding options (OMPI provides a very large range compared to srun) and optimization flags provided by mpiexec that are specific to OMPI.

OMPI uses PMIx for wireup support (starting with the v2.x series), which provides a faster startup than other PMI implementations. However, that is also available with Slurm starting with the 16.05 release, and some further PMIx-based launch optimizations were recently added to the Slurm 17.11 release. So I would expect that launch via srun with the latest Slurm release and PMIx would be faster than mpiexec - though that still leaves the faster execution reports to consider.
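
If you want to see which PMI plugins your particular Slurm build supports, srun will tell you; for example (the application name is just a placeholder):

    srun --mpi=list                 # lists the PMI plugin types this Slurm build knows about
    srun --mpi=pmix ./my_mpi_app    # only valid if "pmix" shows up in that list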

HTH
Ralph
Prentice Bisbal
2017-12-19 14:53:26 UTC
Ralph,

Thank you very much for your response. I'll pass this along to my
users. Sounds like we might need to do some testing of our own. We're
still using Slurm 15.08, but planning to upgrade to 17.11 soon, so it
sounds like we'll get some performance benefits from doing so.

Prentice
Charles A Taylor
2017-12-19 16:46:20 UTC
Hi All,

I’m glad to see this come up. We’ve used OpenMPI for a long time and switched to SLURM (from torque+moab) about 2.5 years ago. At the time, I had a lot of questions about running MPI jobs under SLURM and good information seemed to be scarce - especially regarding “srun”. I’ll just briefly share my/our observations. For those who are interested, there are examples of our suggested submission scripts at https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#MPI_job (as I type this I’m hoping that page is up-to-date). Feel free to comment or make suggestions if you have had different experiences or know better (very possible).

1. We initially ignored srun since mpiexec _seemed_ to work fine (more below).

2. We soon started to get user complaints of MPI apps running at 1/2 to 1/3 of their expected or previously observed speeds - but only sporadically - meaning that sometimes the same job, submitted the same way, would run at full speed and sometimes at 1/2 or 1/3 (almost exactly) speed.

Investigation showed that some MPI ranks in the job were time-slicing across one or more of the cores allocated by SLURM. It turns out that if the SLURM allocation is not consistent with the default OMPI core/socket mapping, this can easily happen. It can be avoided by a) using “srun --mpi=pmi2” or, as of 2.x, “srun --mpi=pmix”, or b) more carefully crafting your SLURM resource request to be consistent with the OMPI default core/socket mapping.

So beware of resource requests that specify only the number of tasks (--ntasks=64) and then launch with “mpiexec”. Slurm will happily allocate those tasks anywhere it can (on a busy cluster) and you will get some very non-optimal core mappings/bindings and, possibly, core sharing.
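
As an example of option b) above, a request along these lines keeps the allocation consistent with what the launcher expects (a sketch only, assuming 32-core nodes and a pure-MPI code; adjust the geometry to your hardware):

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=32
    #SBATCH --cpus-per-task=1

    srun --mpi=pmi2 ./my_mpi_app
    # or, with the request spelled out this explicitly, mpiexec maps sensibly as well:
    mpiexec ./my_mpi_app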

3. While doing some spank development for a local, per-job (not per step) temporary directory, I noticed that when launching multi-host MPI jobs with mpiexec vs srun, you end up with more than one host with “slurm_nodeid=1”. I’m not sure if this is a bug (it was 15.08.x) or not, and it didn’t seem to cause issues, but I also don’t think that it is ideal for two nodes in the same job to have the same numeric nodeid. When launching with “srun”, that didn’t happen.

Anyway, that is what we have observed. Generally speaking, I try to get users to use “srun” but many of them still use “mpiexec” out of habit. You know what they say about old habits.

Comments, suggestions, or just other experiences are welcome. Also, if anyone is interested in the tmpdir spank plugin, you can contact me. We are happy to share.

Best and Merry Christmas to all,

Charlie Taylor
UF Research Computing
r***@open-mpi.org
2017-12-19 17:05:49 UTC
Post by Charles A Taylor
Investigation showed that some MPI ranks in the job were time-slicing across one or more of the cores allocated by SLURM. It turns out that if the SLURM allocation is not consistent with the default OMPI core/socket mapping, this can easily happen. It can be avoided by a) using “srun --mpi=pmi2” or, as of 2.x, “srun --mpi=pmix”, or b) more carefully crafting your SLURM resource request to be consistent with the OMPI default core/socket mapping.
Or one could tell OMPI to do what you really want it to do using map-by and bind-to options, perhaps putting them in the default MCA param file.
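
For example (a sketch; use ompi_info to confirm the parameter names in your OMPI release):

    # On the command line:
    mpiexec --map-by core --bind-to core ./my_mpi_app

    # Or once, in the default MCA param file
    # ($HOME/.openmpi/mca-params.conf or <prefix>/etc/openmpi-mca-params.conf):
    rmaps_base_mapping_policy = core
    hwloc_base_binding_policy = core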

Or you could enable cgroups in slurm so that OMPI sees the binding envelope - it will respect it. The problem is that OMPI isn’t seeing the requested binding envelope and thinks resources are available that really aren’t, and so it gets confused about how to map things. Slurm expresses that envelope in an envar, but the name and syntax keep changing over the releases, and we just can’t track it all the time.
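
For anyone who wants to go that route, the usual settings look something like this (sketch only; check the slurm.conf and cgroup.conf man pages for your Slurm release):

    # slurm.conf
    TaskPlugin=task/cgroup

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes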

However, I agree that it can be a problem if Slurm is allocating resources in a non-HPC manner (i.e., not colocating allocations to maximize performance) and you just want to use the default mpiexec options. We only see that when someone configures slurm to not allocate nodes to single users, which is not the normal HPC mode of operation.

So if you are going to configure slurm to operate in the “cloud” mode of allocating individual processor assets, then yes - probably better to use srun instead of the default mpiexec options, or add some directives to the default MCA param file.
Post by Charles A Taylor
3. While doing some spank development for a local, per-job (not per step) temporary directory, I noticed that when launching multi-host MPI jobs with mpiexec vs srun, you end up with more than one host with “slurm_nodeid=1”. I’m not sure if this is a bug (it was 15.08.x) or not, and it didn’t seem to cause issues, but I also don’t think that it is ideal for two nodes in the same job to have the same numeric nodeid. When launching with “srun”, that didn’t happen.
I’m not sure what “slurm_nodeid” is - where does this come from?
Post by Charles A Taylor
Anyway, that is what we have observed. Generally speaking, I try to get users to use “srun” but many of them still use “mpiexec” out of habit. You know what they say about old habits.
Again, it truly depends on how things are configured, if the users are using scripts that need to port to other environments, etc.
Charles A Taylor
2017-12-19 17:43:57 UTC
Post by r***@open-mpi.org
Or one could tell OMPI to do what you really want it to do using map-by and bind-to options, perhaps putting them in the default MCA param file.
Nod. Agreed, but far too complicated for 98% of our users.
Post by r***@open-mpi.org
Or you could enable cgroups in slurm so that OMPI sees the binding envelope - it will respect it.
We’ve configured cgroups from the beginning.
Post by r***@open-mpi.org
The problem is that OMPI isn’t seeing the requested binding envelope and thinks resources are available that really aren’t, and so it gets confused about how to map things. Slurm expresses that envelope in an envar, but the name and syntax keep changing over the releases, and we just can’t track it all the time.
Understood.
Post by r***@open-mpi.org
I’m not sure what “slurm_nodeid” is - where does this come from?
Sorry, it was S_JOB_NODEID from spank.h. I ended up changing my approach to the tmpdir creation because of this and the fact that the job’s UID/GID were not available in the SPANK routine where I needed them. I would hope that this maps to the exported env variable SLURM_NODEID, but I don’t know that for sure.
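
For what it's worth, a quick way to eyeball the exported value per node is something like this (one task per node just to keep the output short):

    srun --ntasks-per-node=1 bash -c 'echo "$(hostname): SLURM_NODEID=$SLURM_NODEID"'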

Thanks for the feedback,

Charlie
