Hi All,
I'm glad to see this come up. We've used OpenMPI for a long time and switched to SLURM (from torque+moab) about 2.5 years ago. At the time, I had a lot of questions about running MPI jobs under SLURM and good information seemed to be scarce - especially regarding "srun". I'll just briefly share my/our observations. For those who are interested, there are examples of our suggested submission scripts at https://help.rc.ufl.edu/doc/Sample_SLURM_Scripts#MPI_job (as I type this I'm hoping that page is up-to-date). Feel free to comment or make suggestions if you have had different experiences or know better (very possible).
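For context, a minimal sketch along the lines of the scripts on that page (module names, core counts, and the application name below are placeholders, not our exact script):

    #!/bin/bash
    #SBATCH --job-name=mpi_test          # placeholder job name
    #SBATCH --nodes=2                    # request whole nodes
    #SBATCH --ntasks-per-node=32         # ranks per node; match your hardware
    #SBATCH --time=01:00:00
    #SBATCH --output=mpi_test_%j.log

    module load gcc openmpi              # module names are site-specific

    # Launch through SLURM's PMI2 plugin rather than mpiexec
    srun --mpi=pmi2 ./my_mpi_app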
1. We initially ignored srun since mpiexec _seemed_ to work fine (more below).
2. We soon started to get user complaints of MPI apps running at 1/2 to 1/3 of their expected or previously observed speeds - but only sporadically, meaning that sometimes the same job, submitted the same way, would run at full speed and sometimes at (almost exactly) 1/2 or 1/3 speed.
Investigation showed that some MPI ranks in the job were time-slicing across one or more of the cores allocated by SLURM. It turns out that if the SLURM allocation is not consistent with the default OMPI core/socket mapping, this can easily happen. It can be avoided by a) using "srun --mpi=pmi2" (or, as of the 2.x series, "srun --mpi=pmix"), or b) more carefully crafting your SLURM resource request to be consistent with the OMPI default core/socket mapping.
So beware of resource requests that specify only the number of tasks (--ntasks=64) and then launch with "mpiexec". Slurm will happily allocate those tasks anywhere it can (on a busy cluster) and you will get some very non-optimal core mappings/bindings and, possibly, core sharing.
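To illustrate the difference, something along these lines (the node/core counts assume 32-core nodes and are only an example, not a recommendation for any particular cluster):

    # Risky on a busy cluster: 64 tasks can be scattered anywhere, and
    # OMPI's default mapping may not match what SLURM actually allocated.
    #SBATCH --ntasks=64
    mpiexec ./my_mpi_app

    # Safer: make the request explicit about nodes and tasks-per-node and
    # let srun place and bind the ranks via PMI2.
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=32
    srun --mpi=pmi2 ./my_mpi_app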
3. While doing some spank development for a local, per-job (not per-step) temporary directory, I noticed that when launching multi-host MPI jobs with mpiexec (as opposed to srun), you end up with more than one host with "slurm_nodeid=1". I'm not sure whether this is a bug (it was 15.08.x) or not, and it didn't seem to cause issues, but I also don't think it is ideal for two nodes in the same job to have the same numeric nodeid. When launching with "srun", that didn't happen.
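If anyone wants to poke at this themselves, a rough check from inside a two-node allocation might look like the following (just a sketch, not what our spank plugin does; exact output will vary by site and version):

    # Print the hostname and SLURM_NODEID once per node under each launcher
    srun --ntasks-per-node=1 bash -c 'echo "$(hostname) SLURM_NODEID=$SLURM_NODEID"'
    mpiexec -npernode 1 bash -c 'echo "$(hostname) SLURM_NODEID=$SLURM_NODEID"'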
Anyway, that is what we have observed. Generally speaking, I try to get users to use "srun", but many of them still use "mpiexec" out of habit. You know what they say about old habits.
Comments, suggestions, or just other experiences are welcome. Also, if anyone is interested in the tmpdir spank plugin, you can contact me. We are happy to share.
Best and Merry Christmas to all,
Charlie Taylor
UF Research Computing
We have had reports of applications running faster when executing under OMPI's mpiexec versus when started by srun. Reasons aren't entirely clear, but are likely related to differences in mapping/binding options (OMPI provides a very large range compared to srun) and optimization flags provided by mpiexec that are specific to OMPI.
OMPI uses PMIx for wireup support (starting with the v2.x series), which provides a faster startup than other PMI implementations. However, that is also available with Slurm starting with the 16.05 release, and some further PMIx-based launch optimizations were recently added to the Slurm 17.11 release. So I would expect that launch via srun with the latest Slurm release and PMIx would be faster than mpiexec - though that still leaves the faster execution reports to consider.
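For a concrete comparison, the two launch styles look like this (the binding flags are only illustrative choices, not a tuning recommendation):

    # Slurm 16.05 or later, built with PMIx support
    srun --mpi=pmix ./my_mpi_app

    # OMPI's mpiexec, which exposes a much richer set of mapping/binding
    # options, e.g. map ranks round-robin by socket and bind each to a core
    mpiexec --map-by socket --bind-to core ./my_mpi_app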
HTH
Ralph
Post by Prentice Bisbal:
Greetings, OpenMPI users and devs!
We use OpenMPI with Slurm as our scheduler, and a user has asked me this: should they use mpiexec/mpirun or srun to start their MPI jobs through Slurm?
My inclination is to use mpiexec, since that is the only method that's (somewhat) defined in the MPI standard and therefore the most portable, and the examples in the OpenMPI FAQ use mpirun. However, the Slurm documentation on the schedmd website says to use srun with the --mpi=pmi option. (See links below.)
What are the pros/cons of using these two methods, other than the portability issue I already mentioned? Does srun+pmi use a different method to wire up the connections? Some things I read online seem to indicate that. If Slurm was built with PMI support, and OpenMPI was built with Slurm support, does it really make any difference?
https://www.open-mpi.org/faq/?category=slurm
https://slurm.schedmd.com/mpi_guide.html#open_mpi
--
Prentice