Sorry Ralph, I have to respectfully disagree with you on this one. I believe the output below shows that the issue is that the two MPIs launch processes differently. On one node, I ran:
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 4 --bynode sleep 300
And then checked the process tree on the remote node:
[***@node1 mpi]$ ps -fu brent
UID PID PPID C STIME TTY TIME CMD
brent 1709 1706 0 10:00 ? 00:00:00 /mnt/node1/home/brent/bin/openmpi143/bin/orted -mca
brent 1712 1709 0 10:00 ? 00:00:00 sleep 300
brent 1713 1709 0 10:00 ? 00:00:00 sleep 300
brent 1714 18458 0 10:00 pts/0 00:00:00 ps -fu brent
brent 13282 13281 0 Feb17 pts/0 00:00:00 -bash
brent 18458 13282 0 Feb23 pts/0 00:00:00 -csh
[***@node1 mpi]$ ps -fp 1706
UID PID PPID C STIME TTY TIME CMD
root 1706 1 0 10:00 ? 00:00:00 slurmstepd: [29.9]
[***@node1 mpi]$
Note that the parent of the sleep processes is orted, and that orted was started by slurmstepd. Unless orted updates the SLURM variables for its children (which I doubt), they will not contain the per-task settings that I see when I run srun directly. I launch with HP-MPI like this:
[***@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 sleep 300
I then see the following in the process tree on the remote node:
[***@node1 mpi]$ ps -fu brent
UID PID PPID C STIME TTY TIME CMD
brent 1741 1738 0 10:02 ? 00:00:00 /bin/sleep 300
brent 1742 1738 0 10:02 ? 00:00:00 /bin/sleep 300
brent 1745 18458 0 10:02 pts/0 00:00:00 ps -fu brent
brent 13282 13281 0 Feb17 pts/0 00:00:00 -bash
brent 18458 13282 0 Feb23 pts/0 00:00:00 -csh
[***@node1 mpi]$ ps -fp 1738
UID PID PPID C STIME TTY TIME CMD
root 1738 1 0 10:02 ? 00:00:00 slurmstepd: [29.10]
[***@node1 mpi]$
Since the parent of both sleep processes is slurmstepd, it is setting things up as I would expect. This lineage is the same as what I see when running srun directly.
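(A quick way to double-check the lineage on the remote node, assuming a standard procps ps is available there -- this prints each of my processes alongside its parent's command:
[***@node2 mpi]$ ssh node1 'ps -u brent -o pid=,ppid=,comm= | while read pid ppid comm; do echo "$comm <- $(ps -o comm= -p $ppid)"; done'
With OpenMPI the sleeps come back as "sleep <- orted"; with HP-MPI they come back as "sleep <- slurmstepd".)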
Now, the question remains: why does this work for Jeff? :) Is there a way to get orted out of the way so the sleep processes are launched directly by srun?
brent
From: users-***@open-mpi.org [mailto:users-***@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, February 24, 2011 10:05 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime
I would talk to the slurm folks about it - I don't know anything about the internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't do anything with respect to the envars. We just use "srun -hostlist <fff>" to launch the daemons. Each daemon subsequently gets a message telling it what local procs to run, and then fork/exec's those procs. The environment set for those procs is a copy of that given to the daemon, including any and all slurm values.
So whatever slurm sets, your procs get.
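In shell terms, the launch is roughly this (a simplified sketch, not the literal ORTE code path; <N> and the daemon arguments are placeholders):
# on the head node, mpirun starts one daemon per allocated node:
srun --nodes=<N> orted <daemon-args>
# each orted then fork/exec's its local ranks, handing every rank a
# verbatim copy of orted's own environment, SLURM_* values included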
My guess is that HP-MPI is doing something with the envars to create the difference.
As for running OMPI procs directly from srun: the slurm folks put out a faq (or its equivalent) on it, I believe. I don't recall the details (even though I wrote the integration...). If you google our user and/or devel mailing lists, though, you'll see threads discussing it. Look for "slurmd" in the text - that's the ORTE integration module for that feature.
On Thu, Feb 24, 2011 at 8:55 AM, Henderson, Brent <***@hp.com> wrote:
I'm running OpenMPI v1.4.3 and slurm v2.2.1. I built both with the default configuration except for setting the prefix. The tests were run on the exact same nodes (I only have two).
When I run the test you outline below, I am still missing a bunch of env variables with OpenMPI. I ran the extra test with HP-MPI, and the variables are all present, just as with the srun invocation. I don't know if this is my slurm setup or not, but I find this really weird. If anyone knows the magic to make the fix that Ralph is referring to, I'd appreciate a pointer.
My guess was that the launch differs in some subtle way between the two products. But since it works for Jeff, maybe there really is a slurm option that I need to compile in or set to make this work the way I want. It is not as simple as HP-MPI copying the environment variables over itself, since some of the values change per process created on the remote nodes.
Thanks,
Brent
[***@node2 mpi]$ salloc -N 2
salloc: Granted job allocation 29
[***@node2 mpi]$ srun env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
[***@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
66
[***@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=8(x2)
SLURM_JOB_ID=29
SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
SLURM_JOB_NODELIST=node[1-2]
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_JOB_NUM_NODES=2
SLURM_NODELIST=node[1-2]
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
42 <-- note, not 66 as above!
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
[***@node2 mpi]$ diff srun.out mpirun.out
2d1
< SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
4,5d2
< SLURM_CPUS_ON_NODE=8
< SLURM_CPUS_PER_TASK=1
8d4
< SLURM_DISTRIBUTION=cyclic
10d5
< SLURM_GTIDS=1
22,23d16
< SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
< SLURM_LOCALID=0
25c18
< SLURM_NNODES=2
---
28d20
< SLURM_NODEID=1
31,35c23,24
< SLURM_NPROCS=2
< SLURM_NPROCS=2
< SLURM_NTASKS=2
< SLURM_NTASKS=2
< SLURM_PRIO_PROCESS=0
---
> SLURM_NPROCS=1
> SLURM_NTASKS=1
38d26
< SLURM_PROCID=1
40,56c28,35
< SLURM_SRUN_COMM_HOST=10.0.205.134
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_TASKS_PER_NODE=1(x2)
< SLURM_STEP_TASKS_PER_NODE=1(x2)
---
> SLURM_SRUN_COMM_PORT=45154
> SLURM_STEP_ID=5
> SLURM_STEPID=5
> SLURM_STEP_LAUNCHER_PORT=45154
> SLURM_STEP_NODELIST=node1
> SLURM_STEP_NUM_NODES=1
> SLURM_STEP_NUM_TASKS=1
> SLURM_STEP_TASKS_PER_NODE=1
59,62c38,40
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
< SLURM_TASKS_PER_NODE=1(x2)
< SLURM_TASKS_PER_NODE=1(x2)
---
> SLURM_TASK_PID=1429
> SLURM_TASKS_PER_NODE=1
> SLURM_TASKS_PER_NODE=8(x2)
64,65d41
< SLURM_TOPOLOGY_ADDR=node2
< SLURM_TOPOLOGY_ADDR_PATTERN=node
[***@node2 mpi]$
[***@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
[***@node2 mpi]$ diff srun.out hpmpi.out
20a21,22
> SLURM_KILL_BAD_EXIT=1
> SLURM_KILL_BAD_EXIT=1
41,48c43,50
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
---
> SLURM_SRUN_COMM_PORT=33347
> SLURM_SRUN_COMM_PORT=33347
> SLURM_STEP_ID=8
> SLURM_STEP_ID=8
> SLURM_STEPID=8
> SLURM_STEPID=8
> SLURM_STEP_LAUNCHER_PORT=33347
> SLURM_STEP_LAUNCHER_PORT=33347
59,60c61,62
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
---
> SLURM_TASK_PID=1592
> SLURM_TASK_PID=2590
[***@node2 mpi]$
[***@node2 mpi]$ grep SLURM_PROCID srun.out
SLURM_PROCID=0
SLURM_PROCID=1
[***@node2 mpi]$ grep SLURM_PROCID mpirun.out
SLURM_PROCID=0
[***@node2 mpi]$ grep SLURM_PROCID hpmpi.out
SLURM_PROCID=0
SLURM_PROCID=1
-----Original Message-----
From: users-***@open-mpi.org [mailto:users-***@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Thursday, February 24, 2011 9:31 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime
The weird thing is that when running his test, he saw different results
with HP MPI vs. Open MPI.
What his test didn't say was whether those were the same exact nodes or
not. It would be good to repeat my experiment with the same exact
nodes (e.g., inside one SLURM salloc job, or use the -w param to
specify the same nodes for salloc for OMPI and srun for HP MPI).
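Something like this should do it (the node names are just examples):
$ salloc -N 2 -w node1,node2
$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
$ /opt/hpmpi/bin/mpirun -srun -n 2 env | egrep ^SLURM_ | sort > hpmpi.out
That way both MPIs run inside the same allocation, on the same two nodes.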
Ralph Castain wrote:
Like I said, this isn't an OMPI problem. You have your slurm
configured to pass certain envars to the remote nodes, and Brent
doesn't. It truly is just that simple.
I've seen this before with other slurm installations. Which envars
get set on the backend is configurable, that's all.
Has nothing to do with OMPI.
$ salloc -N 4
$ srun env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
SLURM_PRIO_PROCESS=0
SLURM_UMASK=0002
$ srun env | egrep ^SLURM_ | wc -l
144
Good -- there are 144 of them. Let's save them to a file for later comparison.
$ srun env | egrep ^SLURM_ | sort > srun.out
Now let's repeat the process with mpirun. Note that mpirun defaults
to running one process per core (vs. srun's default of running one per
node). So let's tone mpirun down to use one process per node and look
for the SLURM_ env variables.
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144
Good -- we also got 144. Save them to a file.
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
$ diff srun.out mpirun.out
93,108c93,108
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
---
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
125,128c125,128
< SLURM_TASK_PID=3899
< SLURM_TASK_PID=3907
< SLURM_TASK_PID=3908
< SLURM_TASK_PID=3997
---
> SLURM_TASK_PID=3924
> SLURM_TASK_PID=3933
> SLURM_TASK_PID=3934
> SLURM_TASK_PID=4039
$
They're identical except for per-step values (ports, PIDs, etc.) --
these differences are expected.
What version of OMPI are you running? What happens if you repeat this experiment?
I would find it very strange if Open MPI's mpirun is filtering some
SLURM env variables to some processes and not to all -- your output
shows disparate output between the different processes. That's just
plain weird.
SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
SLURM_LOCALID=0
SLURM_LOCALID=0
SLURM_LOCALID=1
SLURM_LOCALID=1
SLURM_NODEID=0
SLURM_NODEID=0
SLURM_NODEID=1
SLURM_NODEID=1
SLURM_PROCID=0
SLURM_PROCID=1
SLURM_PROCID=2
SLURM_PROCID=3
Since srun is not currently supported by OpenMPI, I have to use
salloc - right? In this case, it is up to OpenMPI to interpret the
SLURM environment variables it sees in the one process that is launched
and 'do the right thing' - whatever that means in this case. How does
OpenMPI start the processes on the remote nodes under the covers? (use
srun, generate a hostfile and launch as you would outside SLURM, ...)
This may be the difference between HP-MPI and OpenMPI.
From: users-***@open-mpi.org [mailto:users-***@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, February 23, 2011 10:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime
Resource managers generally frown on the idea of any program
passing RM-managed envars from one node to another, and this is
certainly true of slurm. The reason is that the RM reserves those
values for its own use when managing remote nodes. For example, if you
got an allocation and then used mpirun to launch a job across only a
portion of that allocation, and then ran another mpirun instance in
parallel on the remainder of the nodes, the slurm envars for those two
mpirun instances -need- to be quite different. Having mpirun forward
the values it sees would cause the system to become very confused.
We learned the hard way never to cross that line :-(
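For instance (hypothetical layout, inside one 4-node allocation):
mpirun -np 2 --host node1,node2 ./a.out &
mpirun -np 2 --host node3,node4 ./b.out &
Each of those is a separate slurm step, so the step-level envars (step id, comm ports, tasks-per-node, ...) must differ between the two -- forwarding mpirun's own copy would hand both jobs the same, wrong values.
That said, you have a couple of options: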
(a) you could get your sys admin to configure slurm correctly to
provide your desired envars on the remote nodes. This is the
recommended (by slurm and other RMs) way of getting what you requested.
It is a simple configuration option - if he needs help, he should
contact the slurm mailing list.
(b) you can ask mpirun to do so, at your own risk. Specify each
parameter with a "-x FOO" argument. See "man mpirun" for details. Keep
an eye out for aberrant behavior.
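For example (MY_APP_CONFIG is a hypothetical variable name; the forwarded value is whatever mpirun sees on the launch node, so don't expect per-task values to be correct):
mpirun -x MY_APP_CONFIG -x SLURM_JOB_ID -np 4 ./a.out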
Ralph
On Wed, Feb 23, 2011 at 8:38 AM, Henderson, Brent wrote:
Hi Everyone, I have an OpenMPI/SLURM specific question,
I'm using MPI as a launcher for another application I'm working on
and it is dependent on the SLURM environment variables making their way
into the a.out's environment. This works as I need if I use HP-
MPI/PMPI, but when I use OpenMPI, it appears that not all are set as I
would like across all of the ranks.
I have example output below from a simple a.out that just writes
out the environment that it sees to a file whose name is based on the
node name and rank number. Note that with OpenMPI, things like
SLURM_NNODES and SLURM_TASKS_PER_NODE are not set the same for ranks on
the different nodes and things like SLURM_LOCALID are just missing
entirely.
So the question is, should the environment variables on the remote
nodes (from the perspective of where the job is launched) have the full
set of SLURM environment variables as seen on the launching node?
Thanks,
Brent Henderson
salloc: Granted job allocation 23
Hello world! I'm 3 of 4 on node1
Hello world! I'm 2 of 4 on node1
Hello world! I'm 1 of 4 on node2
Hello world! I'm 0 of 4 on node2
salloc: Relinquishing job allocation 23
'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER'
openmpi/node1.3.of.4
SLURM_JOB_NODELIST=node[1-2]
SLURM_NNODES=1
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=1
SLURM_NPROCS=1
SLURM_STEP_NODELIST=node1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER'
openmpi/node2.1.of.4
SLURM_JOB_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=2(x2)
SLURM_NPROCS=4
./printenv.hpmpi
Hello world! I'm 2 of 4 on node2
Hello world! I'm 3 of 4 on node2
Hello world! I'm 0 of 4 on node1
Hello world! I'm 1 of 4 on node1
'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER'
hpmpi/node1.1.of.4
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=2(x2)
SLURM_STEP_NODELIST=node[1-2]
SLURM_STEP_TASKS_PER_NODE=2(x2)
SLURM_NNODES=2
SLURM_NPROCS=4
SLURM_NODEID=0
SLURM_PROCID=1
SLURM_LOCALID=1
'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER'
hpmpi/node2.3.of.4
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=2(x2)
SLURM_STEP_NODELIST=node[1-2]
SLURM_STEP_TASKS_PER_NODE=2(x2)
SLURM_NNODES=2
SLURM_NPROCS=4
SLURM_NODEID=1
SLURM_PROCID=3
SLURM_LOCALID=1