Discussion: [OMPI users] SLURM environment variables at runtime
Henderson, Brent
2011-02-23 15:38:35 UTC
Hi Everyone, I have an OpenMPI/SLURM-specific question.

I'm using MPI as a launcher for another application I'm working on, and it depends on the SLURM environment variables making their way into the a.out's environment. This works as I need it to if I use HP-MPI/PMPI, but when I use OpenMPI, it appears that not all of them are set as I would like across all of the ranks.

I have example output below from a simple a.out that just writes out the environment it sees to a file whose name is based on the node name and rank number. Note that with OpenMPI, things like SLURM_NNODES and SLURM_TASKS_PER_NODE are not set the same for ranks on different nodes, and things like SLURM_LOCALID are missing entirely.
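
For reference, the test program is essentially just a per-rank env dump. A shell stand-in would look roughly like this (a sketch only; OMPI_COMM_WORLD_RANK/SIZE and SLURM_PROCID/SLURM_NPROCS are assumptions about which rank variables the launcher exports):

#!/bin/sh
# Sketch of the per-rank env dump: write the environment to a file named
# <hostname>.<rank>.of.<size>, falling back to the SLURM task variables
# if the Open MPI ones are not present.
rank=${OMPI_COMM_WORLD_RANK:-$SLURM_PROCID}
size=${OMPI_COMM_WORLD_SIZE:-$SLURM_NPROCS}
env | sort > "$(hostname).$rank.of.$size"
echo "Hello world! I'm $rank of $size on $(hostname)"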

So the question is, should the environment variables on the remote nodes (from the perspective of where the job is launched) have the full set of SLURM environment variables as seen on the launching node?

Thanks,

Brent Henderson

[***@node2 mpi]$ rm node*
[***@node2 mpi]$ mkdir openmpi hpmpi
[***@node2 mpi]$ salloc -N 2 -n 4 mpirun ./printenv.openmpi
salloc: Granted job allocation 23
Hello world! I'm 3 of 4 on node1
Hello world! I'm 2 of 4 on node1
Hello world! I'm 1 of 4 on node2
Hello world! I'm 0 of 4 on node2
salloc: Relinquishing job allocation 23
[***@node2 mpi]$ mv node* openmpi/
[***@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node1.3.of.4
SLURM_JOB_NODELIST=node[1-2]
SLURM_NNODES=1
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=1
SLURM_NPROCS=1
SLURM_STEP_NODELIST=node1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
[***@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' openmpi/node2.1.of.4
SLURM_JOB_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=2(x2)
SLURM_NPROCS=4
[***@node2 mpi]$


[***@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 ./printenv.hpmpi
Hello world! I'm 2 of 4 on node2
Hello world! I'm 3 of 4 on node2
Hello world! I'm 0 of 4 on node1
Hello world! I'm 1 of 4 on node1
[***@node2 mpi]$ mv node* hpmpi/
[***@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node1.1.of.4
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=2(x2)
SLURM_STEP_NODELIST=node[1-2]
SLURM_STEP_TASKS_PER_NODE=2(x2)
SLURM_NNODES=2
SLURM_NPROCS=4
SLURM_NODEID=0
SLURM_PROCID=1
SLURM_LOCALID=1
[***@node2 mpi]$ egrep 'NODEID|NNODES|LOCALID|NODELIST|NPROCS|PROCID|TASKS_PER' hpmpi/node2.3.of.4
SLURM_NODELIST=node[1-2]
SLURM_TASKS_PER_NODE=2(x2)
SLURM_STEP_NODELIST=node[1-2]
SLURM_STEP_TASKS_PER_NODE=2(x2)
SLURM_NNODES=2
SLURM_NPROCS=4
SLURM_NODEID=1
SLURM_PROCID=3
SLURM_LOCALID=1
[***@node2 mpi]$
Ralph Castain
2011-02-23 16:07:05 UTC
Resource managers generally frown on the idea of any program passing
RM-managed envars from one node to another, and this is certainly true of
slurm. The reason is that the RM reserves those values for its own use when
managing remote nodes. For example, if you got an allocation and then used
mpirun to launch a job across only a portion of that allocation, and then
ran another mpirun instance in parallel on the remainder of the nodes, the
slurm envars for those two mpirun instances -need- to be quite different.
Having mpirun forward the values it sees would cause the system to become
very confused.

We learned the hard way never to cross that line :-(

You have two options:

(a) you could get your sys admin to configure slurm correctly to provide your desired envars on the remote nodes. This is the recommended (by slurm and other RMs) way of getting what you requested. It is a simple configuration option - if he needs help, he should contact the slurm mailing list.

(b) you can ask mpirun to do so, at your own risk. Specify each parameter
with a "-x FOO" argument. See "man mpirun" for details. Keep an eye out for
aberrant behavior.
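
For example (a sketch only -- which variables you forward depends on what your application needs):

salloc -N 2 -n 4 mpirun -x SLURM_NNODES -x SLURM_NPROCS -x SLURM_TASKS_PER_NODE ./printenv.openmpi

Keep in mind that -x forwards the value mpirun sees on the launch node, so per-rank values such as SLURM_PROCID or SLURM_LOCALID would come out identical on every rank.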

Ralph
Henderson, Brent
2011-02-23 17:05:46 UTC
SLURM seems to be doing this in the case of a regular srun:

[***@node1 mpi]$ srun -N 2 -n 4 env | egrep SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
SLURM_LOCALID=0
SLURM_LOCALID=0
SLURM_LOCALID=1
SLURM_LOCALID=1
SLURM_NODEID=0
SLURM_NODEID=0
SLURM_NODEID=1
SLURM_NODEID=1
SLURM_PROCID=0
SLURM_PROCID=1
SLURM_PROCID=2
SLURM_PROCID=3
[***@node1 mpi]$

Since srun is not supported currently by OpenMPI, I have to use salloc - right? In this case, it is up to OpenMPI to interpret the SLURM environment variables it sees in the one process that is launched and 'do the right thing' - whatever that means in this case. How does OpenMPI start the processes on the remote nodes under the covers? (use srun, generate a hostfile and launch as you would outside SLURM, ...) This may be the difference between HP-MPI and OpenMPI.

Thanks,

Brent


Ralph Castain
2011-02-23 17:39:14 UTC
We use srun internally to start the remote daemons. We construct a
nodelist from the user-specified inputs, and pass that to srun so it
knows where to start the daemons.
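
Conceptually the launch is something like the following (a rough sketch, not the literal command line mpirun builds):

srun --nodes=2 --ntasks=2 --nodelist=node1,node2 orted <orted options ...>

Each orted daemon then fork/exec's the MPI ranks assigned to its node, so those ranks inherit the daemon's environment -- the SLURM_* values of the daemon's one-task-per-node step rather than per-rank values from slurm.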


Jeff Squyres
2011-02-24 14:18:38 UTC
I'm afraid I don't see the problem. Let's get 4 nodes from slurm:

$ salloc -N 4

Now let's run env and see what SLURM_ env variables we see:

$ srun env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
SLURM_PRIO_PROCESS=0
SLURM_UMASK=0002
$ srun env | egrep ^SLURM_ | wc -l
144

Good -- there's 144 of them. Let's save them to a file for comparison, later.

$ srun env | egrep ^SLURM_ | sort > srun.out

Now let's repeat the process with mpirun. Note that mpirun defaults to running one process per core (vs. srun's default of running one per node). So let's tone mpirun down to use one process per node and look for the SLURM_ env variables.

$ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144

Good -- we also got 144. Save them to a file.

$ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out

Now let's compare what we got from srun and from mpirun:

$ diff srun.out mpirun.out
93,108c93,108
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
---
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
125,128c125,128
< SLURM_TASK_PID=3899
< SLURM_TASK_PID=3907
< SLURM_TASK_PID=3908
< SLURM_TASK_PID=3997
---
> SLURM_TASK_PID=3924
> SLURM_TASK_PID=3933
> SLURM_TASK_PID=3934
> SLURM_TASK_PID=4039
$

They're identical except for per-step values (ports, PIDs, etc.) -- these differences are expected.

What version of OMPI are you running? What happens if you repeat this experiment?

I would find it very strange if Open MPI's mpirun is filtering some SLURM env variables to some processes and not to all -- your output shows different values for different processes. That's just plain weird.
Ralph Castain
2011-02-24 15:02:07 UTC
Like I said, this isn't an OMPI problem. You have your slurm configured to
pass certain envars to the remote nodes, and Brent doesn't. It truly is just
that simple.

I've seen this before with other slurm installations. Which envars get set
on the backend is configurable, that's all.

Has nothing to do with OMPI.
Jeff Squyres
2011-02-24 15:30:46 UTC
The weird thing is that when running his test, he saw different results with HP MPI vs. Open MPI.

What his test didn't say was whether those were the same exact nodes or not. It would be good to repeat my experiment with the same exact nodes (e.g., inside one SLURM salloc job, or use the -w param to specify the same nodes for salloc for OMPI and srun for HP MPI).
Ralph Castain
2011-02-24 15:41:06 UTC
Post by Jeff Squyres
The weird thing is that when running his test, he saw different results
with HP MPI vs. Open MPI.
It sounded quite likely that HP MPI is picking up and moving the envars
itself - that possibility was implied, but not clearly stated.
Post by Jeff Squyres
What his test didn't say was whether those were the same exact nodes or
not. It would be good to repeat my experiment with the same exact nodes
(e.g., inside one SLURM salloc job, or use the -w param to specify the same
nodes for salloc for OMPI and srun for HP MPI).
We should note that you -can- directly srun an OMPI job now. I believe that
capability was released in the 1.5 series. It takes a minimum slurm release
level plus a slurm configuration setting to do so.
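
If memory serves (treat the specifics below as assumptions and check the slurm and Open MPI docs), it needs a port range reserved for Open MPI in slurm.conf plus the --resv-ports flag on srun, roughly:

# in slurm.conf:
MpiParams=ports=12000-12999

# then, with Open MPI 1.5 or later:
srun --resv-ports -N 2 -n 4 ./a.out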
Henderson, Brent
2011-02-24 15:55:43 UTC
I'm running OpenMPI v1.4.3 and slurm v2.2.1. I built both with the default configuration except setting the prefix. The tests were run on the exact same nodes (I only have two).

When I run the test you outline below, I am still missing a bunch of env variables with OpenMPI. I ran the extra test of using HP-MPI and they are all present as with the srun invocation. I don't know if this is my slurm setup or not, but I find this really weird. If anyone knows the magic to make the fix that Ralph is referring to, I'd appreciate a pointer.

My guess was that there is some subtle difference in how the two products launch the job. But since it works for Jeff, maybe there really is a slurm option that I need to compile in or set to make this work the way I want. It is not as simple as HP-MPI forwarding the environment variables itself, since some of the values change per process created on the remote nodes.

Thanks,

Brent

[***@node2 mpi]$ salloc -N 2
salloc: Granted job allocation 29
[***@node2 mpi]$ srun env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
[***@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
66
[***@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=8(x2)
SLURM_JOB_ID=29
SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
SLURM_JOB_NODELIST=node[1-2]
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_JOB_NUM_NODES=2
SLURM_NODELIST=node[1-2]
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
42 <-- note, not 66 as above!
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
[***@node2 mpi]$ diff srun.out mpirun.out
2d1
< SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
4,5d2
< SLURM_CPUS_ON_NODE=8
< SLURM_CPUS_PER_TASK=1
8d4
< SLURM_DISTRIBUTION=cyclic
10d5
< SLURM_GTIDS=1
22,23d16
< SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
< SLURM_LOCALID=0
25c18
< SLURM_NNODES=2
---
> SLURM_NNODES=1
28d20
< SLURM_NODEID=1
31,35c23,24
< SLURM_NPROCS=2
< SLURM_NPROCS=2
< SLURM_NTASKS=2
< SLURM_NTASKS=2
< SLURM_PRIO_PROCESS=0
---
> SLURM_NPROCS=1
> SLURM_NTASKS=1
38d26
< SLURM_PROCID=1
40,56c28,35
< SLURM_SRUN_COMM_HOST=10.0.205.134
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_TASKS_PER_NODE=1(x2)
< SLURM_STEP_TASKS_PER_NODE=1(x2)
---
> SLURM_SRUN_COMM_PORT=45154
> SLURM_STEP_ID=5
> SLURM_STEPID=5
> SLURM_STEP_LAUNCHER_PORT=45154
> SLURM_STEP_NODELIST=node1
> SLURM_STEP_NUM_NODES=1
> SLURM_STEP_NUM_TASKS=1
> SLURM_STEP_TASKS_PER_NODE=1
59,62c38,40
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
< SLURM_TASKS_PER_NODE=1(x2)
< SLURM_TASKS_PER_NODE=1(x2)
---
> SLURM_TASK_PID=1429
> SLURM_TASKS_PER_NODE=1
> SLURM_TASKS_PER_NODE=8(x2)
64,65d41
< SLURM_TOPOLOGY_ADDR=node2
< SLURM_TOPOLOGY_ADDR_PATTERN=node
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
[***@node2 mpi]$ diff srun.out hpmpi.out
20a21,22
> SLURM_KILL_BAD_EXIT=1
> SLURM_KILL_BAD_EXIT=1
41,48c43,50
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
---
> SLURM_SRUN_COMM_PORT=33347
> SLURM_SRUN_COMM_PORT=33347
> SLURM_STEP_ID=8
> SLURM_STEP_ID=8
> SLURM_STEPID=8
> SLURM_STEPID=8
> SLURM_STEP_LAUNCHER_PORT=33347
> SLURM_STEP_LAUNCHER_PORT=33347
59,60c61,62
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
---
> SLURM_TASK_PID=1592
> SLURM_TASK_PID=2590
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$ grep SLURM_PROCID srun.out
SLURM_PROCID=0
SLURM_PROCID=1
[***@node2 mpi]$ grep SLURM_PROCID mpirun.out
SLURM_PROCID=0
[***@node2 mpi]$ grep SLURM_PROCID hpmpi.out
SLURM_PROCID=0
SLURM_PROCID=1
Ralph Castain
2011-02-24 16:05:17 UTC
Permalink
I would talk to the slurm folks about it - I don't know anything about the
internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't
do anything with respect to the envars. We just use "srun -hostlist <fff>"
to launch the daemons. Each daemon subsequently gets a message telling it
what local procs to run, and then fork/exec's those procs. The environment
set for those procs is a copy of that given to the daemon, including any and
all slurm values.

So whatever slurm sets, your procs get.
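To illustrate the point (this is just a toy sketch, not the actual ORTE
code): a daemon that fork/exec's its children without touching the
environment leaves every child with an identical copy of whatever the
daemon itself was given - all the SLURM_* values included.

/* Toy sketch only - not ORTE code.  A "daemon" that fork/exec's children
 * without touching the environment: every child sees an identical copy
 * of the daemon's environment, SLURM_* values and all. */
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    setenv("SLURM_PROCID", "0", 1);   /* pretend slurm handed the daemon this */

    for (int i = 0; i < 2; i++) {
        if (fork() == 0) {
            /* child: unless the parent had called setenv() per child,
             * every child reports the daemon's value, not a per-rank one */
            execlp("sh", "sh", "-c",
                   "echo \"child $$: SLURM_PROCID=$SLURM_PROCID\"", (char *)NULL);
            _exit(1);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}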

My guess is that HP-MPI is doing something with the envars to create the
difference.

As for running OMPI procs directly from srun: the slurm folks put out a faq
(or its equivalent) on it, I believe. I don't recall the details (even
though I wrote the integration...). If you google our user and/or devel
mailing lists, though, you'll see threads discussing it. Look for "slurmd"
in the text - that's the ORTE integration module for that feature.
Henderson, Brent
2011-02-24 16:15:38 UTC
Permalink
Sorry Ralph, I have to respectfully disagree with you on this one. I believe that the output below shows that the issue is that the two different MPIs launch things differently. On one node, I ran:

[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 4 --bynode sleep 300

And then checked the process tree on the remote node:

[***@node1 mpi]$ ps -fu brent
UID PID PPID C STIME TTY TIME CMD
brent 1709 1706 0 10:00 ? 00:00:00 /mnt/node1/home/brent/bin/openmpi143/bin/orted -mca
brent 1712 1709 0 10:00 ? 00:00:00 sleep 300
brent 1713 1709 0 10:00 ? 00:00:00 sleep 300
brent 1714 18458 0 10:00 pts/0 00:00:00 ps -fu brent
brent 13282 13281 0 Feb17 pts/0 00:00:00 -bash
brent 18458 13282 0 Feb23 pts/0 00:00:00 -csh
[***@node1 mpi]$ ps -fp 1706
UID PID PPID C STIME TTY TIME CMD
root 1706 1 0 10:00 ? 00:00:00 slurmstepd: [29.9]
[***@node1 mpi]$

Note that the parent of the sleep processes is orted, and that orted was started by slurmstepd. Unless orted is updating the slurm variables for the children (which is doubtful), they will not contain the specific settings that I see when I run srun directly. I launch with HP-MPI like this:

[***@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -N 2 -n 4 sleep 300

I then see the following in the process tree on the remote node:

[***@node1 mpi]$ ps -fu brent
UID PID PPID C STIME TTY TIME CMD
brent 1741 1738 0 10:02 ? 00:00:00 /bin/sleep 300
brent 1742 1738 0 10:02 ? 00:00:00 /bin/sleep 300
brent 1745 18458 0 10:02 pts/0 00:00:00 ps -fu brent
brent 13282 13281 0 Feb17 pts/0 00:00:00 -bash
brent 18458 13282 0 Feb23 pts/0 00:00:00 -csh
[***@node1 mpi]$ ps -fp 1738
UID PID PPID C STIME TTY TIME CMD
root 1738 1 0 10:02 ? 00:00:00 slurmstepd: [29.10]
[***@node1 mpi]$

Since the parent of both of the sleep processes is slurmstepd, it is setting things up as I would expect. This lineage is the same as I find by running srun directly.

Now, the question still is, why does this work for Jeff? :) Is there a way to get orted out of the way so the sleep processes are launched directly by srun?

brent




From: users-***@open-mpi.org [mailto:users-***@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, February 24, 2011 10:05 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime

I would talk to the slurm folks about it - I don't know anything about the internals of HP-MPI, but I do know the relevant OMPI internals. OMPI doesn't do anything with respect to the envars. We just use "srun -hostlist <fff>" to launch the daemons. Each daemon subsequently gets a message telling it what local procs to run, and then fork/exec's those procs. The environment set for those procs is a copy of that given to the daemon, including any and all slurm values.

So whatever slurm sets, your procs get.

My guess is that HP-MPI is doing something with the envars to create the difference.

As for running OMPI procs directly from srun: the slurm folks put out a faq (or its equivalent) on it, I believe. I don't recall the details (even though I wrote the integration...). If you google our user and/or devel mailing lists, though, you'll see threads discussing it. Look for "slurmd" in the text - that's the ORTE integration module for that feature.


On Thu, Feb 24, 2011 at 8:55 AM, Henderson, Brent <***@hp.com> wrote:
I'm running OpenMPI v1.4.3 and slurm v2.2.1. I built both with the default configuration except setting the prefix. The tests were run on the exact same nodes (I only have two).

When I run the test you outline below, I am still missing a bunch of env variables with OpenMPI. I ran the extra test of using HP-MPI and they are all present as with the srun invocation. I don't know if this is my slurm setup or not, but I find this really weird. If anyone knows the magic to make the fix that Ralph is referring to, I'd appreciate a pointer.

My guess was that there is a subtle way that the launch differs between the two products. But, since it works for Jeff, maybe there really is a slurm option that I need to compile in or set to make this work the way I want. It is not as simple as HP-MPI moving the environment variables itself as some of the numbers will change per process created on the remote nodes.

Thanks,

Brent

[***@node2 mpi]$ salloc -N 2
salloc: Granted job allocation 29
[***@node2 mpi]$ srun env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=1(x2)
SLURM_JOB_ID=29
[***@node2 mpi]$ srun env | egrep ^SLURM_ | wc -l
66
[***@node2 mpi]$ srun env | egrep ^SLURM_ | sort > srun.out
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | head
SLURM_NODELIST=node[1-2]
SLURM_NNODES=2
SLURM_JOBID=29
SLURM_TASKS_PER_NODE=8(x2)
SLURM_JOB_ID=29
SLURM_SUBMIT_DIR=/mnt/node1/home/brent/src/mpi
SLURM_JOB_NODELIST=node[1-2]
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_JOB_NUM_NODES=2
SLURM_NODELIST=node[1-2]
[***@node2 mpi]$ which mpirun
~/bin/openmpi143/bin/mpirun
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | wc -l
42 <-- note, not 66 as above!
[***@node2 mpi]$ mpirun -np 2 --bynode env | egrep ^SLURM_ | sort > mpirun.out
[***@node2 mpi]$ diff srun.out mpirun.out
2d1
< SLURM_CHECKPOINT_IMAGE_DIR=/mnt/node1/home/brent/src/mpi
4,5d2
< SLURM_CPUS_ON_NODE=8
< SLURM_CPUS_PER_TASK=1
8d4
< SLURM_DISTRIBUTION=cyclic
10d5
< SLURM_GTIDS=1
22,23d16
< SLURM_LAUNCH_NODE_IPADDR=10.0.205.134
< SLURM_LOCALID=0
25c18
< SLURM_NNODES=2
---
> SLURM_NNODES=1
28d20
< SLURM_NODEID=1
31,35c23,24
< SLURM_NPROCS=2
< SLURM_NPROCS=2
< SLURM_NTASKS=2
< SLURM_NTASKS=2
< SLURM_PRIO_PROCESS=0
---
> SLURM_NPROCS=1
> SLURM_NTASKS=1
38d26
< SLURM_PROCID=1
40,56c28,35
< SLURM_SRUN_COMM_HOST=10.0.205.134
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NODELIST=node[1-2]
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_NODES=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_NUM_TASKS=2
< SLURM_STEP_TASKS_PER_NODE=1(x2)
< SLURM_STEP_TASKS_PER_NODE=1(x2)
---
> SLURM_SRUN_COMM_PORT=45154
> SLURM_STEP_ID=5
> SLURM_STEPID=5
> SLURM_STEP_LAUNCHER_PORT=45154
> SLURM_STEP_NODELIST=node1
> SLURM_STEP_NUM_NODES=1
> SLURM_STEP_NUM_TASKS=1
> SLURM_STEP_TASKS_PER_NODE=1
59,62c38,40
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
< SLURM_TASKS_PER_NODE=1(x2)
< SLURM_TASKS_PER_NODE=1(x2)
---
> SLURM_TASK_PID=1429
> SLURM_TASKS_PER_NODE=1
> SLURM_TASKS_PER_NODE=8(x2)
64,65d41
< SLURM_TOPOLOGY_ADDR=node2
< SLURM_TOPOLOGY_ADDR_PATTERN=node
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$ /opt/hpmpi/bin/mpirun -srun -n 2 -N 2 env | egrep ^SLURM_ | sort > hpmpi.out
[***@node2 mpi]$ diff srun.out hpmpi.out
20a21,22
> SLURM_KILL_BAD_EXIT=1
> SLURM_KILL_BAD_EXIT=1
41,48c43,50
< SLURM_SRUN_COMM_PORT=43247
< SLURM_SRUN_COMM_PORT=43247
< SLURM_STEP_ID=2
< SLURM_STEP_ID=2
< SLURM_STEPID=2
< SLURM_STEPID=2
< SLURM_STEP_LAUNCHER_PORT=43247
< SLURM_STEP_LAUNCHER_PORT=43247
---
> SLURM_SRUN_COMM_PORT=33347
> SLURM_SRUN_COMM_PORT=33347
> SLURM_STEP_ID=8
> SLURM_STEP_ID=8
> SLURM_STEPID=8
> SLURM_STEPID=8
> SLURM_STEP_LAUNCHER_PORT=33347
> SLURM_STEP_LAUNCHER_PORT=33347
59,60c61,62
< SLURM_TASK_PID=1381
< SLURM_TASK_PID=2288
---
> SLURM_TASK_PID=1592
> SLURM_TASK_PID=2590
[***@node2 mpi]$
[***@node2 mpi]$
[***@node2 mpi]$ grep SLURM_PROCID srun.out
SLURM_PROCID=0
SLURM_PROCID=1
[***@node2 mpi]$ grep SLURM_PROCID mpirun.out
SLURM_PROCID=0
[***@node2 mpi]$ grep SLURM_PROCID hpmpi.out
SLURM_PROCID=0
SLURM_PROCID=1
Post by Henderson, Brent
-----Original Message-----
Behalf Of Jeff Squyres
Sent: Thursday, February 24, 2011 9:31 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime
The weird thing is that when running his test, he saw different results
with HP MPI vs. Open MPI.
What his test didn't say was whether those were the same exact nodes or
not. It would be good to repeat my experiment with the same exact
nodes (e.g., inside one SLURM salloc job, or use the -w param to
specify the same nodes for salloc for OMPI and srun for HP MPI).
Post by Ralph Castain
Like I said, this isn't an OMPI problem. You have your slurm
configured to pass certain envars to the remote nodes, and Brent
doesn't. It truly is just that simple.
Post by Ralph Castain
I've seen this before with other slurm installations. Which envars
get set on the backend is configurable, that's all.
Post by Ralph Castain
Has nothing to do with OMPI.
$ salloc -N 4
$ srun env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
SLURM_PRIO_PROCESS=0
SLURM_UMASK=0002
$ srun env | egrep ^SLURM_ | wc -l
144
Good -- there's 144 of them. Let's save them to a file for
comparison, later.
Post by Ralph Castain
$ srun env | egrep ^SLURM_ | sort > srun.out
Now let's repeat the process with mpirun. Note that mpirun defaults
to running one process per core (vs. srun's default of running one per
node). So let's tone mpirun down to use one process per node and look
for the SLURM_ env variables.
Post by Ralph Castain
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | head
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_NODELIST=svbu-mpi[001-004]
SLURM_JOB_ID=95523
SLURM_JOB_NUM_NODES=4
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_JOBID=95523
SLURM_NNODES=4
SLURM_NODELIST=svbu-mpi[001-004]
SLURM_TASKS_PER_NODE=1(x4)
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144
Good -- we also got 144. Save them to a file.
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | sort > mpirun.out
$ diff srun.out mpirun.out
93,108c93,108
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_SRUN_COMM_PORT=33571
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEP_ID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEPID=15
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
< SLURM_STEP_LAUNCHER_PORT=33571
---
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_SRUN_COMM_PORT=54184
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEP_ID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEPID=18
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
> SLURM_STEP_LAUNCHER_PORT=54184
125,128c125,128
< SLURM_TASK_PID=3899
< SLURM_TASK_PID=3907
< SLURM_TASK_PID=3908
< SLURM_TASK_PID=3997
---
> SLURM_TASK_PID=3924
> SLURM_TASK_PID=3933
> SLURM_TASK_PID=3934
> SLURM_TASK_PID=4039
$
They're identical except for per-step values (ports, PIDs, etc.) --
these differences are expected.
Post by Ralph Castain
What version of OMPI are you running? What happens if you repeat
this experiment?
Post by Ralph Castain
I would find it very strange if Open MPI's mpirun is filtering some
SLURM env variables to some processes and not to all -- your output
shows disparate output between the different processes. That's just
plain weird.
SLURM_NODEID\|SLURM_PROCID\|SLURM_LOCALID | sort
Post by Ralph Castain
SLURM_LOCALID=0
SLURM_LOCALID=0
SLURM_LOCALID=1
SLURM_LOCALID=1
SLURM_NODEID=0
SLURM_NODEID=0
SLURM_NODEID=1
SLURM_NODEID=1
SLURM_PROCID=0
SLURM_PROCID=1
SLURM_PROCID=2
SLURM_PROCID=3
Since srun is not supported currently by OpenMPI, I have to use
salloc - right? In this case, it is up to OpenMPI to interpret the
SLURM environment variables it sees in the one process that is launched
and 'do the right thing' - whatever that means in this case. How does
OpenMPI start the processes on the remote nodes under the covers? (use
srun, generate a hostfile and launch as you would outside SLURM, ...)
This may be the difference between HP-MPI and OpenMPI.
Post by Ralph Castain
Thanks,
Brent
Jeff Squyres
2011-02-24 16:20:25 UTC
Permalink
Post by Henderson, Brent
Note that the parent of the sleep processes is orted and that orted was started by slurmstepd. Unless orted is updating the slurm variables for the children (which is doubtful) then they will not contain the specific settings that I see when I run srun directly.
I'm not sure what you mean by that statement. The orted passes its environment to its children; so whatever the slurm stepd set in the environment for the orted, the children should be getting.

Clearly, something is different here -- maybe we do have a bug -- but as you stated below, why does it work for me? Is SLURM 2.2.x the difference? I don't know.
Post by Henderson, Brent
Now, the question still is, why does this work for Jeff? :) Is there a way to get orted out of the way so the sleep processes are launched directly by srun?
Yes; see Ralph's prior mail about direct srun support in Open MPI 1.5.x. You lose some functionality / features that way, though.
--
Jeff Squyres
***@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Henderson, Brent
2011-02-24 19:59:30 UTC
Permalink
Post by Henderson, Brent
-----Original Message-----
Behalf Of Jeff Squyres
Sent: Thursday, February 24, 2011 10:20 AM
To: Open MPI Users
Subject: Re: [OMPI users] SLURM environment variables at runtime
Post by Henderson, Brent
Note that the parent of the sleep processes is orted and that orted
was started by slurmstepd. Unless orted is updating the slurm
variables for the children (which is doubtful) then they will not
contain the specific settings that I see when I run srun directly.
I'm not sure what you mean by that statement. The orted passes its
environment to its children; so whatever the slurm stepd set in the
environment for the orted, the children should be getting.
While you are correct that the environment is inherited by the children, sometimes that does not make sense. Take SLURM_PROCID, for example. If slurmstepd starts the orted and sets its SLURM_PROCID, then the children sleep processes (of orted) get that value as well, exactly as it is in orted. That is clearly misleading at best. For example:

[***@node2 mpi]$ mpirun -np 4 --bynode sleep 300

Then looking at the remote node:

[***@node1 mpi]$ ps -fu brent
UID PID PPID C STIME TTY TIME CMD
brent 2853 2850 0 13:23 ? 00:00:00 /mnt/node1/home/brent/bin/openmpi143/bin/orted -mca
brent 2856 2853 0 13:23 ? 00:00:00 sleep 300
brent 2857 2853 0 13:23 ? 00:00:00 sleep 300
(snip)

And the SLURM_PROCID from each process:

[***@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2853/environ | egrep ^SLURM_ | grep PROCID
SLURM_PROCID=0
[***@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2856/environ | egrep ^SLURM_ | grep PROCID
SLURM_PROCID=0
[***@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2857/environ | egrep ^SLURM_ | grep PROCID
SLURM_PROCID=0
[***@node1 mpi]$

They really can't all be SLURM_PROCID=0 - that is supposed to be unique within the job - right? It appears that SLURM_PROCID is inherited from the orted parent - which makes a fair amount of sense given how things are launched. With HP-MPI, slurmstepd starts each of the sleep processes itself, and it does set SLURM_PROCID uniquely when launching each child. This is the crux of my issue.

I did find that there are OMPI_* variables that I can map internally back to what I think the slurm variables should be:

[***@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2853/environ | egrep ^OMPI | grep WORLD
[***@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2856/environ | egrep ^OMPI | grep WORLD
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=0
[***@node1 mpi]$ perl -p -e 's/\0/\n/g' /proc/2857/environ | egrep ^OMPI | grep WORLD
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1
[***@node1 mpi]$

So, I think if I combined some OMPI_* things with SLURM_* things, I should be o.k. for what I need.
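Something along these lines is what I have in mind - just a sketch, and the choice of which OMPI_* variable stands in for which SLURM_* variable is my own guess:

/* Rough sketch of the mapping I have in mind - my own guess, nothing official.
 * Prefer the per-rank OMPI_* values under mpirun and fall back to SLURM_*. */
#include <stdio.h>
#include <stdlib.h>

static int env_int(const char *name, int fallback)
{
    const char *v = getenv(name);
    return v ? atoi(v) : fallback;
}

int main(void)
{
    int rank  = env_int("OMPI_COMM_WORLD_RANK",       env_int("SLURM_PROCID",  0));
    int size  = env_int("OMPI_COMM_WORLD_SIZE",       env_int("SLURM_NPROCS",  1));
    int local = env_int("OMPI_COMM_WORLD_LOCAL_RANK", env_int("SLURM_LOCALID", 0));

    printf("rank=%d size=%d localid=%d\n", rank, size, local);
    return 0;
}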

Now to answer the other question - why some variables are missing. It appears that when the orted processes are launched - via srun, but only one per node - that job step is a subset of the main allocation, and thus some of the environment variables are not the same (or are missing entirely) compared to launching the processes directly with srun across the full allocation. This also makes sense to me at some level, so I'm at peace with it now. :)
Post by Henderson, Brent
Clearly, something is different here -- maybe we do have a bug -- but
as you stated below, why does it work for me? Is SLURM 2.2.x the
difference? I don't know.
I'm tempted to try the older version of slurm as this might be the cause of the missing environment variables, but that is an experiment for another day. I'll see if I can make do with what I see currently.
Post by Henderson, Brent
Post by Henderson, Brent
Now, the question still is, why does this work for Jeff? :) Is
there a way to get orted out of the way so the sleep processes are
launched directly by srun?
Yes; see Ralph's prior mail about direct srun support in Open MPI
1.5.x. You lose some functionality / features that way, though.
Maybe that will be an answer, but I'll see if I can make things work with 1.4.3 for now.

Last thing before I go. Please let me apologize for not being clear on what I disagreed with Ralph about in my last note. He nailed the orted launching process and spelled it out very clearly, but I don't believe that HP-MPI is doing anything special to copy/fix up the SLURM environment variables. Hopefully that was clear by the body of that message.

I think we are done here as I think I can make something work with the various environment variables now. Many thanks to Jeff and Ralph for their suggestions and insight on this issue!

Brent
Jeff Squyres
2011-02-24 21:08:06 UTC
Permalink
[snip]
They really can't be all SLURM_PROCID=0 - that is supposed to be unique for the job - right? It appears that the SLURM_PROCID is inherited from the orted parent - which makes a fair amount of sense given how things are launched.
That's correct, and I can agree with your sentiment.

However, our design goals were to provide a consistent *Open MPI* experience across different launchers. Providing native access to the actual underlying launcher was a secondary goal. Balancing those two, you can see why we chose the model we did: our orted provides (nearly) the same functionality across all environments.

In SLURM's case, we propagate [seemingly] non-sensical SLURM_PROCID values to the individual processes -- but they only look non-sensical if you are making an assumption about how Open MPI is using SLURM's launcher.

More specifically, our goal is to provide consistent *Open MPI information* (e.g., through the OMPI_COMM_WORLD* env variables) -- not emulate what SLURM would have done if MPI processes had been launched individually through srun. Even more specifically: we don't think that the exact underlying launching mechanism that OMPI uses is of interest to most users; we encourage them to use our portable mechanisms that work even if they move to another cluster with a different launcher. Admittedly, that does make it a little more challenging if you have to support multiple MPI implementations, and although that's an important consideration to us, it's not our first priority.
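For instance, a trivial program like this (just a sketch) reports the same rank information whether mpirun launched it under SLURM, another RM, or plain rsh -- that's the consistency we're after. OMPI_COMM_WORLD_RANK is the env variable that showed up in your output; the MPI calls are portable everywhere:

/* Sketch: the OMPI-provided info matches what MPI itself reports,
 * no matter which launcher/RM started the job. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *erank = getenv("OMPI_COMM_WORLD_RANK");  /* set by mpirun/orted */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("MPI says rank %d of %d; OMPI_COMM_WORLD_RANK=%s\n",
           rank, size, erank ? erank : "(unset)");
    MPI_Finalize();
    return 0;
}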
Now to answer the other question - why are there some variables missing. It appears that when the orted processes are launched - via srun but only one per node, it is a subset of the main allocation and thus some of the environment variables are not the same (or missing entirely) as compared to launching them directly with srun on the full allocation. This also makes sense to me at some level, so I'm at peace with it now. :)
Ah, good.
Last thing before I go. Please let me apologize for not being clear on what I disagreed with Ralph about in my last note. He nailed the orted launching process and spelled it out very clearly, but I don't believe that HP-MPI is doing anything special to copy/fix up the SLURM environment variables. Hopefully that was clear by the body of that message.
No worries; you were perfectly clear. Thanks!
--
Jeff Squyres
***@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Ralph Castain
2011-02-25 01:25:55 UTC
Permalink
I guess I wasn't clear earlier - I don't know anything about how HP-MPI
works. I was only theorizing that perhaps they did something different that
results in some other slurm vars showing up in Brent's tests. From Brent's
comments, I guess they don't - but they launch jobs in a different manner
that results in some difference in the slurm envars seen by application
procs.

I don't believe we have a bug in OMPI. What we have is behavior that
reflects how the proc is launched. If an app has integrated itself tightly
with slurm, then OMPI may not be a good choice - or they can try the
"slurm-direct" launch method in 1.5 and see if that meets their needs.

There may be something going on with slurm 2.2.x - as I've said before,
slurm makes major changes in even minor releases, and trying to track them
is a nearly impossible task, especially as many of these features are
configuration dependent. What we have in OMPI is the level of slurm
integration required by the three DOE weapons labs as (a) they represent the
largest component of the very small slurm community, and (b) in the past,
they provided the majority of the slurm integration effort within ompi. It
works as they need it to, given the way they configure slurm (which may not
be how others do).

I'm always willing to help other slurm users, but within the guidelines
expressed in an earlier thread - the result must be driven by the DOE
weapons lab's requirements, and cannot interfere with their usage models.

As for slurm_procid - if an application is looking for it, it sounds like
OMPI may not be a good choice for them. Under OMPI, slurm does not see
the application procs and has no idea they exist. Slurm's knowledge of an
OMPI job is limited solely to the daemons. This has tradeoffs, as most
design decisions do - in the case of the DOE labs, the tradeoffs were judged
favorable...at least, as far as LANL was concerned, and they were my boss
when I wrote the code :-) At LLNL's request, I did create the ability to run
jobs directly under srun - but as Jeff noted, with reduced capability.

Hope that helps clarify what is in the code, and why. I'm not sure what
motivated the original question, but hopefully ompi's slurm support is a
little bit clearer?

Ralph
Jeff Squyres
2011-02-24 16:16:42 UTC
Permalink
FWIW, I'm running Slurm 2.1.0 -- I haven't updated to 2.2.x. yet.

Just to be sure, I re-ran my test with OMPI 1.4.3 (I was using the OMPI development SVN trunk before) and got the same results:

----
$ srun env | egrep ^SLURM_ | wc -l
144
$ mpirun -np 4 --bynode env | egrep ^SLURM_ | wc -l
144
----

I find it strange that "srun env ..." and HPMPI's "mpirun env..." return (effectively) the same results, but OMPI's "mpirun env ..." returns something different.

Perhaps SLURM changed something in 2.2.x...? As Ralph mentioned, OMPI *shouldn't* be altering the environment w.r.t. SLURM variables that you get -- whatever SLURM sets, that's what you should get in an OMPI-launched process.
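(For reference, the per-process check being discussed boils down to something like this little program -- the same idea as piping env through egrep ^SLURM_:)

/* Just a sketch of the test-program idea: print every SLURM_* variable
 * that the launched process actually sees. */
#include <stdio.h>
#include <string.h>

extern char **environ;

int main(void)
{
    for (char **e = environ; *e != NULL; e++) {
        if (strncmp(*e, "SLURM_", 6) == 0)
            printf("%s\n", *e);
    }
    return 0;
}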
--
Jeff Squyres
***@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/