[OMPI users] what was the rationale behind rank mapping by socket?

Discussion:

David Shrader

2016-09-29 18:04:34 UTC

Hello All,

Would anyone know why the default mapping scheme is socket for jobs with
more than 2 ranks? Would they be able to please take some time and
explain the reasoning? Please note I am not railing against the
decision, but rather trying to gather as much information about it as I
can so as to be able to better work with my users who are just now
starting to ask questions about it. The FAQ pretty much pushes folks to
the man pages, and the mpirun man page doesn't go in to the reasoning.

Thank you for your time,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader <at> lanl.gov

Gilles Gouaillardet

2016-09-30 01:52:17 UTC

Permalink

David,

i guess you would have expected the default mapping/binding scheme is
core instead of sockets

iirc, we decided *not* to bind to cores by default because it is "safer"

if you simply
OMP_NUM_THREADS=8 mpirun -np 2 a.out

then, a default mapping/binding scheme by core means the OpenMP threads
end up doing time sharing.

this is an honest mistake (8 cores per task were not requested), so
having a default mapping/binding scheme by socket means

OpenMP threads are spread on the socket and will likely not do time sharing.

/* if you run on a single socket, or if you run 4 tasks on a dual socket
nodes, then (some) tasks do share the socket,

and depending on how the OpenMP runtime is implemented, two threads of
two distinct tasks could end up bound/running on the same core */

Cheers,

Gilles

Post by David Shrader
Hello All,
Would anyone know why the default mapping scheme is socket for jobs
with more than 2 ranks? Would they be able to please take some time
and explain the reasoning? Please note I am not railing against the
decision, but rather trying to gather as much information about it as
I can so as to be able to better work with my users who are just now
starting to ask questions about it. The FAQ pretty much pushes folks
to the man pages, and the mpirun man page doesn't go in to the reasoning.
Thank you for your time,
David

Bennet Fauber

2016-09-30 01:55:59 UTC

Permalink

Pardon my naivete, but why is bind-to-none not the default, and if the
user wants to specify something, they can then get into trouble
knowingly? We have had all manner of problems with binding when using
cpusets/cgroups.

-- bennet

Post by Gilles Gouaillardet
David,
i guess you would have expected the default mapping/binding scheme is core
instead of sockets
iirc, we decided *not* to bind to cores by default because it is "safer"
if you simply
OMP_NUM_THREADS=8 mpirun -np 2 a.out
then, a default mapping/binding scheme by core means the OpenMP threads end
up doing time sharing.
this is an honest mistake (8 cores per task were not requested), so having a
default mapping/binding scheme by socket means
OpenMP threads are spread on the socket and will likely not do time sharing.
/* if you run on a single socket, or if you run 4 tasks on a dual socket
nodes, then (some) tasks do share the socket,
and depending on how the OpenMP runtime is implemented, two threads of two
distinct tasks could end up bound/running on the same core */
Cheers,
Gilles

Post by David Shrader
Hello All,
Would anyone know why the default mapping scheme is socket for jobs with
more than 2 ranks? Would they be able to please take some time and explain
the reasoning? Please note I am not railing against the decision, but rather
trying to gather as much information about it as I can so as to be able to
better work with my users who are just now starting to ask questions about
it. The FAQ pretty much pushes folks to the man pages, and the mpirun man
page doesn't go in to the reasoning.
Thank you for your time,
David

_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Gilles Gouaillardet

2016-09-30 02:55:42 UTC

Permalink

Bennet,

my guess is mapping/binding to sockets was deemed the best compromise
from an

"out of the box" performance point of view.

iirc, we did fix some bugs that occured when running under asymmetric
cpusets/cgroups.

if you still have some issues with the latest Open MPI version (2.0.1)
and the default policy,

could you please describe them ?

Cheers,

Gilles

Post by Bennet Fauber
Pardon my naivete, but why is bind-to-none not the default, and if the
user wants to specify something, they can then get into trouble
knowingly? We have had all manner of problems with binding when using
cpusets/cgroups.
-- bennet

Post by David Shrader
Hello All,
Would anyone know why the default mapping scheme is socket for jobs with
more than 2 ranks? Would they be able to please take some time and explain
the reasoning? Please note I am not railing against the decision, but rather
trying to gather as much information about it as I can so as to be able to
better work with my users who are just now starting to ask questions about
it. The FAQ pretty much pushes folks to the man pages, and the mpirun man
page doesn't go in to the reasoning.
Thank you for your time,
David

_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Dave Love

2016-10-11 15:09:37 UTC

Permalink

Post by Gilles Gouaillardet
Bennet,
my guess is mapping/binding to sockets was deemed the best compromise
from an
"out of the box" performance point of view.
iirc, we did fix some bugs that occured when running under asymmetric
cpusets/cgroups.
if you still have some issues with the latest Open MPI version (2.0.1)
and the default policy,
could you please describe them ?

r***@open-mpi.org

2016-10-28 23:18:10 UTC

Permalink

FWIW: Iâll be presenting âMapping, Ranking, and Binding - Oh My!â at the OMPI BoF meeting at SCâ16, for those who can attend. Will try to explain the rationale as well as the mechanics of the options

Post by Dave Love

I also don't understand why binding to sockets is the right thing to do.
Binding to cores seems the right default to me, and I set that locally,
with instructions about running OpenMP. (Isn't that what other
implementations do, which makes them look better?)
I think at least numa should be used, rather than socket. Knights
Landing, for instance, is single-socket, so no gets no actual binding by
default.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users <https://rfd.newmexicoconsortium.org/mailman/listinfo/users>

Bennet Fauber

2016-10-28 23:50:49 UTC

Permalink

Ralph,

Alas, I will not be at SC16. I would like to hear and/or see what you
present, so if it gets made available in alternate format, I'd
appreciated know where and how to get it.

I am more and more coming to think that our cluster configuration is
essentially designed to frustrated MPI developers because we use the
scheduler to create cgroups (once upon a time, cpusets) for subsets of
cores on multisocket machines, and I think that invalidates a lot of
the assumptions that are getting made by people who want to bind to
particular patters.

It's our foot, and we have been doing a good job of shooting it. ;-)

-- bennet

FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the
OMPI BoF meeting at SC’16, for those who can attend. Will try to explain the
rationale as well as the mechanics of the options
Bennet,
my guess is mapping/binding to sockets was deemed the best compromise
from an
"out of the box" performance point of view.
iirc, we did fix some bugs that occured when running under asymmetric
cpusets/cgroups.
if you still have some issues with the latest Open MPI version (2.0.1)
and the default policy,
could you please describe them ?
I also don't understand why binding to sockets is the right thing to do.
Binding to cores seems the right default to me, and I set that locally,
with instructions about running OpenMP. (Isn't that what other
implementations do, which makes them look better?)
I think at least numa should be used, rather than socket. Knights
Landing, for instance, is single-socket, so no gets no actual binding by
default.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

r***@open-mpi.org

2016-10-29 00:40:49 UTC

Permalink

Yes, I’ve been hearing a growing number of complaints about cgroups for that reason. Our mapping/ranking/binding options will work with the cgroup envelope, but it generally winds up with a result that isn’t what the user wanted or expected.

We always post the OMPI BoF slides on our web site, and we’ll do the same this year. I may try to record webcast on it and post that as well since I know it can be confusing given all the flexibility we expose.

In case you haven’t read it yet, here is the relevant section from “man mpirun”:

Mapping, Ranking, and Binding: Oh My!
Open MPI employs a three-phase procedure for assigning process locations and ranks:

mapping Assigns a default location to each process

ranking Assigns an MPI_COMM_WORLD rank value to each process

binding Constrains each process to run on specific processors

The mapping step is used to assign a default location to each process based on the mapper being employed. Mapping by slot, node, and sequentially results in the
assignment of the processes to the node level. In contrast, mapping by object, allows the mapper to assign the process to an actual object on each node.

Note: the location assigned to the process is independent of where it will be bound - the assignment is used solely as input to the binding algorithm.

The mapping of process processes to nodes can be defined not just with general policies but also, if necessary, using arbitrary mappings that cannot be described by
a simple policy. One can use the "sequential mapper," which reads the hostfile line by line, assigning processes to nodes in whatever order the hostfile specifies.
Use the -mca rmaps seq option. For example, using the same hostfile as before:

mpirun -hostfile myhostfile -mca rmaps seq ./a.out

will launch three processes, one on each of nodes aa, bb, and cc, respectively. The slot counts don't matter; one process is launched per line on whatever node is
listed on the line.

Another way to specify arbitrary mappings is with a rankfile, which gives you detailed control over process binding as well. Rankfiles are discussed below.

The second phase focuses on the ranking of the process within the job's MPI_COMM_WORLD. Open MPI separates this from the mapping procedure to allow more flexibility
in the relative placement of MPI processes. This is best illustrated by considering the following two cases where we used the —map-by ppr:2:socket option:

node aa node bb

rank-by core 0 1 ! 2 3 4 5 ! 6 7

rank-by socket 0 2 ! 1 3 4 6 ! 5 7

rank-by socket:span 0 4 ! 1 5 2 6 ! 3 7

Ranking by core and by slot provide the identical result - a simple progression of MPI_COMM_WORLD ranks across each node. Ranking by socket does a round-robin rank‐
ing within each node until all processes have been assigned an MCW rank, and then progresses to the next node. Adding the span modifier to the ranking directive
causes the ranking algorithm to treat the entire allocation as a single entity - thus, the MCW ranks are assigned across all sockets before circling back around to
the beginning.

The binding phase actually binds each process to a given set of processors. This can improve performance if the operating system is placing processes suboptimally.
For example, it might oversubscribe some multi-core processor sockets, leaving other sockets idle; this can lead processes to contend unnecessarily for common
resources. Or, it might spread processes out too widely; this can be suboptimal if application performance is sensitive to interprocess communication costs. Bind‐
ing can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.

The processors to be used for binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors
within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a —bind-to l3cache
directive will cause the process to be bound to the processors that share a single L3 cache within that socket.

To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a
job is mapped to the socket level, and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding
algorithm will assign each process located to a socket to a unique core in a round-robin manner.

Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located. In this manner,
users can exert detailed control over relative MCW rank location and binding.

Finally, --report-bindings can be used to report bindings.

As an example, consider a node with two processor sockets, each comprising four cores. We run mpirun with -np 4 --report-bindings and the following additional
options:

% mpirun ... --map-by core --bind-to core
[...] ... binding child [...,0] to cpus 0001
[...] ... binding child [...,1] to cpus 0002
[...] ... binding child [...,2] to cpus 0004
[...] ... binding child [...,3] to cpus 0008

% mpirun ... --map-by socket --bind-to socket
[...] ... binding child [...,0] to socket 0 cpus 000f
[...] ... binding child [...,1] to socket 1 cpus 00f0
[...] ... binding child [...,2] to socket 0 cpus 000f
[...] ... binding child [...,3] to socket 1 cpus 00f0

% mpirun ... --map-by core:PE=2 --bind-to core
[...] ... binding child [...,0] to cpus 0003
[...] ... binding child [...,1] to cpus 000c
[...] ... binding child [...,2] to cpus 0030
[...] ... binding child [...,3] to cpus 00c0

% mpirun ... --bind-to none

Here, --report-bindings shows the binding of each process as a mask. In the first case, the processes bind to successive cores as indicated by the masks 0001, 0002,
0004, and 0008. In the second case, processes bind to all cores on successive sockets as indicated by the masks 000f and 00f0. The processes cycle through the pro‐
cessor sockets in a round-robin fashion as many times as are needed. In the third case, the masks show us that 2 cores have been bound per process. In the fourth
case, binding is turned off and no bindings are reported.

Open MPI's support for process binding depends on the underlying operating system. Therefore, certain process binding options may not be available on every system.

Process binding can also be set with MCA parameters. Their usage is less convenient than that of mpirun options. On the other hand, MCA parameters can be set not
only on the mpirun command line, but alternatively in a system or user mca-params.conf file or as environment variables, as described in the MCA section below. Some
examples include:

mpirun option MCA parameter key value

--map-by core rmaps_base_mapping_policy core
--map-by socket rmaps_base_mapping_policy socket
--rank-by core rmaps_base_ranking_policy core
--bind-to core hwloc_base_binding_policy core
--bind-to socket hwloc_base_binding_policy socket
--bind-to none hwloc_base_binding_policy none

Post by Bennet Fauber
Ralph,
Alas, I will not be at SC16. I would like to hear and/or see what you
present, so if it gets made available in alternate format, I'd
appreciated know where and how to get it.
I am more and more coming to think that our cluster configuration is
essentially designed to frustrated MPI developers because we use the
scheduler to create cgroups (once upon a time, cpusets) for subsets of
cores on multisocket machines, and I think that invalidates a lot of
the assumptions that are getting made by people who want to bind to
particular patters.
It's our foot, and we have been doing a good job of shooting it. ;-)
-- bennet

_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Bennet Fauber

2016-10-29 12:27:31 UTC

Permalink

Thanks, Ralph,

A video would be great to accompany the slides!

I hope you have a good and productive SC16.

-- bennet

Post by r***@open-mpi.org
Yes, I’ve been hearing a growing number of complaints about cgroups for that reason. Our mapping/ranking/binding options will work with the cgroup envelope, but it generally winds up with a result that isn’t what the user wanted or expected.
We always post the OMPI BoF slides on our web site, and we’ll do the same this year. I may try to record webcast on it and post that as well since I know it can be confusing given all the flexibility we expose.
Mapping, Ranking, and Binding: Oh My!
mapping Assigns a default location to each process
ranking Assigns an MPI_COMM_WORLD rank value to each process
binding Constrains each process to run on specific processors
The mapping step is used to assign a default location to each process based on the mapper being employed. Mapping by slot, node, and sequentially results in the
assignment of the processes to the node level. In contrast, mapping by object, allows the mapper to assign the process to an actual object on each node.
Note: the location assigned to the process is independent of where it will be bound - the assignment is used solely as input to the binding algorithm.
The mapping of process processes to nodes can be defined not just with general policies but also, if necessary, using arbitrary mappings that cannot be described by
a simple policy. One can use the "sequential mapper," which reads the hostfile line by line, assigning processes to nodes in whatever order the hostfile specifies.
mpirun -hostfile myhostfile -mca rmaps seq ./a.out
will launch three processes, one on each of nodes aa, bb, and cc, respectively. The slot counts don't matter; one process is launched per line on whatever node is
listed on the line.
Another way to specify arbitrary mappings is with a rankfile, which gives you detailed control over process binding as well. Rankfiles are discussed below.
The second phase focuses on the ranking of the process within the job's MPI_COMM_WORLD. Open MPI separates this from the mapping procedure to allow more flexibility
node aa node bb
rank-by core 0 1 ! 2 3 4 5 ! 6 7
rank-by socket 0 2 ! 1 3 4 6 ! 5 7
rank-by socket:span 0 4 ! 1 5 2 6 ! 3 7
Ranking by core and by slot provide the identical result - a simple progression of MPI_COMM_WORLD ranks across each node. Ranking by socket does a round-robin rank‐
ing within each node until all processes have been assigned an MCW rank, and then progresses to the next node. Adding the span modifier to the ranking directive
causes the ranking algorithm to treat the entire allocation as a single entity - thus, the MCW ranks are assigned across all sockets before circling back around to
the beginning.
The binding phase actually binds each process to a given set of processors. This can improve performance if the operating system is placing processes suboptimally.
For example, it might oversubscribe some multi-core processor sockets, leaving other sockets idle; this can lead processes to contend unnecessarily for common
resources. Or, it might spread processes out too widely; this can be suboptimal if application performance is sensitive to interprocess communication costs. Bind‐
ing can also keep the operating system from migrating processes excessively, regardless of how optimally those processes were placed to begin with.
The processors to be used for binding can be identified in terms of topological groupings - e.g., binding to an l3cache will bind each process to all processors
within the scope of a single L3 cache within their assigned location. Thus, if a process is assigned by the mapper to a certain socket, then a —bind-to l3cache
directive will cause the process to be bound to the processors that share a single L3 cache within that socket.
To help balance loads, the binding directive uses a round-robin method when binding to levels lower than used in the mapper. For example, consider the case where a
job is mapped to the socket level, and then bound to core. Each socket will have multiple cores, so if multiple processes are mapped to a given socket, the binding
algorithm will assign each process located to a socket to a unique core in a round-robin manner.
Alternatively, processes mapped by l2cache and then bound to socket will simply be bound to all the processors in the socket where they are located. In this manner,
users can exert detailed control over relative MCW rank location and binding.
Finally, --report-bindings can be used to report bindings.
As an example, consider a node with two processor sockets, each comprising four cores. We run mpirun with -np 4 --report-bindings and the following additional
% mpirun ... --map-by core --bind-to core
[...] ... binding child [...,0] to cpus 0001
[...] ... binding child [...,1] to cpus 0002
[...] ... binding child [...,2] to cpus 0004
[...] ... binding child [...,3] to cpus 0008
% mpirun ... --map-by socket --bind-to socket
[...] ... binding child [...,0] to socket 0 cpus 000f
[...] ... binding child [...,1] to socket 1 cpus 00f0
[...] ... binding child [...,2] to socket 0 cpus 000f
[...] ... binding child [...,3] to socket 1 cpus 00f0
% mpirun ... --map-by core:PE=2 --bind-to core
[...] ... binding child [...,0] to cpus 0003
[...] ... binding child [...,1] to cpus 000c
[...] ... binding child [...,2] to cpus 0030
[...] ... binding child [...,3] to cpus 00c0
% mpirun ... --bind-to none
Here, --report-bindings shows the binding of each process as a mask. In the first case, the processes bind to successive cores as indicated by the masks 0001, 0002,
0004, and 0008. In the second case, processes bind to all cores on successive sockets as indicated by the masks 000f and 00f0. The processes cycle through the pro‐
cessor sockets in a round-robin fashion as many times as are needed. In the third case, the masks show us that 2 cores have been bound per process. In the fourth
case, binding is turned off and no bindings are reported.
Open MPI's support for process binding depends on the underlying operating system. Therefore, certain process binding options may not be available on every system.
Process binding can also be set with MCA parameters. Their usage is less convenient than that of mpirun options. On the other hand, MCA parameters can be set not
only on the mpirun command line, but alternatively in a system or user mca-params.conf file or as environment variables, as described in the MCA section below. Some
mpirun option MCA parameter key value
--map-by core rmaps_base_mapping_policy core
--map-by socket rmaps_base_mapping_policy socket
--rank-by core rmaps_base_ranking_policy core
--bind-to core hwloc_base_binding_policy core
--bind-to socket hwloc_base_binding_policy socket
--bind-to none hwloc_base_binding_policy none

_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Dave Love

2016-11-07 16:58:03 UTC

Permalink

How? I don't understand as an implementor why there's a difference from
just resource manager core binding, assuming the programs don't try to
escape the binding. (I'm not saying there's nothing wrong with cgroups
in general...)

Post by r***@open-mpi.org
We always post the OMPI BoF slides on our web site, and we’ll do the same this year. I may try to record webcast on it and post that as well since I know it can be confusing given all the flexibility we expose.

I'm afraid I read that, and various versions of the code at different
times, and I've worked on resource manager core binding. I still had to
experiment to find a way to run mpi+openmp jobs correctly, in multiple
ompi versions. NEWS usually doesn't help, nor conference talks for
people who aren't there and don't know they should search beyond the
documentation. We don't even seem to be able to make reliable bug
reports as they may or may not get picked up here.

Regardless, I can't see how binding to socket can be a good default.